I don't think it would be hard to peel out the robots parser, although obviously it would need refactoring to live in a more standard library environment. If you want to look at it, it is in:
https://svn.apache.org/repos/asf/incubator/lcf/trunk/connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/RobotsManager.java

Look for the static class "RobotsData", around line 299.

Karl

On Thu, Jun 2, 2011 at 11:35 AM, Julien Nioche <[email protected]> wrote:
> Hi Karl,
>
> Maybe a good start would be to identify which parts of your crawler could be
> shared and would not take too much effort to be made generic. I haven't
> looked at the code of the crawler in great detail but do you think the
> robots parser would be a good candidate?
>
> Julien
>
> On 2 June 2011 16:23, Karl Wright <[email protected]> wrote:
>
>> Absolutely!
>> We're a bit thin on active committers at the moment, which will
>> probably limit our ability to take any highly active roles in your
>> development process. But we do have a pile of code which you might be
>> able to leverage, and once there is common functionality available I
>> think we'd all prefer to use that rather than home-grown code.
>>
>> How would you prefer that we proceed?
>>
>> Karl
>>
>>
>> On Thu, Jun 2, 2011 at 11:11 AM, Julien Nioche
>> <[email protected]> wrote:
>> > Hi guys,
>> >
>> > I'd just like to mention Crawler Commons which is an effort between the
>> > committers of various crawl-related projects (Nutch, Bixo or Heritrix) to
>> > put some basic functionalities in common. We currently have mostly a top
>> > level domain finder and a sitemap parser, but are definitely planning to
>> > have other things there as well, e.g. robots.txt parser, protocol handler
>> > etc...
>> >
>> > Would you like to get involved? There are quite a few things that the
>> > crawler in Manifold could reuse or contribute to.
>> >
>> > Best,
>> >
>> > Julien
>> >
>> > --
>> > *
>> > *Open Source Solutions for Text Engineering
>> >
>> > http://digitalpebble.blogspot.com/
>> > http://www.digitalpebble.com
>> >
>
>
> --
> *
> *Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>
