I don't think it would be hard to peel out the robots parser, although
obviously it would need refactoring to live in a more standard library
environment.  If you want to look at it, it is in:

https://svn.apache.org/repos/asf/incubator/lcf/trunk/connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/RobotsManager.java

Look for the static class "RobotsData", around line 299.

Karl



On Thu, Jun 2, 2011 at 11:35 AM, Julien Nioche
<[email protected]> wrote:
> Hi Karl,
>
> Maybe a good start would be to identify which parts of your crawler could be
> shared and would not take too much effort to be made generic. I haven't
> looked at the code of the crawler in great detail, but do you think the
> robots parser would be a good candidate?
>
> Julien
>
> On 2 June 2011 16:23, Karl Wright <[email protected]> wrote:
>
>> Absolutely!
>> We're a bit thin on active committers at the moment, which will
>> probably limit our ability to take any highly active roles in your
>> development process.  But we do have a pile of code which you might be
>> able to leverage, and once there is common functionality available I
>> think we'd all prefer to use that rather than home-grown code.
>>
>> How would you prefer that we proceed?
>>
>> Karl
>>
>>
>> On Thu, Jun 2, 2011 at 11:11 AM, Julien Nioche
>> <[email protected]> wrote:
>> > Hi guys,
>> >
>> > I'd just like to mention Crawler Commons, which is an effort between the
>> > committers of various crawl-related projects (Nutch, Bixo or Heritrix) to
>> > put some basic functionalities in common. We currently have mostly a top
>> > level domain finder and a sitemap parser, but are definitely planning to
>> > have other things there as well, e.g. robots.txt parser, protocol handler
>> > etc...
>> >
>> > Would you like to get involved? There are quite a few things that the
>> > crawler in Manifold could reuse or contribute to.
>> >
>> > Best,
>> >
>> > Julien
>> >
>> > --
>> > *
>> > *Open Source Solutions for Text Engineering
>> >
>> > http://digitalpebble.blogspot.com/
>> > http://www.digitalpebble.com
>> >
>>
>
>
>
> --
> *
> *Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>
