Thanks Julien; I found it (strange that I missed it)... Yes, I need to separate out the Robots Rules Parser, if BIXO agrees...
ManifoldCF's current throttling style:
1. Open socket
2. Load 500 kbits (in about 2 milliseconds)
3. Sleep 998 milliseconds

This happens just because there is a user interface where we set a bandwidth limit of 500 kbps (probably 50 kbytes). So it will be hard... I'd like to see HttpClient used instead... or, if crawler-commons includes a "fetcher", to use that... even better if the "fetcher" is rich enough to support POST (there was some interest in that at Droids). The existing code seems outdated: why should an external server allocate resources (a TCP connection and an HTTP handler) which are not used 99.8% of the time? (A rough sketch of the kind of paced read I have in mind is at the bottom of this message.)

But reusing the Robots Rules is most important; Nutch has some problems there too...

Thanks

-----Original Message-----
From: Julien Nioche [mailto:[email protected]]
Sent: June-03-11 7:01 AM
To: [email protected]; [email protected]
Subject: Re: CrawlerCommons & ManifoldCF

There is a link to the discussion group on the main page; becoming a member of the group is pretty straightforward.

On 3 June 2011 00:36, Fuad Efendi <[email protected]> wrote:
> I mean the "join" button at http://code.google.com/p/crawler-commons/
> I am well familiar with BIXO and Droids; it will be hard to make minor
> changes in ManifoldCF... although it's possible (without the "crawler"
> part, only the "robots rules parser")...
> -Fuad
>
>
> -----Original Message-----
> From: Fuad Efendi [mailto:[email protected]]
> Sent: June-02-11 7:05 PM
> To: [email protected];
> [email protected]
> Subject: RE: CrawlerCommons & ManifoldCF
>
> I'd like to join this project but can't find a "join" button :) Thanks!
>
> Fuad Efendi
> +1 416-993-2060
> http://www.linkedin.com/in/liferay
>
> Tokenizer Inc.
> http://www.tokenizer.ca/
> Data Mining, Vertical Search
>
> -----Original Message-----
> From: Julien Nioche [mailto:[email protected]]
> Sent: June-02-11 11:11 AM
> To: [email protected];
> [email protected]
> Subject: CrawlerCommons & ManifoldCF
>
> Hi guys,
>
> I'd just like to mention Crawler Commons, which is an effort between the
> committers of various crawl-related projects (Nutch, Bixo or Heritrix)
> to put some basic functionality in common. We currently have mostly
> a top level domain finder and a sitemap parser, but are definitely
> planning to have other things there as well, e.g. robots.txt parser,
> protocol handler etc...
>
> Would you like to get involved? There are quite a few things that the
> crawler in Manifold could reuse or contribute to.
>
> Best,
>
> Julien
>
> --
> *
> *Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>

--
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
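P.S. Here is a rough sketch of the kind of paced read I mean, based on HttpClient 4.x. It is purely illustrative (the class name, chunk size and the 62,500 bytes/second figure are my own assumptions, not ManifoldCF code): the response stream is read in small chunks and the thread sleeps only enough to stay under the target rate, instead of pulling the whole 500 kbits in a couple of milliseconds and then sleeping out the rest of the second.

    import java.io.ByteArrayOutputStream;
    import java.io.InputStream;

    import org.apache.http.HttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.DefaultHttpClient;

    public class PacedFetcher {

        // Illustrative limit: 500 kbps ~= 62,500 bytes/second
        // (whether the UI figure really means bits or bytes is a separate question)
        private static final long BYTES_PER_SECOND = 62500L;
        private static final int CHUNK_SIZE = 4096;

        public static byte[] fetch(String url) throws Exception {
            DefaultHttpClient client = new DefaultHttpClient();
            try {
                HttpResponse response = client.execute(new HttpGet(url));
                InputStream in = response.getEntity().getContent();
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[CHUNK_SIZE];
                long start = System.currentTimeMillis();
                long total = 0;
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                    total += n;
                    // How long the download *should* have taken so far at the target rate
                    long expected = total * 1000L / BYTES_PER_SECOND;
                    long elapsed = System.currentTimeMillis() - start;
                    // Sleep only enough to fall back to the rate, so the transfer is
                    // spread over the whole second instead of burst-then-sleep
                    if (expected > elapsed) {
                        Thread.sleep(expected - elapsed);
                    }
                }
                return out.toByteArray();
            } finally {
                client.getConnectionManager().shutdown();
            }
        }
    }

The point is only that pacing inside the read loop keeps the connection busy at a steady, low rate, rather than having the remote server hold a TCP connection and an HTTP handler that sit idle 99.8% of the time.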
