Forgot one thing: Has anyone run the crawler with the plugin on, fetching a substantial number of URLs, say 500,000? How does it perform?

John

On Tue, May 18, 2004 at 01:00:09PM -0700, [EMAIL PROTECTED] wrote:
> Hi, Doug,
>
> On Tue, May 18, 2004 at 09:32:57AM -0700, Doug Cutting wrote:
> > [EMAIL PROTECTED] wrote:
> > >I have found a way to have the redirect (from log4j to java's built-in)
> > >work.
> >
> > Great! Thanks!
> >
> > >A minor patch is thus attached. It should be applied only after
> > >the major patch (sent in my earlier mail) has been applied.
> >
> > Your patch and Stefan Groschupf's plugin patch conflict. How shall we
> > resolve this? I really want to get this functionality into Nutch.
> > Having now worked on the format-conversion problem more than I have,
> > what do you think of Stefan's plugin mechanism?
>
> (I have not tried Stefan's patch yet; the following is based on
> my skimming through his sources. I could be wrong.)
>
> Besides text stripping, my patch provides new capabilities/mechanisms
> at the indexing stage and in search output.
>
> In its current state, Stefan's plugin does text stripping only.
>
> For the text stripping part, I would not say there is a total conflict.
> His handles the content analysis on the fly.
> Mine has that done at a later stage, with the support of meta info saved
> in FetcherOutput.
>
> However, I am in favor of the Unix way: a tool should only
> do one task and do it well. The crawlers (Fetcher.java and
> RequestScheduler.java) need only concern themselves with going out
> to fetch URLs. Currently they do text stripping (on text/html),
> mostly for the purpose of outlink extraction. Since there are only
> a few file formats that have a meaningful number of embedded links worth
> harvesting, the benefit of having a full-blown plugin system in the crawler
> (for the sole purpose of outlink extraction) is not that great.
> This is not to say plugin systems are not needed by Nutch.
> I can imagine plugin systems being used by separate tools specialized in
> content analysis, clustering, etc.
>
> Stefan: Your earlier message mentioned that you may want to
> use the plugin system to do some magic stuff. Could you be more
> specific? Must it be done in the crawler?
>
> If a plugin system must be in the crawler, it had better be
> configurable via nutch-default.txt. Users should be able to switch it
> on and off.
>
> Give me 1 or 2 days; I might be able to offer more comments after
> trying the plugin patch.
>
> John

_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
