Re: [Nutch-dev] code for index content of mime type beyond text/html

john Tue, 18 May 2004 12:28:34 -0700

Forgot one thing:

Has anyone run the crawler with the plugin on, fetching a substantial
number of URLs, say 500,000? How does it perform?


John

On Tue, May 18, 2004 at 01:00:09PM -0700, [EMAIL PROTECTED] wrote:
> Hi, Doug,
> 
> On Tue, May 18, 2004 at 09:32:57AM -0700, Doug Cutting wrote:
> > [EMAIL PROTECTED] wrote:
> > >I have found a way to have the redirect (from log4j to java's built-in) 
> > >work.
> > 
> > Great!  Thanks!
> > 
> > >A minor patch is thus attached. It should be applied only after
> > >the major patch (sent in my earlier mail) has been applied.
> > 
> > Your patch and Stefan Groschupf's plugin patch conflict.  How shall we 
> > resolve this?  I really want to get this functionality into Nutch. 
> > Having now worked on the format-conversion problem more than I have, 
> > what do you think of Stefan's plugin mechanism?
> 
> (I have not tried Stefan's patch yet; the following is based on
> skimming through his sources. I could be wrong.)
> 
> Besides text stripping, my patch provides new capabilities/mechanisms
> at the indexing stage and in search output.
> 
> In its current state, Stefan's plugin does text stripping only.
> 
> For the text stripping part, I would not say there is a total conflict.
> His handles the content analysis on the fly.
> Mine has that done at a later stage, with the support of meta info saved
> in FetcherOutput.
> 
> However, I am in favor of the Unix way: a tool should do only
> one task and do it well. The crawlers (Fetcher.java and
> RequestScheduler.java) need only concern themselves with going out
> to fetch URLs. Currently they do text stripping (on text/html),
> mostly for the purpose of outlink extraction. Since only
> a few file formats have a meaningful number of embedded links worth
> harvesting, the benefit of having a full-blown plugin system in the crawler
> (for the sole purpose of outlink extraction) is not that great.
> This is not to say that plugin systems are not needed by Nutch.
> I can imagine plugin systems being used by separate tools specialized in
> content analysis, clustering, etc.
> 
> Stefan: Your earlier message mentioned that you may want to
> use the plugin system to do some magic stuff. Could you be more
> specific? Must it be done in the crawler?
> 
> If a plugin system must be in the crawler, it had better be
> configurable through nutch-default.txt. Users should be able to switch it
> on and off.
> 
> Give me 1 or 2 days; I might be able to offer more comments after
> trying the plugin patch.
> 
> John
> 


