Hi, Doug,

On Tue, May 18, 2004 at 01:37:46PM -0700, Doug Cutting wrote:
> [EMAIL PROTECTED] wrote:
> >
> >For text stripping part, I would not consider there is a total conflict.
> >His is more of handling the content analysis on the fly.
> >Mine is to have that done at late stage with support of meta info saved
> >in FetcherOutput.
> 
> John,
> 
> I like your metadata stuff, and don't want to lose that.  However you 
> make the architectural assumption that only HTML contains links, while, 
> e.g. PDF, msword and even plain text can too.

I just tried to keep the crawler no less capable than the current one.
Yes, links in PDF, Word documents, etc. should be harvested too, but
that had better be done in a separate tool.

>
> So if we only want to parse a page once, then I think we need to either 
> do all of the metadata and link extraction at fetch time, or have the 
> fetcher just store the raw content, then do parsing in a separate pass.

I would prefer the latter: have outlink harvesting and other analysis
done by separate tool(s). That way, the search engine operator has
more control. In my experience, PDF and Word files are often large
(10 MB is not atypical). They are not only slow to parse but also heavy
on memory (maybe the free/open parsers can be improved, but for now we
have to use what is available). This could be a burden on the crawler,
especially in multi-threaded runs; that memory should be reserved so
the crawler can make longer and larger crawls.

> 
> In either case, I think we need a single interface that combines your 
> Textable with Stefan's IContentExtractor.  In particular, I think it 
> should look something like:

I agree. I will get back to this after trying out his patch.

John

> 
> public interface ParsedContentFactory {
>   ParsedContent getParsedContent(FetchListEntry fle, Response response,
>                                  URL base);
> }
> 
> public interface ParsedContent {
>   String getText();
>   MetaData getMetaData();
>   Outlink[] getOutlinks();
> }
> 
> These will be implemented for each content type.
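A trivial plain-text implementation along these lines might look like the
sketch below. MetaData and Outlink are stand-in stubs here for the sake of a
self-contained example; the real Nutch types would differ.

```java
import java.util.Properties;

class MetaData extends Properties {}          // stub for illustration only

class Outlink {                               // stub for illustration only
  final String url;
  Outlink(String url) { this.url = url; }
}

interface ParsedContent {
  String getText();
  MetaData getMetaData();
  Outlink[] getOutlinks();
}

// Plain text: the whole content is the text, with no metadata or markup
// from which to harvest links.
class PlainTextContent implements ParsedContent {
  private final String text;
  PlainTextContent(String text) { this.text = text; }
  public String getText() { return text; }
  public MetaData getMetaData() { return new MetaData(); }
  public Outlink[] getOutlinks() { return new Outlink[0]; }
}

public class PlainTextDemo {
  public static void main(String[] args) {
    ParsedContent pc = new PlainTextContent("hello world");
    System.out.println(pc.getText());
    System.out.println(pc.getOutlinks().length);
  }
}
```

An HTML implementation would do the same, except that getOutlinks() returns
the anchors found in the page.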
> 
> Then we need an extension point like:
> 
> public interface DocumentFactory {
>   Document getDocument(String segment, long doc,
>                        FetcherOutput fo, FetcherText text);
> }
> 
> A base implementation would index all of the standard fields, and 
> subclasses could index other metadata.
> 
> Does this sound reasonable?
> 
> Doug
> 
> 
> 
> -------------------------------------------------------
> This SF.Net email is sponsored by: SourceForge.net Broadband
> Sign-up now for SourceForge Broadband and get the fastest
> 6.0/768 connection for only $19.95/mo for the first 3 months!
> http://ads.osdn.com/?ad_id=2562&alloc_id=6184&op=click
> _______________________________________________
> Nutch-developers mailing list
> [EMAIL PROTECTED]
> https://lists.sourceforge.net/lists/listinfo/nutch-developers
> 

