Is there a way the fetcher could be extended, not
necessarily through a plugin interface per se, but by
reading an XML document that describes how to handle
specific file types?

For example, many PDF-to-HTML, Word-to-HTML and
similar applications already translate content into
HTML source, so the fetcher wouldn't need to be
extended to do this itself. What would be best is an
XML document that describes the program to call and
the variables passed to that external program to
handle the translation of the doc into "html".

For instance, if the fetcher sees a PDF it would call
the command as defined in the XML file to handle the
PDF document, and it would just know to parse the
results from the converted document - thus allowing
your cached copy to be an HTML document like Google does.
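
To make the idea concrete, here is a rough sketch in Python of what
such a descriptor and its lookup might look like. Everything in it is
an assumption for illustration: the element and attribute names
(handlers, handler, mime, command, arg), the {input} placeholder, and
the converter names pdf2html and doc2html are made up, and nothing
like this exists in the fetcher today.

```python
import xml.etree.ElementTree as ET

# Hypothetical descriptor format: all element and attribute names,
# the {input} placeholder, and both converter commands are assumptions.
HANDLERS_XML = """
<handlers>
  <handler mime="application/pdf" output="text/html">
    <command>pdf2html</command>
    <arg>--stdout</arg>
    <arg>{input}</arg>
  </handler>
  <handler mime="application/msword" output="text/html">
    <command>doc2html</command>
    <arg>{input}</arg>
  </handler>
</handlers>
"""


def load_handlers(xml_text):
    """Map each MIME type to its argv template (command followed by args)."""
    handlers = {}
    for node in ET.fromstring(xml_text).findall("handler"):
        argv = [node.findtext("command").strip()]
        argv += [arg.text.strip() for arg in node.findall("arg")]
        handlers[node.get("mime")] = argv
    return handlers


def build_command(handlers, mime_type, input_path):
    """Return the argv for this MIME type, or None if no handler is defined."""
    template = handlers.get(mime_type)
    if template is None:
        return None
    # Plain string replacement, so other braces in an argument are left alone.
    return [part.replace("{input}", input_path) for part in template]


handlers = load_handlers(HANDLERS_XML)
print(build_command(handlers, "application/pdf", "/tmp/fetched/doc.pdf"))
```

A type with no entry returns None, so the fetcher could fall back to
whatever it does today for unknown content.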

This way extensions could be defined as any program
that has input/output (the unix way) and not necessarily
a plugin that requires Java knowledge or rewrites of
what is already done elsewhere into Java.
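
As a sketch of the input/output idea, the snippet below pipes fetched
bytes to an external program on stdin and reads the converted HTML
from stdout. The function name convert_with_filter and the 30 second
timeout are my own choices, and the "converter" here is only a
stand-in (a one-line Python filter) so the example runs anywhere; a
real setup would substitute whatever converter is installed.

```python
import subprocess
import sys


def convert_with_filter(argv, raw_bytes, timeout=30):
    """Pipe fetched bytes to an external program; return its stdout bytes."""
    result = subprocess.run(
        argv,
        input=raw_bytes,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
    )
    if result.returncode != 0:
        message = result.stderr.decode(errors="replace")
        raise RuntimeError("converter failed: " + message)
    return result.stdout


# Stand-in converter (an assumption, not a real tool): wraps stdin in <pre>.
STAND_IN = [
    sys.executable,
    "-c",
    "import sys; d = sys.stdin.buffer.read(); "
    "sys.stdout.buffer.write(b'<html><body><pre>' + d + b'</pre></body></html>')",
]

html = convert_with_filter(STAND_IN, b"hello from a fetched document")
print(html.decode())
```

The calling side never needs to know what language the converter is
written in, only its command line and that it exits with status 0 on
success.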

Heck, I wouldn't mind even having the ability to
define separate indices for each data type and use the
distributed search to consolidate them, so that you
could search ftp, pdf, html and word as separate
entities fairly easily :)
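
Roughly what I have in mind for the consolidation step, as a small
Python sketch. It does not use the actual distributed search code; the
function name merge_hits, the (score, url) hit shape and the sample
data are all made up, and it assumes scores from the separate indices
are comparable, which may not hold in practice.

```python
import heapq
from itertools import islice


def merge_hits(per_type_hits, limit=10):
    """Merge per-index hit lists, each already sorted by descending score."""
    tagged = (
        [(score, doc_type, url) for score, url in hits]
        for doc_type, hits in per_type_hits.items()
    )
    merged = heapq.merge(*tagged, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, limit))


# Made-up sample results from two hypothetical per-type indices.
sample = {
    "html": [(0.9, "http://example.com/a.html"), (0.4, "http://example.com/b.html")],
    "pdf": [(0.7, "http://example.com/c.pdf")],
}

for score, doc_type, url in merge_hits(sample):
    print(score, doc_type, url)
```

Searching a single type is then just a matter of querying one index
and skipping the merge.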

--- [EMAIL PROTECTED] wrote:

> However, I am in favor of the unix way: a tool should
> do only one task and do it well. The crawlers
> (Fetcher.java and RequestScheduler.java) need only
> concern themselves with going out to fetch urls.
> Currently they do text stripping (on text/html),
> mostly for the purpose of outlink extraction. Since
> only a few file formats have a meaningful amount of
> embedded links worth harvesting, the benefit of having
> a full-blown plugin system in the crawler (for the
> sole purpose of outlink extraction) is not that great.
> This is not to say plugin systems are not needed by
> Nutch. I can imagine plugin systems being used by
> separate tools specialized in content analysis,
> clustering, etc.




_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
