Is there a way the fetcher could be extended, not necessarily through a plugin interface per se, but by reading an XML document that describes how to handle specific file types?
For example, many of the PDF-to-HTML, Word-to-HTML and similar applications already translate content into HTML source, so the fetcher wouldn't need to be extended to do this itself. It would be best if there were an XML document that describes the program to call and the variables passed to that external program to handle the translation of the document into "HTML". For instance, if the fetcher sees a PDF, it would call the command as defined in the XML file to handle the PDF document, and it would just know to parse the results from the converted document, thus allowing your cached copy to be an HTML document like Google does. This way extensions could be defined as any program that has input/output (the Unix way), and not necessarily a plugin that requires Java knowledge or rewrites into Java of what is already done elsewhere.

Heck, I wouldn't mind even having the ability to define separate indices for each data type and use the distributed search to consolidate them, so you could search FTP, PDF, HTML and Word as separate entities fairly easily :)

--- [EMAIL PROTECTED] wrote:
> However I am in favor of the Unix way: a tool should only
> do one task and do it well. The crawlers (Fetcher.java and
> RequestScheduler.java) need only concern themselves with going
> out to fetch URLs. Currently they do text stripping (on
> text/html), mostly for the purpose of outlink extraction. Since
> there are only a few file formats that have a meaningful amount
> of embedded links worth harvesting, the benefit of having a
> full-blown plugin system in the crawler (for the sole purpose of
> outlink extraction) is not that great.
> This is not to say plugin systems are not needed by Nutch.
> I can imagine plugin systems being used by separate tools
> specialized in content analysis, clustering, etc.
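To make the XML idea from the top of this message a bit more concrete, here is a rough sketch in Java of how the fetcher might load such a file and turn an entry into an external command. Everything in it is an assumption for illustration only: the `<converters>`/`<converter>` element names, the `type` and `command` attributes, the `{input}` placeholder, and the converter program names are all made up, and none of this is existing Nutch code.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class ExternalConverters {

    // Hypothetical config: one <converter> per content type.
    // The program names are placeholders, not real tools.
    static final String SAMPLE_CONFIG =
        "<converters>"
      + "<converter type=\"application/pdf\" command=\"pdf-to-html {input}\"/>"
      + "<converter type=\"application/msword\" command=\"word-to-html {input}\"/>"
      + "</converters>";

    // Parse the XML and return a map from content type to command template.
    static Map<String, String> load(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList nodes = doc.getElementsByTagName("converter");
        Map<String, String> commands = new LinkedHashMap<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element e = (Element) nodes.item(i);
            commands.put(e.getAttribute("type"), e.getAttribute("command"));
        }
        return commands;
    }

    // Split the template on whitespace and substitute the fetched file's
    // path for the {input} token, yielding an argv list (no shell involved).
    static List<String> buildCommand(String template, String inputPath) {
        List<String> argv = new ArrayList<>();
        for (String token : template.trim().split("\\s+")) {
            argv.add(token.equals("{input}") ? inputPath : token);
        }
        return argv;
    }

    // Run the external program and return whatever it writes to stdout,
    // which the fetcher would then treat as HTML. Requires Java 9+.
    static String convert(List<String> argv) throws Exception {
        Process p = new ProcessBuilder(argv)
            .redirectError(ProcessBuilder.Redirect.INHERIT)
            .start();
        String html = new String(p.getInputStream().readAllBytes(),
                                 StandardCharsets.UTF_8);
        int exit = p.waitFor();
        if (exit != 0) {
            throw new IllegalStateException("converter exited with status " + exit);
        }
        return html;
    }
}
```

The fetcher would look up the content type of a fetched document in the map, build the argument list, and feed the returned HTML to its normal text/html handling. Passing an argument list to `ProcessBuilder` rather than a single shell string avoids quoting problems with odd file names.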
_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
