It would be great if a new or modified extension point would allow us to add filters which have access to the textual content of a document, no matter whether the original was HTML, PDF, Word or whatever.
We could add a new ParseFilter extension point. Its interface would have just:
Parse filter(Parse parse);
All implementations could be run on all parses.
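As a rough sketch of what this could look like (the class and method names here are assumptions, and Parse is a minimal stand-in for the real org.apache.nutch.parse.Parse, modeling only the text accessor):

```java
// Minimal stand-in for Nutch's Parse, modeling only the extracted text.
class Parse {
    private final String text;
    Parse(String text) { this.text = text; }
    String getText() { return text; }
}

// The proposed extension point: one method, invoked on every successful
// parse regardless of the original format (HTML, PDF, Word, ...).
interface ParseFilter {
    Parse filter(Parse parse);
}

public class ParseFilterDemo {
    // All registered filter implementations would be applied in sequence.
    static Parse runFilters(Parse parse, ParseFilter... filters) {
        for (ParseFilter f : filters) {
            parse = f.filter(parse);
        }
        return parse;
    }

    public static void main(String[] args) {
        // Example filter: lowercase the extracted text.
        ParseFilter lowercase = p -> new Parse(p.getText().toLowerCase());
        System.out.println(runFilters(new Parse("Some Extracted Text"), lowercase).getText());
        // prints: some extracted text
    }
}
```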
But is this really needed?
Whitespace removal could be done with one plugin for all (text) formats.
Another use case would be the language identifier (or some other sort of categorizer). It would be possible to do n-gram language identification already at that point, and it would in turn open up the possibility of using localized stop-word/profanity/whatever lists, stemmers, etc. at later stages.
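To illustrate the shape of the n-gram technique mentioned above (this is a toy sketch, not Nutch code; the class name, the overlap scoring, and the tiny "training" samples are all illustrative — real identifiers use large per-language profiles):

```java
import java.util.HashMap;
import java.util.Map;

// Toy n-gram language identification: build character trigram frequency
// profiles per language, then pick the language whose profile best
// overlaps the document's.
public class NgramLangId {
    // Count all character n-grams of the given length in the text.
    public static Map<String, Integer> profile(String text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        String s = text.toLowerCase();
        for (int i = 0; i + n <= s.length(); i++) {
            counts.merge(s.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    // Simple overlap score: sum of the minimum counts of shared n-grams.
    static int overlap(Map<String, Integer> a, Map<String, Integer> b) {
        int score = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            score += Math.min(e.getValue(), b.getOrDefault(e.getKey(), 0));
        }
        return score;
    }

    // Return the language whose profile scores highest against the document.
    public static String identify(String doc, Map<String, Map<String, Integer>> langs) {
        Map<String, Integer> docProfile = profile(doc, 3);
        String best = null;
        int bestScore = -1;
        for (Map.Entry<String, Map<String, Integer>> e : langs.entrySet()) {
            int s = overlap(docProfile, e.getValue());
            if (s > bestScore) { bestScore = s; best = e.getKey(); }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> langs = new HashMap<>();
        langs.put("en", profile("the quick brown fox jumps over the lazy dog and then the cat", 3));
        langs.put("de", profile("der schnelle braune fuchs springt ueber den faulen hund und die katze", 3));
        System.out.println(identify("the dog and the fox", langs));
    }
}
```

Running this at parse time, rather than at indexing time, is what would let later stages select localized word lists or stemmers.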
Language identification can already be done reasonably well by an indexing filter. I don't see any advantage to moving it here. Am I missing something?
And stemming and other word-based operations should be performed in Lucene Analyzers, while indexing. Nutch does not yet permit a plugin here, but this might eventually make sense.
So is whitespace normalization alone enough to justify this? I wonder if instead parser implementations might just use a utility class that removes excess whitespace...
Doug
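A utility class of the kind described above might look like this (the class and method names are assumptions, not existing Nutch code):

```java
// Hypothetical helper that parser implementations could call before
// emitting extracted text, instead of going through a filter plugin.
public class TextUtil {
    // Collapse any run of whitespace (spaces, tabs, newlines) to a single
    // space and strip leading/trailing whitespace.
    public static String normalizeWhitespace(String text) {
        if (text == null) return "";
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(normalizeWhitespace("  foo \n\t bar   baz "));
        // prints: foo bar baz
    }
}
```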
_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
