It would be great if a new or modified extension point would allow us to add filters which have access to the textual content of a document, no matter whether the original was HTML, PDF, Word or whatever.
We could add a new ParseFilter extension point. Its interface would have just:
Parse filter(Parse parse);
All implementations could be run on all parses.
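As a rough sketch of what this could look like (the class and method names here are assumptions, and Parse is a minimal stand-in for the real org.apache.nutch.parse.Parse, modeling only the text accessor):

```java
// Minimal stand-in for Nutch's Parse, modeling only the extracted text.
class Parse {
    private final String text;
    Parse(String text) { this.text = text; }
    String getText() { return text; }
}

// The proposed extension point: one method, invoked on every successful
// parse regardless of the original format (HTML, PDF, Word, ...).
interface ParseFilter {
    Parse filter(Parse parse);
}

public class ParseFilterDemo {
    // All registered filter implementations would be applied in sequence.
    static Parse runFilters(Parse parse, ParseFilter... filters) {
        for (ParseFilter f : filters) {
            parse = f.filter(parse);
        }
        return parse;
    }

    public static void main(String[] args) {
        // Example filter: lowercase the extracted text.
        ParseFilter lowercase = p -> new Parse(p.getText().toLowerCase());
        System.out.println(runFilters(new Parse("Some Extracted Text"), lowercase).getText());
        // prints: some extracted text
    }
}
```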
But is this really needed?
Whitespace removal could be done with one plugin for all (text) formats.
Another use case would be the language identifier (or some other sort of categorizer). It would be possible to do n-gram language identification already at that point, and it would in turn open up the possibility of using localized stop-word/profanity/whatever lists, stemmers, etc. at later stages.
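To illustrate the shape of the n-gram technique mentioned above (this is a toy sketch, not Nutch code; the class name, the overlap scoring, and the tiny "training" samples are all illustrative — real identifiers use large per-language profiles):

```java
import java.util.HashMap;
import java.util.Map;

// Toy n-gram language identification: build character trigram frequency
// profiles per language, then pick the language whose profile best
// overlaps the document's.
public class NgramLangId {
    // Count all character n-grams of the given length in the text.
    public static Map<String, Integer> profile(String text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        String s = text.toLowerCase();
        for (int i = 0; i + n <= s.length(); i++) {
            counts.merge(s.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    // Simple overlap score: sum of the minimum counts of shared n-grams.
    static int overlap(Map<String, Integer> a, Map<String, Integer> b) {
        int score = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            score += Math.min(e.getValue(), b.getOrDefault(e.getKey(), 0));
        }
        return score;
    }

    // Return the language whose profile scores highest against the document.
    public static String identify(String doc, Map<String, Map<String, Integer>> langs) {
        Map<String, Integer> docProfile = profile(doc, 3);
        String best = null;
        int bestScore = -1;
        for (Map.Entry<String, Map<String, Integer>> e : langs.entrySet()) {
            int s = overlap(docProfile, e.getValue());
            if (s > bestScore) { bestScore = s; best = e.getKey(); }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> langs = new HashMap<>();
        langs.put("en", profile("the quick brown fox jumps over the lazy dog and then the cat", 3));
        langs.put("de", profile("der schnelle braune fuchs springt ueber den faulen hund und die katze", 3));
        System.out.println(identify("the dog and the fox", langs));
    }
}
```

Running this at parse time, rather than at indexing time, is what would let later stages select localized word lists or stemmers.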
Language identification can already be done reasonably well by an indexing filter. I don't see any advantage to moving it here. Am I missing something?
And stemming and other word-based operations should be performed in Lucene Analyzers, while indexing. Nutch does not yet permit a plugin here, but this might eventually make sense.
So is whitespace normalization alone enough to justify this? I wonder if instead parser implementations might just use a utility class that removes excess whitespace...
Doug
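A utility class of the kind described above might look like this (the class and method names are assumptions, not existing Nutch code):

```java
// Hypothetical helper that parser implementations could call before
// emitting extracted text, instead of going through a filter plugin.
public class TextUtil {
    // Collapse any run of whitespace (spaces, tabs, newlines) to a single
    // space and strip leading/trailing whitespace.
    public static String normalizeWhitespace(String text) {
        if (text == null) return "";
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(normalizeWhitespace("  foo \n\t bar   baz "));
        // prints: foo bar baz
    }
}
```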
_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
