LanguageIdentifier should not set empty lang field on NutchDocument
-------------------------------------------------------------------
Key: NUTCH-936
URL: https://issues.apache.org/jira/browse/NUTCH-936
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 1.2
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
Fix For: 1.3, 2.0
For some reason, the language identifier plugin sometimes sets an empty value
for the lang field. This is confirmed to occur in 1.2 when parsing a scanned PDF
file that cannot be OCR'd to proper text. Whether or not the problem lies with
the parser, the plugin itself should not add an empty value. The plugin
already checks for a null value and then sets the lang field to `unknown`,
which is fine. When the lang string is empty, it should also be set to
`unknown`.
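The proposed behaviour can be sketched roughly as follows. This is only an
illustration, not the plugin's actual code: the class `LangFieldNormalizer`
and its `normalize` method are hypothetical names, and the sketch treats
only null and the zero-length string as missing, as described above.

```java
// Hypothetical helper, not part of Nutch: illustrates the proposed check only.
final class LangFieldNormalizer {

    // Fallback value the plugin already uses when the language is null.
    static final String UNKNOWN = "unknown";

    private LangFieldNormalizer() {
        // Static helper, not meant to be instantiated.
    }

    // Returns the detected language, or "unknown" when it is null or empty.
    static String normalize(String lang) {
        if (lang == null || lang.isEmpty()) {
            return UNKNOWN;
        }
        return lang;
    }
}
```

With a check like this, the value written to the lang field would be
`LangFieldNormalizer.normalize(lang)` rather than the raw detected string, so
an empty result from the parser would be indexed as `unknown`.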
This might break clients that have conditional logic on the empty value but
not on the `unknown` value: if `unknown` has never occurred in their setup,
they may not have added it to their logic.
However, it might be overkill to put this change behind a
configuration option and let Nutch continue to behave by default as it
currently does. Any thoughts on this one?