LanguageIdentifier should not set empty lang field on NutchDocument
-------------------------------------------------------------------

                 Key: NUTCH-936
                 URL: https://issues.apache.org/jira/browse/NUTCH-936
             Project: Nutch
          Issue Type: Bug
          Components: indexer
    Affects Versions: 1.2
            Reporter: Markus Jelsma
            Assignee: Markus Jelsma
            Priority: Minor
             Fix For: 1.3, 2.0


For reasons not yet identified, the language identifier plugin sometimes sets 
an empty value for the lang field. This is confirmed to occur in 1.2 when 
parsing a scanned PDF file that cannot be OCR'd into proper text. Whether or 
not the parser is at fault, the plugin itself should not add an empty value. 
The plugin already checks for a null value and sets the lang field to 
`unknown` in that case, which is fine. When the lang string is empty, it 
should also be set to `unknown`.
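
The proposed check can be sketched as follows. This is a minimal, standalone 
illustration and not the plugin's actual code: the class name 
`LangFieldSketch` and the helper `normalizeLang` are hypothetical, and the 
real plugin would apply the same test just before adding the lang field to 
the NutchDocument.

```java
// Hypothetical sketch; the names below are not taken from the Nutch source.
public class LangFieldSketch {

    // Fallback value that the issue says is already used for a null lang.
    static final String UNKNOWN = "unknown";

    // Returns the detected language, or "unknown" when it is null or empty.
    static String normalizeLang(String lang) {
        if (lang == null || lang.isEmpty()) {
            return UNKNOWN;
        }
        return lang;
    }

    public static void main(String[] args) {
        System.out.println(normalizeLang(null)); // null is mapped to the fallback
        System.out.println(normalizeLang(""));   // empty is mapped to the fallback
        System.out.println(normalizeLang("en")); // a real value passes through
    }
}
```

Whether a whitespace-only string should also be mapped to `unknown` is left 
open here; the sketch treats only null and the empty string as missing.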

This might break clients that have conditional logic on the empty value but 
not on the `unknown` value: `unknown` may never have occurred in their setup, 
so they might not have added it to their logic.

However, it might be overkill to put this change behind a configuration 
option and have Nutch keep its current behaviour by default. Any thoughts on 
this one?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
