[ https://issues.apache.org/jira/browse/NUTCH-936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-936:
--------------------------------

    Patch Info: [Patch Available]

> LanguageIdentifier should not set empty lang field on NutchDocument
> -------------------------------------------------------------------
>
>                 Key: NUTCH-936
>                 URL: https://issues.apache.org/jira/browse/NUTCH-936
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.2
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.3, 2.0
>
>         Attachments: NUTCH-936-v12-1.patch, NUTCH-936-v13-1.patch,
> NUTCH-936-v13-1.patch
>
>
> For some reason the language identifier plugin sometimes sets an empty value
> for the lang field. It is confirmed to occur in 1.2 when parsing a scanned
> PDF file which cannot be OCR'd to proper text, resulting in an empty content
> field. Whether or not this is a problem with the parser, the plugin itself
> should not add an empty value, because the content field can always be
> empty. The plugin already checks for a null value and then sets the lang
> field to `unknown`, which is fine. But when the lang string is empty, it
> should also be set to `unknown`.
>
> This might break clients that have conditional logic on the empty value but
> not on the `unknown` value, because `unknown` may never have occurred in
> their setup and therefore they might not have added it to their logic.
> However, it seems like overkill to put this proposal behind a configuration
> option and let Nutch continue to behave by default as it currently does. Any
> thoughts on this one?
>
> Here's the troublesome URL:
> http://www.nrc.nl/redactie/binnenland/memo_buza_irak.pdf
> It returns an empty content field and an empty lang string in 1.2, and
> presumably in trunk and other versions as well.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
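
The check proposed in the issue description can be sketched roughly as follows. This is a minimal standalone illustration, not the code from the attached patches: the class name `LangFieldSketch` and the method `normalizeLang` are hypothetical, and treating whitespace-only strings as empty is an assumption that goes beyond what the issue states.

```java
// Sketch only: LangFieldSketch and normalizeLang are hypothetical names,
// not part of Nutch or of the attached patches.
final class LangFieldSketch {

    /** Placeholder value that, per the issue, the plugin already uses for a null language. */
    static final String UNKNOWN = "unknown";

    private LangFieldSketch() {}

    /**
     * Returns the detected language code, or "unknown" when it is null or empty.
     * Treating whitespace-only strings as empty is an assumption of this sketch.
     */
    static String normalizeLang(String lang) {
        if (lang == null || lang.trim().isEmpty()) {
            return UNKNOWN;
        }
        return lang;
    }
}
```

An indexing filter would presumably pass the identifier's result through such a helper before adding the lang field, so that a document with an empty content field ends up with `unknown` rather than an empty string.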