[ https://issues.apache.org/jira/browse/NUTCH-936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-936:
--------------------------------
Description:
For some reason the language identifier plugin sometimes sets an empty value
for the lang field. It is confirmed to occur in 1.2 when parsing a scanned PDF
file which cannot be OCR'd to proper text, resulting in an empty content field.
Anyway, whether it's a problem with the parser or not, the plugin itself should
not add an empty value because the content field can always be empty. The
plugin already checks for a null value and then sets the lang field to
`unknown`, which is fine. But when the lang string is empty, it should also be
set to `unknown`.
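
To make the proposed check concrete, here is a minimal sketch of the
null-or-empty fallback described above. The class and method names are
hypothetical and are not taken from the plugin source; only the `unknown`
fallback value comes from this issue.

```java
// Hypothetical helper: the class and method names are illustrative only and
// are not taken from the Nutch language identifier plugin.
final class LangFieldNormalizer {

    // Fallback value that, per this issue, the plugin already uses for null.
    static final String UNKNOWN = "unknown";

    // Returns "unknown" for a null or empty language code; any other value
    // is returned unchanged.
    static String normalize(String lang) {
        if (lang == null || lang.isEmpty()) {
            return UNKNOWN;
        }
        return lang;
    }

    public static void main(String[] args) {
        System.out.println(normalize(null)); // prints "unknown"
        System.out.println(normalize(""));   // prints "unknown"
        System.out.println(normalize("nl")); // prints "nl"
    }
}
```

With a check like this, the empty lang string produced for the scanned PDF
would be indexed as `unknown`, the same as the existing null case.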
This might break clients that have conditional logic on the empty value but
not on the `unknown` value: it may never have occurred in their setup, so they
might not have added `unknown` to their logic. However, it might seem like
overkill to put this change behind a configuration option and let Nutch
continue to behave by default as it currently does. Any thoughts on this one?
Here's the troublesome URL:
http://www.nrc.nl/redactie/binnenland/memo_buza_irak.pdf
It returns an empty content field and an empty lang string in 1.2, and
presumably in trunk and other versions as well.
was:
For some reason the language identifier plugin sometimes sets an empty value
for the lang field. It is confirmed to occur in 1.2 when parsing a scanned PDF
file which cannot be OCR'd to proper text. Anyway, whether it's a problem with
the parser or not, the plugin itself should not add an empty value. The plugin
already checks for a null value and then sets the lang field to `unknown`,
which is fine. But when the lang string is empty, it should also be set to
`unknown`.
This might break clients that have conditional logic on the empty value but
not on the `unknown` value: it may never have occurred in their setup, so they
might not have added `unknown` to their logic.
However, it might seem like overkill to put this change behind a
configuration option and let Nutch continue to behave by default as it
currently does. Any thoughts on this one?
> LanguageIdentifier should not set empty lang field on NutchDocument
> -------------------------------------------------------------------
>
> Key: NUTCH-936
> URL: https://issues.apache.org/jira/browse/NUTCH-936
> Project: Nutch
> Issue Type: Bug
> Components: indexer
> Affects Versions: 1.2
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.3, 2.0
>
>
> For some reason the language identifier plugin sometimes sets an empty value
> for the lang field. It is confirmed to occur in 1.2 when parsing a scanned
> PDF file which cannot be OCR'd to proper text, resulting in an empty content
> field. Anyway, whether it's a problem with the parser or not, the plugin
> itself should not add an empty value because the content field can always be
> empty. The plugin already checks for a null value and then sets the lang
> field to `unknown`, which is fine. But when the lang string is empty, it
> should also be set to `unknown`.
> This might break clients that have conditional logic on the empty value but
> not on the `unknown` value: it may never have occurred in their setup, so
> they might not have added `unknown` to their logic. However, it might seem
> like overkill to put this change behind a configuration option and let Nutch
> continue to behave by default as it currently does. Any thoughts on this one?
> Here's the troublesome URL:
> http://www.nrc.nl/redactie/binnenland/memo_buza_irak.pdf
> It returns an empty content field and an empty lang string in 1.2, and
> presumably in trunk and other versions as well.