[ https://issues.apache.org/jira/browse/NUTCH-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alfonso Presa updated NUTCH-1553:
---------------------------------
Description:
Maybe I'm doing something wrong, but it seems to me that the +NUTCH-1434+ patch
only works when using Tika's parser. When using parse-html, the "robots" metatag
is only populated if the parse-metatags plugin is enabled, and then only under
the prefix "metatag.". So parseData.getMeta("robots") returns nothing when not
using Tika.
I guess the simplest solution would be to provide a fallback: if
parseData.getMeta("robots") is null, use
parseData.getMeta("metatag.robots") instead.
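A minimal sketch of that fallback, just to illustrate the idea. It assumes a plain Map stands in for the parse metadata, and the class and method names (RobotsMetaFallback, robotsDirective, isNoIndex) are hypothetical; this is not the code from the NUTCH-1434 patch.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: a plain Map stands in for the parse metadata here.
class RobotsMetaFallback {

    // Prefer the "robots" key; fall back to the "metatag."-prefixed key
    // when "robots" is absent.
    static String robotsDirective(Map<String, String> parseMeta) {
        String robots = parseMeta.get("robots");
        if (robots == null) {
            robots = parseMeta.get("metatag.robots");
        }
        return robots;
    }

    // True when either key carries a "noindex" directive (case-insensitive).
    static boolean isNoIndex(Map<String, String> parseMeta) {
        String robots = robotsDirective(parseMeta);
        return robots != null
            && robots.toLowerCase(Locale.ROOT).contains("noindex");
    }
}
```

With this, a document whose metadata only has "metatag.robots" set to "noindex" would be treated the same way as one that has "robots" set directly.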
Also, the dependency of this property on the parse-metatags plugin when using
parse-html would be worth documenting somewhere
(nutch-default.xml?).
Thanks!
was:
Maybe I'm doing something wrong, but it seems to me that the +NUTCH-1434+ patch
only works when using Tika's parser. When using parse-html, the "robots" metatag
is only populated if the parse-metatags plugin is enabled, and then only under
the prefix "metatag.". So parseData.getMeta("robots") returns nothing when not
using Tika.
I suppose the simplest solution would be to provide a fallback: if
parseData.getMeta("robots") is null, use parseData.getMeta("metatag.robots")
instead.
Thanks!
> Property 'indexer.delete.robots.noindex' not working if using parser-html.
> --------------------------------------------------------------------------
>
> Key: NUTCH-1553
> URL: https://issues.apache.org/jira/browse/NUTCH-1553
> Project: Nutch
> Issue Type: Bug
> Components: indexer, parser
> Affects Versions: 1.6
> Reporter: Alfonso Presa
> Priority: Minor