[
https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kristof updated NUTCH-1406:
----------------------------
Description:
This improvement to the index-metatags plugin (sometimes also referred to as the
parse-metatags plugin) allows selected fields to be converted to the Solr
date format and prevents parsing/indexing of metatags that do not contain any
content.
To convert the values of selected metatags to the Solr date format, you
must list them in the metatags.convert property in nutch-site.xml. The example
used here is the extended Dublin Core element dcterms.modified, with the seed
URL http://www.cic.gc.ca/.
dcterms.modified must also be defined in the metatags.names property.
{code}
<property>
<name>metatags.convert</name>
<value>dcterms.modified</value>
<description>For plugin index-metadata: Indicate here the name of the
html meta tag that should be converted to Solr date format.
</description>
</property>
{code}
I have read that SimpleDateFormat is not a robust solution, so this
improvement might have some problems.
So far it has worked well for me. More details about the changes follow below.
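As an illustration only, a conversion of this kind could be sketched as follows. This is not the code from the attached patch: the class name SolrDateSketch, the method toSolrDate, and the assumed input pattern yyyy-MM-dd are hypothetical. The output pattern is the UTC form that Solr date fields accept (for example 1995-12-31T23:59:59Z).

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class SolrDateSketch {

    // Hypothetical helper, not taken from the attached patch.
    // Assumption: the metatag value looks like "2012-06-21" (yyyy-MM-dd).
    static String toSolrDate(String metaValue) throws ParseException {
        SimpleDateFormat in = new SimpleDateFormat("yyyy-MM-dd", Locale.ROOT);
        in.setTimeZone(TimeZone.getTimeZone("UTC"));
        in.setLenient(false); // reject out-of-range values such as month 13
        Date parsed = in.parse(metaValue.trim());

        // Solr expects UTC timestamps ending in a literal 'Z'.
        SimpleDateFormat out =
            new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.ROOT);
        out.setTimeZone(TimeZone.getTimeZone("UTC"));
        return out.format(parsed);
    }

    public static void main(String[] args) throws ParseException {
        System.out.println(toSolrDate("2012-06-21"));
    }
}
```

Both formatters are pinned to UTC so the result does not depend on the time zone of the machine running the crawl. SimpleDateFormat is not thread-safe, which is why the sketch creates new instances on every call instead of sharing them.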
was:
This improvement to the index-metatags plugin (sometimes also referred to as the
parse-metatags plugin) allows selected fields to be converted to the Solr
date format and prevents parsing/indexing of metatags that do not contain any
content.
To convert the values of selected metatags to the Solr date format, you
must list them in the metatags.convert property in nutch-site.xml. The example
used here is the simple Dublin Core element dc.date. It must also be defined in
the metatags.names property.
{code}
<property>
<name>metatags.convert</name>
<value>dc.date</value>
<description>For plugin index-metatags: Indicate here the name of the
html meta tag that should be converted to date format.
</description>
</property>
{code}
I have read that SimpleDateFormat is not a robust solution, so this
improvement might have some problems.
So far it has worked well for me. More details about the changes follow below.
> Metatags-index/-parse plugin: conversion to Solr date format and prevents
> parsing/indexing of empty tags
> --------------------------------------------------------------------------------------------------------
>
> Key: NUTCH-1406
> URL: https://issues.apache.org/jira/browse/NUTCH-1406
> Project: Nutch
> Issue Type: Improvement
> Components: indexer, parser
> Reporter: Kristof
> Priority: Minor
> Labels: conversion, date
> Attachments: index-metadata-plugin.patch, index-metatags.jar