[ 
https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13398420#comment-13398420
 ] 

Julien Nioche commented on NUTCH-1406:
--------------------------------------

bq. index-metatags plugin (sometimes also referred to as parse-metatags plugin) 

For the sake of clarification, this patch is about index-metadata, not 
parse-metatags (which was index-metatags at one point). This confusion explains 
why the patch is definitely wrong: you're basically replacing a more advanced 
version with the older and more primitive index-metatags (with the added twist 
of date conversion). What you could do instead is keep the existing 
MetadataIndexer and specify via configuration which field names should be 
converted, e.g. a property index.md.date whose value is a comma-separated list 
of field names.
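
A minimal sketch of the configuration-driven approach suggested above, not the actual MetadataIndexer code. It assumes the hypothetical property index.md.date holds a comma-separated list of field names, that the raw meta tag values look like yyyy-MM-dd (an assumption about the input, not something stated in the issue), and it uses java.time instead of the SimpleDateFormat mentioned in the description. The class and method names are illustrative only.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.LinkedHashSet;
import java.util.Optional;
import java.util.Set;

// Hypothetical helper; not part of Nutch.
class DateFieldConverter {

    // Splits a comma-separated property value (e.g. the value of the
    // hypothetical index.md.date property) into trimmed, non-empty names.
    static Set<String> parseFieldNames(String value) {
        Set<String> names = new LinkedHashSet<>();
        if (value == null) {
            return names;
        }
        for (String part : value.split(",")) {
            String name = part.trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    // Converts a yyyy-MM-dd value (assumed input format) to an ISO-8601 UTC
    // timestamp at midnight, the form Solr date fields accept.
    // Empty or unparsable values yield Optional.empty() so the caller can
    // skip indexing them.
    static Optional<String> toSolrDate(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return Optional.empty();
        }
        try {
            LocalDate date = LocalDate.parse(raw.trim(), DateTimeFormatter.ISO_LOCAL_DATE);
            return Optional.of(date.atStartOfDay(ZoneOffset.UTC).toInstant().toString());
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Example property value; dc.date is an illustrative second field.
        Set<String> dateFields = parseFieldNames("dcterms.modified, dc.date");
        String fieldName = "dcterms.modified";
        if (dateFields.contains(fieldName)) {
            // Only fields listed in the configuration are converted.
            toSolrDate("2012-06-21").ifPresent(v -> System.out.println(fieldName + " = " + v));
        }
    }
}
```

With this shape the existing indexing logic stays untouched and only the fields named in the configuration go through the conversion; values that are empty or cannot be parsed are simply not emitted.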

                
> Metatags-index/-parse plugin: conversion to Solr date format and prevents 
> parsing/indexing of empty tags
> --------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1406
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1406
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, parser
>            Reporter: Kristof 
>            Priority: Minor
>              Labels: conversion, date
>         Attachments: index-metadata-plugin.patch, index-metatags.jar
>
>
> This improvement to the index-metatags plugin (sometimes also referred to as 
> parse-metatags plugin) allows for conversion of selected fields to the Solr 
> date format and prevents parsing/indexing of metatags that do not contain any 
> content.
> In order to convert the values of selected metatags to Solr date format, you 
> must list them in the metatags.convert property in nutch-site.xml. The 
> example used is an extended Dublin Core element, dcterms.modified, with the 
> seed url http://www.cic.gc.ca/. 
> dcterms.modified must also be defined in the metatags.names property.
> {code}
> <property>
>       <name>metatags.convert</name>
>       <value>dcterms.modified</value>
>       <description>For plugin index-metadata: Indicate here the name of the 
> html meta tag that should be converted to Solr date format.
>       </description>
> </property>
> {code}
> I read that SimpleDateFormat is not a robust solution, so this 
> improvement might have some problems.
> So far it has worked well for me. More details about the changes are below.
> Please note:
> The attached jar-file was originally taken from NUTCH-809 
> (https://issues.apache.org/jira/browse/NUTCH-809). The plugin and tutorial 
> there do not necessarily match the index-metadata plugin in subversion.

        

Reply via email to