[ https://issues.apache.org/jira/browse/NUTCH-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-1140:
---------------------------------

    Fix Version/s: 1.5

> index-more plugin, resetTitle method creates multiple values in the Title field
> -------------------------------------------------------------------------------
>
>                 Key: NUTCH-1140
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1140
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.3
>            Reporter: Joe Liedtke
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: MoreIndexingFilter.093011.patch
>
>
> From the comments in MoreIndexingFilter.java, the index-more plugin is meant
> to reset the Title field of a document if it contains a Content-Disposition
> header. The current behavior is to add a Title regardless of whether one
> exists or not, which can cause issues down the line with the Solr Indexing
> process, and based on a thread in the nutch user list it appears that this is
> causing some users to mark the title as multi-valued in the schema:
>
> http://www.lucidimagination.com/search/document/9440ff6b5deb285b/multiple_values_encountered_for_non_multivalued_field_title#17736c5807826be8
>
> The following patch removes the title field before adding a new one, which
> has resolved the issue for me:
>
> --- MoreIndexingFilter.old	2011-09-30 11:44:35.000000000 +0000
> +++ MoreIndexingFilter.java	2011-09-30 09:58:48.000000000 +0000
> @@ -276,6 +276,7 @@
>        for (int i=0; i<patterns.length; i++) {
>          if (matcher.contains(contentDisposition,patterns[i])) {
>            result = matcher.getMatch();
> +          doc.removeField("title");
>            doc.add("title", result.group(1));
>            break;
>          }

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
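
For readers who want to see the effect of the quoted patch in isolation, here is a minimal sketch. It makes several assumptions: `SimpleDoc` is a hypothetical stand-in for a multi-valued index document and is not the Nutch document class; the `FILENAME` pattern is an illustrative guess, not the plugin's actual pattern list; and `java.util.regex` is used in place of the matcher that MoreIndexingFilter.java uses. The `removeFirst` flag exists only to contrast the behavior before and after the patch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleResetSketch {

    /** Hypothetical multi-valued document; NOT the real Nutch class. */
    static class SimpleDoc {
        private final Map<String, List<String>> fields = new LinkedHashMap<>();

        /** Appends a value; existing values for the field are kept. */
        void add(String name, String value) {
            fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }

        /** Drops every value stored under the field name. */
        void removeField(String name) {
            fields.remove(name);
        }

        List<String> getValues(String name) {
            return fields.getOrDefault(name, new ArrayList<>());
        }
    }

    // Assumed pattern for illustration only: captures the filename parameter.
    private static final Pattern FILENAME =
            Pattern.compile("\\bfilename=['\"]?([^'\";]+)");

    /**
     * Sets the title from a Content-Disposition header value.
     * removeFirst=false mimics the reported behavior (title is appended);
     * removeFirst=true mimics the patched behavior (title is replaced).
     */
    static void resetTitle(SimpleDoc doc, String contentDisposition, boolean removeFirst) {
        if (contentDisposition == null) {
            return;
        }
        Matcher m = FILENAME.matcher(contentDisposition);
        if (m.find()) {
            if (removeFirst) {
                doc.removeField("title");
            }
            doc.add("title", m.group(1));
        }
    }

    public static void main(String[] args) {
        String header = "attachment; filename=\"report.pdf\"";

        SimpleDoc unpatched = new SimpleDoc();
        unpatched.add("title", "Parsed HTML title");
        resetTitle(unpatched, header, false);
        System.out.println(unpatched.getValues("title")); // two values remain

        SimpleDoc patched = new SimpleDoc();
        patched.add("title", "Parsed HTML title");
        resetTitle(patched, header, true);
        System.out.println(patched.getValues("title")); // a single value remains
    }
}
```

In this sketch the unpatched path leaves two title values on the document, which is the condition a single-valued title field in a Solr schema would reject, while the patched path leaves exactly one.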