index-more plugin, resetTitle method creates multiple values in the Title field
-------------------------------------------------------------------------------
Key: NUTCH-1140
URL: https://issues.apache.org/jira/browse/NUTCH-1140
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 1.3
Reporter: Joe Liedtke
Priority: Minor
Attachments: MoreIndexingFilter.093011.patch
From the comments in MoreIndexingFilter.java, the index-more plugin is meant
to reset the Title field of a document if it contains a Content-Disposition
header. The current behavior is to add a Title regardless of whether one
already exists, which leaves the field with multiple values and can cause
problems later in the Solr indexing process. Based on messages on the Nutch
user list, this appears to be leading some users to mark the title field as
multi-valued in the schema:
http://www.lucidimagination.com/search/document/9440ff6b5deb285b/multiple_values_encountered_for_non_multivalued_field_title#17736c5807826be8
The following patch removes the title field before adding a new one, which has
resolved the issue for me:
--- MoreIndexingFilter.old 2011-09-30 11:44:35.000000000 +0000
+++ MoreIndexingFilter.java 2011-09-30 09:58:48.000000000 +0000
@@ -276,6 +276,7 @@
       for (int i=0; i<patterns.length; i++) {
         if (matcher.contains(contentDisposition,patterns[i])) {
           result = matcher.getMatch();
+          doc.removeField("title");
           doc.add("title", result.group(1));
           break;
         }
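
For readers without a Nutch checkout, the remove-then-add behavior of the
patch can be sketched in isolation. This is only an illustration under stated
assumptions: Doc is a hypothetical stand-in for the indexing document (not the
Nutch class), the filename pattern is illustrative rather than the plugin's
own, and java.util.regex is used in place of the matcher library the plugin
uses.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleResetSketch {

    /** Hypothetical stand-in for a multi-valued index document; not the Nutch class. */
    static class Doc {
        private final Map<String, List<String>> fields = new HashMap<>();

        /** Appends a value; any existing values for the field are kept. */
        void add(String name, String value) {
            fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }

        /** Drops every value stored under the field name. */
        void removeField(String name) {
            fields.remove(name);
        }

        /** Returns the values stored under the field name, or an empty list. */
        List<String> values(String name) {
            return fields.getOrDefault(name, new ArrayList<>());
        }
    }

    // Illustrative pattern only (an assumption); the plugin's real patterns may differ.
    private static final Pattern FILENAME =
        Pattern.compile("\\bfilename=['\"]?([^'\";]+)", Pattern.CASE_INSENSITIVE);

    /** Replaces the title with the Content-Disposition filename, if one is present. */
    static void resetTitle(Doc doc, String contentDisposition) {
        if (contentDisposition == null) {
            return;
        }
        Matcher m = FILENAME.matcher(contentDisposition);
        if (m.find()) {
            doc.removeField("title"); // without this line the title gains a second value
            doc.add("title", m.group(1).trim());
        }
    }

    public static void main(String[] args) {
        Doc doc = new Doc();
        doc.add("title", "Title from the parser");
        resetTitle(doc, "attachment; filename=\"report.pdf\"");
        System.out.println(doc.values("title"));
    }
}
```

With the removeField call in place, the title field ends up holding a single
value taken from the header. Commenting that call out leaves both the parsed
title and the filename in the field, which is the multi-valued condition
described above.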