[ https://issues.apache.org/jira/browse/NUTCH-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268444#comment-14268444 ]
Hudson commented on NUTCH-1140:
-------------------------------
SUCCESS: Integrated in Nutch-trunk #2923 (See
[https://builds.apache.org/job/Nutch-trunk/2923/])
NUTCH-1140 index-more plugin, resetTitle creates multiple values in title field
(snagel: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1650181)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
* /nutch/trunk/src/plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java
> index-more plugin, resetTitle method creates multiple values in the Title
> field
> -------------------------------------------------------------------------------
>
> Key: NUTCH-1140
> URL: https://issues.apache.org/jira/browse/NUTCH-1140
> Project: Nutch
> Issue Type: Bug
> Components: indexer
> Affects Versions: 1.3
> Reporter: Joe Liedtke
> Priority: Minor
> Fix For: 1.10
>
> Attachments: 0001-NUTCH-1140-2.x.patch, 0001-NUTCH-1140-trunk.patch,
> MoreIndexingFilter.093011.patch, NUTCH-1140-trunk-v2.patch
>
>
> According to the comments in MoreIndexingFilter.java, the index-more plugin
> is meant to reset the title field of a document if it has a
> Content-Disposition header. The current behavior is to add a title value
> whether or not one already exists, so the field can end up with multiple
> values. This can cause problems later in the Solr indexing process. Based on
> a thread on the Nutch user list, some users appear to be working around this
> by marking the title field as multi-valued in the schema:
>
> http://www.lucidimagination.com/search/document/9440ff6b5deb285b/multiple_values_encountered_for_non_multivalued_field_title#17736c5807826be8
> The following patch removes the title field before adding a new one, which
> has resolved the issue for me:
> --- MoreIndexingFilter.old 2011-09-30 11:44:35.000000000 +0000
> +++ MoreIndexingFilter.java 2011-09-30 09:58:48.000000000 +0000
> @@ -276,6 +276,7 @@
>      for (int i=0; i<patterns.length; i++) {
>        if (matcher.contains(contentDisposition,patterns[i])) {
>          result = matcher.getMatch();
> +        doc.removeField("title");
>          doc.add("title", result.group(1));
>          break;
>        }
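
The effect of the one-line change can be sketched in isolation. The example below is a minimal illustration, not the Nutch code: `SimpleDoc` is a hypothetical stand-in for a multi-valued index document, `java.util.regex` replaces the matcher used in the patch, and the `FILENAME` pattern is an assumption made only for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ResetTitleSketch {

    // Hypothetical stand-in for an index document whose fields are multi-valued.
    static class SimpleDoc {
        private final Map<String, List<String>> fields = new LinkedHashMap<>();

        // Appends a value; an existing value for the same field is kept.
        void add(String name, String value) {
            fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }

        // Drops every value stored under the given field name.
        void removeField(String name) {
            fields.remove(name);
        }

        List<String> getValues(String name) {
            return fields.getOrDefault(name, new ArrayList<>());
        }
    }

    // Illustrative pattern only; the patterns used by the plugin are not shown in the issue.
    private static final Pattern FILENAME =
            Pattern.compile("\\bfilename=[\"']?([^\"';]+)[\"']?");

    // Replaces the title with the file name from a Content-Disposition value, if present.
    static void resetTitle(SimpleDoc doc, String contentDisposition) {
        if (contentDisposition == null) {
            return;
        }
        Matcher m = FILENAME.matcher(contentDisposition);
        if (m.find()) {
            doc.removeField("title");      // without this line the field would hold two values
            doc.add("title", m.group(1));
        }
    }

    public static void main(String[] args) {
        SimpleDoc doc = new SimpleDoc();
        doc.add("title", "Title from parsed HTML");
        resetTitle(doc, "attachment; filename=\"report.pdf\"");
        System.out.println(doc.getValues("title")); // prints [report.pdf]
    }
}
```

Removing the field before adding keeps the title single-valued whether or not a title was set earlier, which is the property a non-multi-valued Solr schema field requires.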