[ 
https://issues.apache.org/jira/browse/NUTCH-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joe Liedtke updated NUTCH-1140:
-------------------------------

    Description: 
From the comments in MoreIndexingFilter.java, the index-more plugin is meant 
to reset the Title field of a document if it contains a Content-Disposition 
header. The current behavior is to add a Title regardless of whether one 
exists or not, which can cause issues down the line with the Solr Indexing 
process, and based on a thread in the nutch user list it appears that this is 
causing some users to mark the title as multi-valued in the schema:
  
http://www.lucidimagination.com/search/document/9440ff6b5deb285b/multiple_values_encountered_for_non_multivalued_field_title#17736c5807826be8

The following patch removes the title field before adding a new one, which has 
resolved the issue for me:


--- MoreIndexingFilter.old      2011-09-30 11:44:35.000000000 +0000
+++ MoreIndexingFilter.java     2011-09-30 09:58:48.000000000 +0000
@@ -276,6 +276,7 @@
     for (int i=0; i<patterns.length; i++) {
       if (matcher.contains(contentDisposition,patterns[i])) {
         result = matcher.getMatch();
+        doc.removeField("title");
         doc.add("title", result.group(1));
         break;
       }
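
To illustrate why adding without removing first yields two title values, here 
is a minimal, self-contained sketch. SimpleDoc is a hypothetical stand-in for a 
multi-valued index document, written only for this example; it is not the 
Nutch document class, and the method names addTitle and resetTitle are 
likewise assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TitleResetSketch {

    /** Hypothetical stand-in for a multi-valued index document (not the Nutch class). */
    static class SimpleDoc {
        private final Map<String, List<String>> fields = new HashMap<>();

        /** Appends a value to the named field, keeping any existing values. */
        void add(String name, String value) {
            fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }

        /** Drops the named field and all of its values. */
        void removeField(String name) {
            fields.remove(name);
        }

        /** Returns the values stored for the field, or an empty list if none. */
        List<String> getValues(String name) {
            return fields.getOrDefault(name, new ArrayList<>());
        }
    }

    /** Behavior described as current: the new title is appended to the old one. */
    static void addTitle(SimpleDoc doc, String title) {
        doc.add("title", title);
    }

    /** Behavior after the patch: the old title is removed, then the new one added. */
    static void resetTitle(SimpleDoc doc, String title) {
        doc.removeField("title");
        doc.add("title", title);
    }

    public static void main(String[] args) {
        SimpleDoc unpatched = new SimpleDoc();
        unpatched.add("title", "Title from the parsed page");
        addTitle(unpatched, "report.pdf");
        System.out.println(unpatched.getValues("title").size()); // prints 2

        SimpleDoc patched = new SimpleDoc();
        patched.add("title", "Title from the parsed page");
        resetTitle(patched, "report.pdf");
        System.out.println(patched.getValues("title").size()); // prints 1
    }
}
```

With a single-valued title field in the Solr schema, only the second document 
would be expected to index without the multiple-values error.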




  was:
From the comments in MoreIndexingFilter.java, the index-more plugin is meant 
to reset the Title field of a document if it contains a Content-Disposition 
header. The current behavior is to add a Title regardless of whether one 
exists or not, which can cause issues down the line with the Solr Indexing 
process [and based on messages in the nutch user list it appears that this is 
causing some users to mark the title as multi-valued in the schema -- 
http://www.lucidimagination.com/search/document/9440ff6b5deb285b/multiple_values_encountered_for_non_multivalued_field_title#17736c5807826be8].

The following patch removes the title field before adding a new one, which has 
resolved the issue for me:


--- MoreIndexingFilter.old      2011-09-30 11:44:35.000000000 +0000
+++ MoreIndexingFilter.java     2011-09-30 09:58:48.000000000 +0000
@@ -276,6 +276,7 @@
     for (int i=0; i<patterns.length; i++) {
       if (matcher.contains(contentDisposition,patterns[i])) {
         result = matcher.getMatch();
+        doc.removeField("title");
         doc.add("title", result.group(1));
         break;
       }




    
> index-more plugin, resetTitle method creates multiple values in the Title 
> field
> -------------------------------------------------------------------------------
>
>                 Key: NUTCH-1140
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1140
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.3
>            Reporter: Joe Liedtke
>            Priority: Minor
>         Attachments: MoreIndexingFilter.093011.patch
>
>
> From the comments in MoreIndexingFilter.java, the index-more plugin is meant 
> to reset the Title field of a document if it contains a Content-Disposition 
> header. The current behavior is to add a Title regardless of whether one 
> exists or not, which can cause issues down the line with the Solr Indexing 
> process, and based on a thread in the nutch user list it appears that this is 
> causing some users to mark the title as multi-valued in the schema:
>   
> http://www.lucidimagination.com/search/document/9440ff6b5deb285b/multiple_values_encountered_for_non_multivalued_field_title#17736c5807826be8
> The following patch removes the title field before adding a new one, which 
> has resolved the issue for me:
> --- MoreIndexingFilter.old    2011-09-30 11:44:35.000000000 +0000
> +++ MoreIndexingFilter.java   2011-09-30 09:58:48.000000000 +0000
> @@ -276,6 +276,7 @@
>      for (int i=0; i<patterns.length; i++) {
>        if (matcher.contains(contentDisposition,patterns[i])) {
>          result = matcher.getMatch();
> +        doc.removeField("title");
>          doc.add("title", result.group(1));
>          break;
>        }

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
