[ 
https://issues.apache.org/jira/browse/NUTCH-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13123126#comment-13123126
 ] 

Joe Liedtke commented on NUTCH-1140:
------------------------------------

True; however, the default schema allows only one title. It seems like the 
filter should either make this behavior configurable or reset the title. 
Additionally, since the method is named resetTitle (and not addTitle, 
appendTitle, or insertYetAnotherTitle), I can only assume that the intent was to 
replace the title with a new value rather than append a second value.

The patch for #1004 should help mitigate the issue (I haven't had a chance 
to test it yet, but it makes sense that it could keep this from coming up). 
However, future plugins could trigger this bug again, so I'd recommend fixing 
it now to save future headaches. How does that sound?
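
To make the append-versus-reset distinction concrete, here is a minimal sketch. 
It assumes a stand-in document class: SketchDoc, addTitle, and the sample title 
strings are hypothetical and are not Nutch's NutchDocument or the plugin's 
actual code. Only the add/removeField call names mirror the attached patch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TitleResetSketch {

    /** Hypothetical stand-in for a multi-valued index document (not NutchDocument). */
    public static class SketchDoc {
        private final Map<String, List<String>> fields = new HashMap<>();

        /** Appends a value, keeping any values already stored under the name. */
        public void add(String name, String value) {
            fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }

        /** Drops every value stored under the name. */
        public void removeField(String name) {
            fields.remove(name);
        }

        /** Returns the stored values, or an empty list if the field is absent. */
        public List<String> values(String name) {
            return fields.getOrDefault(name, Collections.emptyList());
        }
    }

    /** Append-only behavior: a second title ends up next to the first. */
    public static void addTitle(SketchDoc doc, String title) {
        doc.add("title", title);
    }

    /** Reset behavior: remove first, then add, so exactly one title remains. */
    public static void resetTitle(SketchDoc doc, String title) {
        doc.removeField("title");
        doc.add("title", title);
    }

    public static void main(String[] args) {
        SketchDoc appended = new SketchDoc();
        appended.add("title", "Title from HTML");
        addTitle(appended, "report.pdf");
        System.out.println(appended.values("title").size()); // prints 2

        SketchDoc reset = new SketchDoc();
        reset.add("title", "Title from HTML");
        resetTitle(reset, "report.pdf");
        System.out.println(reset.values("title")); // prints [report.pdf]
    }
}
```

With a single-valued title field in the schema, only the second variant leaves 
the document in a state the indexer can accept.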
                
> index-more plugin, resetTitle method creates multiple values in the Title 
> field
> -------------------------------------------------------------------------------
>
>                 Key: NUTCH-1140
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1140
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.3
>            Reporter: Joe Liedtke
>            Priority: Minor
>         Attachments: MoreIndexingFilter.093011.patch
>
>
> From the comments in MoreIndexingFilter.java, the index-more plugin is meant 
> to reset the Title field of a document if it contains a Content-Disposition 
> header. The current behavior is to add a Title regardless of whether one 
> already exists, which can cause issues down the line with the Solr indexing 
> process. Based on a thread on the Nutch user list, this appears to be 
> causing some users to mark the title as multi-valued in the schema:
>
> http://www.lucidimagination.com/search/document/9440ff6b5deb285b/multiple_values_encountered_for_non_multivalued_field_title#17736c5807826be8
>
> The following patch removes the title field before adding a new one, which 
> has resolved the issue for me:
>
> --- MoreIndexingFilter.old    2011-09-30 11:44:35.000000000 +0000
> +++ MoreIndexingFilter.java   2011-09-30 09:58:48.000000000 +0000
> @@ -276,6 +276,7 @@
>      for (int i=0; i<patterns.length; i++) {
>        if (matcher.contains(contentDisposition,patterns[i])) {
>          result = matcher.getMatch();
> +        doc.removeField("title");
>          doc.add("title", result.group(1));
>          break;
>        }

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
