I have been getting the error "multiple_values_encountered_for_non_multiValued_field_title" every once in a while when running solrindex. I can now say that it is caused by the index-more plugin (MoreIndexingFilter.java):

private NutchDocument resetTitle(NutchDocument doc, ParseData data, String url) {
  String contentDisposition = data.getMeta(Metadata.CONTENT_DISPOSITION);
  if (contentDisposition == null)
    return doc;

  for (int i = 0; i < patterns.length; i++) {
    Matcher matcher = patterns[i].matcher(contentDisposition);
    if (matcher.find()) {
      doc.add("title", matcher.group(1));
      break;
    }
  }
  return doc;
}


The problem is that, in my case, this function does not reset the title; it just adds a second one. The original idea seems to have been that if CONTENT_DISPOSITION exists, the document will not already have a title set by another plugin (namely index-basic). Unfortunately that is not always the case, as you can see by running this command:

bin/nutch indexchecker http://www.2modern.com/site/gift-registry.html

What I get (the relevant part) is:

        
tstamp :        Tue Feb 21 13:18:13 PST 2012
type :  text/html
type :  text
type :  html
date :  Tue Feb 21 13:18:13 PST 2012
url :   http://www.2modern.com/site/gift-registry.html
content : 2Modern Gift Registry Modern Furniture & Lighting items in cart 0 checkout Returning 2Modern cu
user_ranking :  25.0
title : 2Modern Gift Registry
title : gift-registry.html
plutoz_ranking :        10.0
categories :    Furniture Home
contentLength : 12924

As you can see, there are two titles. I think this would be very easy to fix: just check whether a title already exists before setting the file name as the title:

if (contentDisposition == null || doc.getField("title") != null)
  return doc;


Or, if the substitution must happen whenever CONTENT_DISPOSITION is present, at least remove the old title first:

if (matcher.find()) {
  doc.remove("title");
  doc.add("title", matcher.group(1));
  break;
}
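
To make the difference between the two options concrete, here is a rough, self-contained sketch. SimpleDoc is a hypothetical stand-in that only mimics the append-on-add behaviour; it is not Nutch's NutchDocument. The FILENAME pattern and the header value are also made up for the example and may not match what the plugin or the site actually uses.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: none of these names come from Nutch.
class TitleFixDemo {

  /** Hypothetical multi-valued document: add() appends to existing values. */
  static class SimpleDoc {
    private final Map<String, List<Object>> fields = new HashMap<>();

    void add(String name, Object value) {
      fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    void remove(String name) {
      fields.remove(name);
    }

    List<Object> get(String name) {
      return fields.getOrDefault(name, new ArrayList<>());
    }
  }

  // Assumed pattern for this sketch; the plugin's real patterns may differ.
  static final Pattern FILENAME =
      Pattern.compile("\\bfilename=\"?([^\";]+)\"?");

  /** Option 1: leave the document untouched if it already has a title. */
  static SimpleDoc keepExistingTitle(SimpleDoc doc, String contentDisposition) {
    if (contentDisposition == null || !doc.get("title").isEmpty()) {
      return doc;
    }
    Matcher m = FILENAME.matcher(contentDisposition);
    if (m.find()) {
      doc.add("title", m.group(1));
    }
    return doc;
  }

  /** Option 2: drop any existing title, then add the file name. */
  static SimpleDoc replaceTitle(SimpleDoc doc, String contentDisposition) {
    if (contentDisposition == null) {
      return doc;
    }
    Matcher m = FILENAME.matcher(contentDisposition);
    if (m.find()) {
      doc.remove("title");
      doc.add("title", m.group(1));
    }
    return doc;
  }

  public static void main(String[] args) {
    // Made-up header value for the example.
    String header = "inline; filename=\"gift-registry.html\"";

    SimpleDoc a = new SimpleDoc();
    a.add("title", "2Modern Gift Registry");
    System.out.println(keepExistingTitle(a, header).get("title"));

    SimpleDoc b = new SimpleDoc();
    b.add("title", "2Modern Gift Registry");
    System.out.println(replaceTitle(b, header).get("title"));
  }
}
```

Either way the document ends up with exactly one title: option 1 keeps the title from the page, option 2 keeps the file name.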


That being said, the real question here is: why doesn't NutchDocument observe the schema.xml file, and why does it always assume that every field is multi-valued?

public void add(String name, Object value) {
  NutchField field = fields.get(name);
  if (field == null) {
    field = new NutchField(value);
    fields.put(name, field);
  } else {
    field.add(value);  // <-- a second value is always appended
  }
}
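
As a rough sketch of what I mean, a document class could be told which fields are single-valued and replace the value instead of appending to it. Everything below is hypothetical: SchemaAwareDocSketch is not a Nutch class, and the set of single-valued names is hard-coded here, whereas a real version would have to get it from schema.xml somehow.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; not part of Nutch.
class SchemaAwareDocSketch {
  private final Map<String, List<Object>> fields = new HashMap<>();
  private final Set<String> singleValued;

  // singleValued: field names that may hold at most one value (assumed input).
  SchemaAwareDocSketch(Set<String> singleValued) {
    this.singleValued = singleValued;
  }

  void add(String name, Object value) {
    List<Object> values = fields.computeIfAbsent(name, k -> new ArrayList<>());
    if (singleValued.contains(name)) {
      values.clear();  // the last value wins for single-valued fields
    }
    values.add(value);
  }

  List<Object> get(String name) {
    return fields.getOrDefault(name, List.of());
  }

  public static void main(String[] args) {
    SchemaAwareDocSketch doc = new SchemaAwareDocSketch(Set.of("title"));
    doc.add("title", "2Modern Gift Registry");
    doc.add("title", "gift-registry.html");
    doc.add("type", "text/html");
    doc.add("type", "text");
    System.out.println(doc.get("title"));  // one value only
    System.out.println(doc.get("type"));   // both values kept
  }
}
```

Whether the last or the first value should win is a separate design choice; the sketch picks "last wins" only to keep it short.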

--
Kaveh Minooie

www.plutoz.com
