Brian created NUTCH-1896:
----------------------------

             Summary: SolrDeleteDuplicates does not use the mapped Solr field names from solrindex-mapping.xml
                 Key: NUTCH-1896
                 URL: https://issues.apache.org/jira/browse/NUTCH-1896
             Project: Nutch
          Issue Type: Bug
          Components: indexer
            Reporter: Brian


SolrDeleteDuplicates uses the hard-coded field names specified in 
SolrConstants.java to get all the fields (id, content, etc.) from Solr when 
deleting duplicates.

However, this ignores the mappings specified in solrindex-mapping.xml: these 
fields may have been mapped to other field names at index time.

E.g.:
At index time, "id" is mapped to "asset_id".
At dedup time, "id" is still used to fetch the field from Solr, which fails 
because no such field exists in Solr.

SolrDeleteDuplicates should use the same mappings defined for indexing; 
otherwise it cannot be used in any setup that renames the internal Nutch 
fields used in deduplication.
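
To illustrate the mismatch, here is a minimal, self-contained sketch. It does 
not use the Nutch classes: a plain Map stands in for the mappings read from 
solrindex-mapping.xml, and the names FieldMappingSketch, map and mapKey are 
hypothetical. It also assumes that a field without a mapping keeps its 
original name.

```java
import java.util.HashMap;
import java.util.Map;

public class FieldMappingSketch {
  // Stand-in for the source-to-destination field mappings that would be
  // read from solrindex-mapping.xml (hypothetical, not the Nutch class).
  private final Map<String, String> keyMap = new HashMap<>();

  // Registers one mapping, e.g. "id" -> "asset_id".
  public FieldMappingSketch map(String source, String dest) {
    keyMap.put(source, dest);
    return this;
  }

  // Returns the mapped Solr field name; assumes an unmapped field
  // keeps its original name.
  public String mapKey(String key) {
    return keyMap.getOrDefault(key, key);
  }

  public static void main(String[] args) {
    FieldMappingSketch mapping = new FieldMappingSketch().map("id", "asset_id");

    // The name dedup should query for once the mapping is honoured.
    System.out.println(mapping.mapKey("id"));      // prints "asset_id"
    // Unmapped fields are unchanged.
    System.out.println(mapping.mapKey("content")); // prints "content"
  }
}
```

With the mapping above, a dedup step that keeps asking Solr for "id" instead 
of mapKey("id") is requesting a field that was never indexed.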

I fixed it by instantiating the SolrMappingReader during initialization and 
storing the mapped field names in the Hadoop configuration, e.g.:

  public void setSolrFieldMappings() throws IOException {
    SolrMappingReader solrMapping = SolrMappingReader.getInstance(getConf());

    getConf().set(SolrConstants.ID_FIELD,
        solrMapping.mapKey(SolrConstants.ID_FIELD_DEFAULT));
    getConf().set(SolrConstants.BOOST_FIELD,
        solrMapping.mapKey(SolrConstants.BOOST_FIELD_DEFAULT));
    getConf().set(SolrConstants.TIMESTAMP_FIELD,
        solrMapping.mapKey(SolrConstants.TIMESTAMP_FIELD_DEFAULT));
    getConf().set(SolrConstants.TITLE_FIELD,
        solrMapping.mapKey(SolrConstants.TITLE_FIELD_DEFAULT));
    getConf().set(SolrConstants.CONTENT_FIELD,
        solrMapping.mapKey(SolrConstants.CONTENT_FIELD_DEFAULT));
  }


This is called in the dedup method, before the job is created:

  public boolean dedup(String solrUrl)
  throws IOException, InterruptedException, ClassNotFoundException {
    LOG.info("SolrDeleteDuplicates: starting...");
    LOG.info("SolrDeleteDuplicates: Solr url: " + solrUrl);

    getConf().set(SolrConstants.SERVER_URL, solrUrl);

    // Store the mapped field names in the configuration before the Job
    // is constructed from it.
    setSolrFieldMappings();

    Job job = new Job(getConf(), "solrdedup");

    job.setInputFormatClass(SolrInputFormat.class);
    job.setOutputFormatClass(NullOutputFormat.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(SolrRecord.class);
    job.setMapperClass(Mapper.class);
    job.setReducerClass(SolrDeleteDuplicates.class);

    return job.waitForCompletion(true);
  }
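
For completeness, the reading side might look roughly like the sketch below. 
This is an illustration under assumptions, not the actual patch: 
java.util.Properties stands in for the Hadoop configuration, and the key 
"example.solr.id.field", the class MappedFieldLookupSketch and the helper 
names are all hypothetical. The range query is only one plausible way to 
select every document that has the field.

```java
import java.util.Properties;

public class MappedFieldLookupSketch {
  // Hypothetical configuration key; the real value of SolrConstants.ID_FIELD
  // is not shown in this report.
  static final String ID_FIELD_KEY = "example.solr.id.field";
  // Internal default name, used when no mapping was stored.
  static final String ID_FIELD_DEFAULT = "id";

  // Reads the mapped id field name, falling back to the default.
  static String idField(Properties conf) {
    return conf.getProperty(ID_FIELD_KEY, ID_FIELD_DEFAULT);
  }

  // Builds a Solr range query matching all documents that have the id field.
  static String allDocsQuery(Properties conf) {
    return idField(conf) + ":[* TO *]";
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    System.out.println(allDocsQuery(conf)); // prints "id:[* TO *]"

    // What setSolrFieldMappings() would have stored for a renamed field.
    conf.setProperty(ID_FIELD_KEY, "asset_id");
    System.out.println(allDocsQuery(conf)); // prints "asset_id:[* TO *]"
  }
}
```

Because the default is still the internal name, setups without any renaming 
in solrindex-mapping.xml would behave as before.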

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
