[
https://issues.apache.org/jira/browse/NUTCH-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Brian updated NUTCH-1896:
-------------------------
Description:
SolrDeleteDuplicates uses the hard-coded field names defined in
SolrConstants.java to fetch the fields (id, content, etc.) from Solr when
deleting duplicates.
However, this ignores the mappings specified in solrindex-mapping.xml: those
fields may have been mapped to other field names at index time.
For example:
At index time, "id" is mapped to "asset_id".
At dedup time, "id" is used to fetch the field from Solr, which fails because
no such field exists in the Solr schema.
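For illustration, such a rename would come from a mapping entry along these lines in solrindex-mapping.xml (a sketch only; the surrounding elements follow the stock mapping file shipped with Nutch, and "asset_id" is the hypothetical destination field from the example above):
{code:xml|borderStyle=solid}
<mapping>
  <fields>
    <!-- Rename the internal Nutch "id" field to "asset_id" in Solr -->
    <field dest="asset_id" source="id"/>
  </fields>
</mapping>
{code}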
SolrDeleteDuplicates should use the same mappings that are applied during
indexing; otherwise it cannot be used with any setup that renames the internal
Nutch fields involved in deduplication.
The way I fixed it was to instantiate the SolrMappingReader during
initialization and store the mapped field names in the Hadoop configuration,
e.g.:
{code:java|borderStyle=solid}
public void setSolrFieldMappings() throws IOException {
  // Resolve each field name through the index-time mapping so that dedup
  // queries use the same Solr field names as the indexer did
  SolrMappingReader solrMapping = SolrMappingReader.getInstance(getConf());
  getConf().set(SolrConstants.ID_FIELD,
      solrMapping.mapKey(SolrConstants.ID_FIELD_DEFAULT));
  getConf().set(SolrConstants.BOOST_FIELD,
      solrMapping.mapKey(SolrConstants.BOOST_FIELD_DEFAULT));
  getConf().set(SolrConstants.TIMESTAMP_FIELD,
      solrMapping.mapKey(SolrConstants.TIMESTAMP_FIELD_DEFAULT));
  getConf().set(SolrConstants.TITLE_FIELD,
      solrMapping.mapKey(SolrConstants.TITLE_FIELD_DEFAULT));
  getConf().set(SolrConstants.CONTENT_FIELD,
      solrMapping.mapKey(SolrConstants.CONTENT_FIELD_DEFAULT));
}
{code}
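To make the intended lookup behavior concrete, here is a minimal, self-contained sketch (a hypothetical stand-in for SolrMappingReader.mapKey, not the actual Nutch API): return the mapped Solr field name when the mapping file defines one, and fall back to the internal Nutch default otherwise.
{code:java|borderStyle=solid}
import java.util.HashMap;
import java.util.Map;

public class MapKeySketch {
  // Hypothetical stand-in for SolrMappingReader.mapKey: returns the mapped
  // Solr field name if one is configured, otherwise the Nutch default name.
  static String mapKey(Map<String, String> keyMap, String defaultName) {
    return keyMap.getOrDefault(defaultName, defaultName);
  }

  public static void main(String[] args) {
    Map<String, String> keyMap = new HashMap<>();
    keyMap.put("id", "asset_id"); // as in the solrindex-mapping.xml example

    System.out.println(mapKey(keyMap, "id"));      // mapped: asset_id
    System.out.println(mapKey(keyMap, "content")); // unmapped: content
  }
}
{code}
With this fallback in place, a setup that maps nothing simply keeps the default field names, so the change is backwards compatible.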
This method is called from dedup:
{code:java|borderStyle=solid}
public boolean dedup(String solrUrl)
    throws IOException, InterruptedException, ClassNotFoundException {
  LOG.info("SolrDeleteDuplicates: starting...");
  LOG.info("SolrDeleteDuplicates: Solr url: " + solrUrl);
  getConf().set(SolrConstants.SERVER_URL, solrUrl);
  // Store the mapped field names in the configuration before the job is built
  setSolrFieldMappings();
  Job job = new Job(getConf(), "solrdedup");
  job.setInputFormatClass(SolrInputFormat.class);
  job.setOutputFormatClass(NullOutputFormat.class);
  job.setMapOutputKeyClass(Text.class);
  job.setMapOutputValueClass(SolrRecord.class);
  job.setMapperClass(Mapper.class);
  job.setReducerClass(SolrDeleteDuplicates.class);
  return job.waitForCompletion(true);
}
{code}
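The point of writing the mapped names into the configuration is that downstream job code (the input format building Solr queries, the reducer issuing deletes) reads them back instead of using the hard-coded defaults. A self-contained sketch of that round-trip, using a plain Map as a hypothetical stand-in for the Hadoop Configuration (the "solr.id.field" key and its value are illustrative, not the actual SolrConstants values):
{code:java|borderStyle=solid}
import java.util.HashMap;
import java.util.Map;

public class ConfRoundTrip {
  public static void main(String[] args) {
    // Stand-in for the Hadoop Configuration object shared with the job
    Map<String, String> conf = new HashMap<>();

    // setSolrFieldMappings(): store the mapped name under the conf key
    conf.put("solr.id.field", "asset_id"); // hypothetical key and value

    // Downstream (input format / reducer): read the effective field name,
    // falling back to the internal Nutch default when nothing was stored
    String idField = conf.getOrDefault("solr.id.field", "id");
    System.out.println(idField); // asset_id
  }
}
{code}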
> SolrDeleteDuplicates does not use the mapped Solr field names from
> solrindex-mapping.xml
> ----------------------------------------------------------------------------------------
>
> Key: NUTCH-1896
> URL: https://issues.apache.org/jira/browse/NUTCH-1896
> Project: Nutch
> Issue Type: Bug
> Components: indexer
> Reporter: Brian
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)