[
https://issues.apache.org/jira/browse/NUTCH-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Brian updated NUTCH-1896:
-------------------------
Description:
SolrDeleteDuplicates uses the hard-coded field names defined in
SolrConstants.java to fetch the fields (id, content, etc.) from Solr when
deleting duplicates.
However, this ignores the mappings specified in solrindex-mapping.xml: those
fields may have been mapped to other field names at index time.
For example:
At index time, "id" is mapped to "asset_id".
At dedup time, "id" is used to fetch the field from Solr, which fails because
no such field exists in the Solr schema.
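For illustration, such a rename would come from a mapping entry along these lines in solrindex-mapping.xml (a sketch only; the surrounding elements follow the stock mapping file shipped with Nutch, and "asset_id" is the hypothetical destination field from the example above):
{code:xml|borderStyle=solid}
<mapping>
  <fields>
    <!-- Rename the internal Nutch "id" field to "asset_id" in Solr -->
    <field dest="asset_id" source="id"/>
  </fields>
</mapping>
{code}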
SolrDeleteDuplicates should use the same mappings that are applied during
indexing; otherwise it cannot be used with any setup that renames the internal
Nutch fields involved in deduplication.
The way I fixed it was to instantiate the SolrMappingReader during
initialization and store the mapped field names in the Hadoop configuration,
e.g.:
{code:java|borderStyle=solid}
public void setSolrFieldMappings() throws IOException {
  // Resolve each field name through the index-time mapping so that dedup
  // queries use the same Solr field names as the indexer did
  SolrMappingReader solrMapping = SolrMappingReader.getInstance(getConf());
  getConf().set(SolrConstants.ID_FIELD,
      solrMapping.mapKey(SolrConstants.ID_FIELD_DEFAULT));
  getConf().set(SolrConstants.BOOST_FIELD,
      solrMapping.mapKey(SolrConstants.BOOST_FIELD_DEFAULT));
  getConf().set(SolrConstants.TIMESTAMP_FIELD,
      solrMapping.mapKey(SolrConstants.TIMESTAMP_FIELD_DEFAULT));
  getConf().set(SolrConstants.TITLE_FIELD,
      solrMapping.mapKey(SolrConstants.TITLE_FIELD_DEFAULT));
  getConf().set(SolrConstants.CONTENT_FIELD,
      solrMapping.mapKey(SolrConstants.CONTENT_FIELD_DEFAULT));
}
{code}
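To make the intended lookup behavior concrete, here is a minimal, self-contained sketch (a hypothetical stand-in for SolrMappingReader.mapKey, not the actual Nutch API): return the mapped Solr field name when the mapping file defines one, and fall back to the internal Nutch default otherwise.
{code:java|borderStyle=solid}
import java.util.HashMap;
import java.util.Map;

public class MapKeySketch {
  // Hypothetical stand-in for SolrMappingReader.mapKey: returns the mapped
  // Solr field name if one is configured, otherwise the Nutch default name.
  static String mapKey(Map<String, String> keyMap, String defaultName) {
    return keyMap.getOrDefault(defaultName, defaultName);
  }

  public static void main(String[] args) {
    Map<String, String> keyMap = new HashMap<>();
    keyMap.put("id", "asset_id"); // as in the solrindex-mapping.xml example

    System.out.println(mapKey(keyMap, "id"));      // mapped: asset_id
    System.out.println(mapKey(keyMap, "content")); // unmapped: content
  }
}
{code}
With this fallback in place, a setup that maps nothing simply keeps the default field names, so the change is backwards compatible.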
This method is called from dedup:
{code:java|borderStyle=solid}
public boolean dedup(String solrUrl)
    throws IOException, InterruptedException, ClassNotFoundException {
  LOG.info("SolrDeleteDuplicates: starting...");
  LOG.info("SolrDeleteDuplicates: Solr url: " + solrUrl);
  getConf().set(SolrConstants.SERVER_URL, solrUrl);
  // Store the mapped field names in the configuration before the job is built
  setSolrFieldMappings();
  Job job = new Job(getConf(), "solrdedup");
  job.setInputFormatClass(SolrInputFormat.class);
  job.setOutputFormatClass(NullOutputFormat.class);
  job.setMapOutputKeyClass(Text.class);
  job.setMapOutputValueClass(SolrRecord.class);
  job.setMapperClass(Mapper.class);
  job.setReducerClass(SolrDeleteDuplicates.class);
  return job.waitForCompletion(true);
}
{code}
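The point of writing the mapped names into the configuration is that downstream job code (the input format building Solr queries, the reducer issuing deletes) reads them back instead of using the hard-coded defaults. A self-contained sketch of that round-trip, using a plain Map as a hypothetical stand-in for the Hadoop Configuration (the "solr.id.field" key and its value are illustrative, not the actual SolrConstants values):
{code:java|borderStyle=solid}
import java.util.HashMap;
import java.util.Map;

public class ConfRoundTrip {
  public static void main(String[] args) {
    // Stand-in for the Hadoop Configuration object shared with the job
    Map<String, String> conf = new HashMap<>();

    // setSolrFieldMappings(): store the mapped name under the conf key
    conf.put("solr.id.field", "asset_id"); // hypothetical key and value

    // Downstream (input format / reducer): read the effective field name,
    // falling back to the internal Nutch default when nothing was stored
    String idField = conf.getOrDefault("solr.id.field", "id");
    System.out.println(idField); // asset_id
  }
}
{code}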
> SolrDeleteDuplicates does not use the mapped Solr field names from
> solrindex-mapping.xml
> ----------------------------------------------------------------------------------------
>
> Key: NUTCH-1896
> URL: https://issues.apache.org/jira/browse/NUTCH-1896
> Project: Nutch
> Issue Type: Bug
> Components: indexer
> Reporter: Brian
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)