[jira] Issue Comment Edited: (NUTCH-684) Dedup support for Solr

Dmitry Lihachev (JIRA) Fri, 20 Feb 2009 02:11:30 -0800

    [ https://issues.apache.org/jira/browse/NUTCH-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675311#action_12675311 ]


dmitry.lihachev edited comment on NUTCH-684 at 2/20/09 2:10 AM:
----------------------------------------------------------------

bq. there is a silent assumption that Solr schema uses "id" field as unique 
key, and that this field contains the URL of the document. First, shouldn't 
this be "url" field? Because as far as I can see the field name "id" is not 
used anywhere in SolrIndexer/SolrWriter - please correct me if I missed 
something. At least this assumption should be spelled out in javadocs, both on 
the indexing side and on the dedup side. (Actually, we should have added an 
example of the minimum required Solr schema when the original Nutch/Solr 
integration was committed)

"id" field defined in schema.xml (NUTCH-442)

      was (Author: dmitry.lihachev):
    bq. there is a silent assumption that Solr schema uses "id" field as unique 
key, and that this field contains the URL of the document. First, shouldn't 
this be "url" field? Because as far as I can see the field name "id" is not 
used anywhere in SolrIndexer/SolrWriter - please correct me if I missed 
something. At least this assumption should be spelled out in javadocs, both on 
the indexing side and on the dedup side. (Actually, we should have added an 
example of the minimum required Solr schema when the original Nutch/Solr 
integration was committed)

"id" field defined in schema.xml (NUTCH-422)
  
> Dedup support for Solr
> ----------------------
>
>                 Key: NUTCH-684
>                 URL: https://issues.apache.org/jira/browse/NUTCH-684
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>            Reporter: Doğacan Güney
>            Assignee: Doğacan Güney
>         Attachments: NUTCH-684_bin_nutch.patch, NUTCH-684_solrdedup_v2.patch, 
> solrdedup.patch
>
>
> After NUTCH-442, Nutch can now index to both Solr and Lucene. However, the 
> duplicate deletion feature (based on digests) is only available for Lucene. It 
> should also be available for Solr.
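
As a rough illustration of the digest-based duplicate deletion described above, here is a minimal sketch. It is not the code from the attached patches. The `Doc` type, its field names, and the rule for choosing which document survives (highest boost, ties broken by smallest id) are all assumptions made for the example; the only things taken from the discussion are that "id" is the unique key, that it is assumed to hold the URL, and that duplicates are detected by digest.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Doc(NamedTuple):
    """Hypothetical view of an indexed document; field names are assumptions."""
    id: str       # unique key, assumed here to hold the document URL
    digest: str   # content hash of the fetched page
    boost: float  # score used only to pick which duplicate survives


def find_duplicates(docs: Iterable[Doc]) -> List[str]:
    """Return the ids to delete so that one document survives per digest."""
    by_digest: Dict[str, List[Doc]] = defaultdict(list)
    for doc in docs:
        by_digest[doc.digest].append(doc)

    to_delete: List[str] = []
    for group in by_digest.values():
        # Assumed rule: keep the highest boost, break ties by smallest id.
        keeper = min(group, key=lambda d: (-d.boost, d.id))
        to_delete.extend(d.id for d in group if d.id != keeper.id)
    return sorted(to_delete)
```

The returned ids would then be passed to whatever delete-by-id call the index offers, which is why the sketch depends on "id" being the unique key.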

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
