Lajos Moczar created NUTCH-1734:
-----------------------------------

             Summary: Make SolrIndexWriter more intelligent
                 Key: NUTCH-1734
                 URL: https://issues.apache.org/jira/browse/NUTCH-1734
             Project: Nutch
          Issue Type: Improvement
    Affects Versions: 2.2.1, 1.7
            Reporter: Lajos Moczar
            Priority: Minor


The current mapping of the NutchDocument to the SolrDocument is driven by the 
fields in the former, which can cause problems when you are using an existing 
Solr schema:

1) the existing logic requires Solr to support all Nutch fields, which might 
not be the case (segment, for example).

2) you can map a Nutch field to at most 2 Solr fields (i.e. one via a <field> 
tag and one via a <copyField> tag) because the source attribute is the Map key 
and therefore you can only have one entry of each kind per source.
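
The limitation in point 2 follows from how a map keyed by source behaves. A 
minimal sketch, assuming plain java.util collections rather than the actual 
Nutch mapping classes (the class name, method name and field names below are 
hypothetical and only for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class SourceKeyedSketch {
    // Builds a source-keyed map from {source, dest} pairs, as a stand-in for
    // mapping entries held in a Map whose key is the source attribute.
    static Map<String, String> keyBySource(String[][] pairs) {
        Map<String, String> bySource = new HashMap<>();
        for (String[] p : pairs) {
            // A later entry with the same source replaces the earlier one.
            bySource.put(p[0], p[1]);
        }
        return bySource;
    }

    public static void main(String[] args) {
        String[][] pairs = {
            {"content", "text"},
            {"content", "body"}, // second destination for the same source
        };
        System.out.println(keyBySource(pairs));
    }
}
```

Only the last destination registered for "content" survives, which is why a 
second mapping of the same kind for one source is not possible today.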

Additionally, it would be nice to support some level of transformations, 
literals, etc., similar to what the Solr DataImportHandler (DIH) offers.

I propose to make the code more intelligent so that, while still supporting the 
existing "strict" mapping that people are used to, it allows more flexible and 
intelligent mapping. It will also include a transformation architecture that 
can be expanded over time.

The general approach is to reverse the building of the SolrDocument, and 
populate the doc based on the Solr destination fields as defined in 
solrindex-mapping.xml, i.e., it populates the doc based on what the target Solr 
wants to receive, not just what Nutch wants to send. The Map of fields in 
solrindex-mapping.xml will be keyed by dest, i.e. the Solr field name, not 
source. That way one can map a source to multiple destinations. A mapping type 
attribute (defaults to just a simple copy from Nutch to Solr) will support 
literals and transformations.
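
The destination-keyed approach could be sketched roughly as follows. This is 
not the proposed patch: the class name, the Rule type, and the mapping types 
COPY, LITERAL and LOWERCASE are all hypothetical, and plain Maps stand in for 
NutchDocument and SolrDocument. It only illustrates populating the outgoing 
document by iterating over Solr destination fields.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class DestKeyedSketch {
    // Hypothetical mapping types; the issue only says the type defaults to a
    // simple copy and will also support literals and transformations.
    enum Type { COPY, LITERAL, LOWERCASE }

    // One rule per Solr destination field.
    static final class Rule {
        final Type type;
        final String value; // Nutch source field name, or the text for LITERAL
        Rule(Type type, String value) { this.type = type; this.value = value; }
    }

    // Builds the outgoing fields by walking the destinations, so a Nutch
    // field with no rule (e.g. segment) is simply never sent.
    static Map<String, Object> build(Map<String, Rule> rulesByDest,
                                     Map<String, Object> nutchDoc) {
        Map<String, Object> solrDoc = new LinkedHashMap<>();
        for (Map.Entry<String, Rule> e : rulesByDest.entrySet()) {
            Rule r = e.getValue();
            Object v;
            if (r.type == Type.LITERAL) {
                v = r.value;
            } else {
                v = nutchDoc.get(r.value);
                if (v != null && r.type == Type.LOWERCASE) {
                    v = v.toString().toLowerCase(Locale.ROOT);
                }
            }
            if (v != null) {
                solrDoc.put(e.getKey(), v); // skip destinations with no value
            }
        }
        return solrDoc;
    }

    public static void main(String[] args) {
        Map<String, Rule> rules = new LinkedHashMap<>();
        rules.put("title", new Rule(Type.COPY, "title"));
        rules.put("title_lc", new Rule(Type.LOWERCASE, "title")); // same source again
        rules.put("source_system", new Rule(Type.LITERAL, "nutch"));

        Map<String, Object> nutchDoc = new LinkedHashMap<>();
        nutchDoc.put("title", "Apache Nutch");
        nutchDoc.put("segment", "20140301"); // no rule, so it is not sent

        System.out.println(build(rules, nutchDoc));
    }
}
```

Because the key is the destination, the same source ("title" here) can feed 
any number of Solr fields, and each destination can carry its own type.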

Note that a default "strict" mapping (i.e. the Solr schema by default MUST 
support all NutchDocument fields) will be supported for backwards 
compatibility. I assume this will be what people want.

I will submit patches in due course.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
