Lajos Moczar created NUTCH-1734:
-----------------------------------

             Summary: Make SolrIndexWriter more intelligent
                 Key: NUTCH-1734
                 URL: https://issues.apache.org/jira/browse/NUTCH-1734
             Project: Nutch
          Issue Type: Improvement
    Affects Versions: 2.2.1, 1.7
            Reporter: Lajos Moczar
            Priority: Minor

The current mapping of a NutchDocument to a SolrDocument is driven by the fields of the former, which can cause problems when you are indexing into an existing Solr schema:

1) The existing logic requires Solr to support all Nutch fields, which might not be the case (segment, for example).

2) You can map a Nutch field to at most 2 Solr fields (i.e. one via a <field> tag and one via a <copyField> tag), because the source attribute is the Map key and therefore each source can have only one entry.

Additionally, it would be nice to support some level of transformations, literals, etc., like those used in the Solr DIH.

I propose to make the code more intelligent so that, while still supporting the existing "strict" mapping that people are used to, it allows more flexible mapping. It will also include a transformation architecture that can be expanded over time.

The general approach is to reverse the building of the SolrDocument and populate the doc based on the Solr destination fields as defined in solrindex-mapping.xml, i.e., the doc is populated based on what the target Solr wants to receive, not just what Nutch wants to send.

The Map of fields in solrindex-mapping.xml will be keyed by dest, i.e. the Solr field name, not source. That way one can map a source to multiple destinations. A mapping type attribute (which defaults to a simple copy from Nutch to Solr) will support literals and transformations.

Note that a default "strict" mapping (i.e. the Solr schema MUST support all NutchDocument fields) will be supported for backwards compatibility. I assume this will be what people want.

I will submit patches in due course.
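
As a rough illustration of the dest-keyed idea (not the actual patch), a sketch might look like the following. All names here (DestKeyedMapping, Rule, Type, build) are hypothetical, and plain Maps stand in for NutchDocument and SolrDocument so the example has no Nutch or SolrJ dependency. Only the "copy" and "literal" mapping types are shown; transformations are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names come from Nutch or SolrJ.
class DestKeyedMapping {

    // Assumed mapping types; the proposal mentions copy (default) and literals.
    enum Type { COPY, LITERAL }

    // One rule per Solr destination field. For COPY, value is the Nutch
    // source field name; for LITERAL, value is the constant to index.
    static final class Rule {
        final Type type;
        final String value;

        Rule(Type type, String value) {
            this.type = type;
            this.value = value;
        }
    }

    // Builds the outgoing document by walking the destination-keyed rules.
    // Nutch fields that no rule refers to (e.g. segment) are never sent.
    static Map<String, List<Object>> build(Map<String, Rule> rulesByDest,
                                           Map<String, List<Object>> nutchDoc) {
        Map<String, List<Object>> solrDoc = new LinkedHashMap<>();
        for (Map.Entry<String, Rule> entry : rulesByDest.entrySet()) {
            String dest = entry.getKey();
            Rule rule = entry.getValue();
            if (rule.type == Type.LITERAL) {
                List<Object> literal = new ArrayList<>();
                literal.add(rule.value);
                solrDoc.put(dest, literal);
            } else {
                List<Object> values = nutchDoc.get(rule.value);
                // Skip destinations whose source is absent from this document.
                if (values != null && !values.isEmpty()) {
                    solrDoc.put(dest, new ArrayList<>(values));
                }
            }
        }
        return solrDoc;
    }

    public static void main(String[] args) {
        Map<String, Rule> rules = new LinkedHashMap<>();
        // Keyed by dest, so one source can feed any number of Solr fields.
        rules.put("content", new Rule(Type.COPY, "content"));
        rules.put("text", new Rule(Type.COPY, "content"));
        rules.put("body_txt", new Rule(Type.COPY, "content"));
        rules.put("origin", new Rule(Type.LITERAL, "nutch"));

        Map<String, List<Object>> nutchDoc = new LinkedHashMap<>();
        nutchDoc.put("content", Arrays.<Object>asList("hello"));
        nutchDoc.put("segment", Arrays.<Object>asList("20140301"));

        System.out.println(build(rules, nutchDoc));
    }
}
```

In this sketch the content source feeds three destinations, which a source-keyed Map cannot express, and segment is dropped because no destination asks for it. A "strict" mode could be layered on top by generating one COPY rule per NutchDocument field when no explicit rules are given.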