Lajos Moczar created NUTCH-1734:
-----------------------------------
Summary: Make SolrIndexWriter more intelligent
Key: NUTCH-1734
URL: https://issues.apache.org/jira/browse/NUTCH-1734
Project: Nutch
Issue Type: Improvement
Affects Versions: 2.2.1, 1.7
Reporter: Lajos Moczar
Priority: Minor
The current mapping of the NutchDocument to the SolrDocument is driven by the
fields in the former, which can cause problems when you are using an existing
Solr schema:
1) The existing logic requires the Solr schema to support all Nutch fields,
which might not be the case (segment, for example).
2) You can map a Nutch field to at most 2 Solr fields (i.e. one via a <field>
tag and one via a <copy> tag), because the source attribute is the Map key and
therefore you can only have one entry per source in each Map.
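To make the two limitations concrete, here is a minimal sketch of a source-keyed mapping. It uses plain Java Maps rather than the real Nutch or Solr classes; the class name, method names, and field names such as "body" and "content_copy" are made up for illustration and are not taken from the Nutch code.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified illustration only; NOT the actual SolrIndexWriter code.
class SourceKeyedMappingSketch {

    /** Builds the outgoing document by walking the Nutch fields (source-driven). */
    static Map<String, Object> build(Map<String, Object> nutchDoc,
                                     Map<String, String> fieldMap,
                                     Map<String, String> copyMap) {
        Map<String, Object> solrDoc = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : nutchDoc.entrySet()) {
            String source = e.getKey();
            // Every Nutch field is sent; an unmapped field keeps its own name,
            // so the Solr schema has to know about it (limitation 1).
            solrDoc.put(fieldMap.getOrDefault(source, source), e.getValue());
            // At most one extra destination per source (limitation 2).
            String copyDest = copyMap.get(source);
            if (copyDest != null) {
                solrDoc.put(copyDest, e.getValue());
            }
        }
        return solrDoc;
    }

    /** Hypothetical sample data showing both limitations. */
    static Map<String, Object> demo() {
        Map<String, String> fieldMap = new HashMap<>();
        fieldMap.put("content", "body");
        fieldMap.put("content", "text"); // same key: silently replaces "body"

        Map<String, String> copyMap = new HashMap<>();
        copyMap.put("content", "content_copy");

        Map<String, Object> nutchDoc = new LinkedHashMap<>();
        nutchDoc.put("content", "some page text");
        nutchDoc.put("segment", "20140301");

        return build(nutchDoc, fieldMap, copyMap);
    }
}
```

In this sketch the second put for "content" overwrites the first, so "body" is never populated, and "segment" is sent to Solr even though nothing maps it.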
Additionally, it would be nice to support some level of transformations,
literals, etc., like those used in the Solr DataImportHandler (DIH).
I propose to make the code more intelligent so that, while still supporting the
existing "strict" mapping that people are used to, it allows more flexible
mapping. It will also include a transformation architecture that can be
expanded over time.
The general approach is to reverse the building of the SolrDocument, and
populate the doc based on the Solr destination fields as defined in
solrindex-mapping.xml, i.e., it populates the doc based on what the target Solr
wants to receive, not just what Nutch wants to send. The Map of fields in
solrindex-mapping.xml will be keyed by dest, i.e. the Solr field name, not
source. That way one can map a source to multiple destinations. A mapping type
attribute (defaults to just a simple copy from Nutch to Solr) will support
literals and transformations.
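As a rough sketch of the reversed approach, the following builds the outgoing document by iterating over destination fields instead of source fields. It is only an illustration under assumptions: the Rule class, the Type enum, the LOWERCASE transformation, and all field names are hypothetical, and plain Maps stand in for NutchDocument and SolrDocument.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Illustration of a dest-keyed mapping; all names here are hypothetical.
class DestKeyedMappingSketch {

    enum Type { COPY, LITERAL, LOWERCASE }

    /** One mapping entry: for COPY/LOWERCASE, arg is the source field;
     *  for LITERAL, arg is the literal value itself. */
    static final class Rule {
        final Type type;
        final String arg;

        Rule(Type type, String arg) {
            this.type = type;
            this.arg = arg;
        }
    }

    /** Populates the document from what Solr wants to receive (dest-driven). */
    static Map<String, Object> build(Map<String, Object> nutchDoc,
                                     Map<String, Rule> rulesByDest) {
        Map<String, Object> solrDoc = new LinkedHashMap<>();
        for (Map.Entry<String, Rule> e : rulesByDest.entrySet()) {
            Rule rule = e.getValue();
            Object value;
            switch (rule.type) {
                case LITERAL:
                    value = rule.arg;
                    break;
                case LOWERCASE: {
                    Object raw = nutchDoc.get(rule.arg);
                    value = (raw == null) ? null : raw.toString().toLowerCase(Locale.ROOT);
                    break;
                }
                default: // COPY: plain copy from Nutch to Solr
                    value = nutchDoc.get(rule.arg);
            }
            if (value != null) {
                solrDoc.put(e.getKey(), value);
            }
        }
        return solrDoc;
    }

    /** Hypothetical sample: one source feeds two destinations, plus a literal. */
    static Map<String, Object> demo() {
        Map<String, Object> nutchDoc = new LinkedHashMap<>();
        nutchDoc.put("title", "Hello World");
        nutchDoc.put("segment", "20140301"); // not mapped, so never sent

        Map<String, Rule> rulesByDest = new LinkedHashMap<>();
        rulesByDest.put("title", new Rule(Type.COPY, "title"));
        rulesByDest.put("title_sort", new Rule(Type.LOWERCASE, "title"));
        rulesByDest.put("origin", new Rule(Type.LITERAL, "nutch"));

        return build(nutchDoc, rulesByDest);
    }
}
```

Because the Map is keyed by destination, the same source ("title") can appear in any number of rules, and a Nutch field with no rule (here "segment") is simply left out.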
Note that a default "strict" mapping (i.e. the Solr schema by default MUST
support all NutchDocument fields) will be supported for backwards
compatibility. I assume this will be what people want.
I will submit patches in due course.