Re: Proposal for SolrIndexWriter

Lajos Tue, 14 Jan 2014 14:09:08 -0800

I realise I should have made myself clearer on one point.

I understand that the current design comes from a Nutch-centric paradigm, in which Solr is used to hold the indexing data from Nutch. In this paradigm, I suppose the Nutch data needs to be fully mapped to Solr.

But I'm interested in a Solr-centric paradigm where Nutch is feeding data to Solr for a Solr-based application to use. I don't have any idea which is more popular, but all my own uses of Nutch have required me to integrate it with existing Solr schemas, and for that I need a different and much more flexible approach.

So maybe what I'm suggesting would be a parallel set of components for the second scenario, given that the first would still need to be supported. Possibly the existing set of components could support both paradigms, but that would be messy.


L


On 14/01/2014 14:07, Lajos wrote:

Hi all,

I've been working with Nutch/Solr integration for several enterprise
search projects for clients (as well as my forthcoming Solr book). I
think there are some real issues with the paradigm, and I'd like to
propose a slightly modified approach which I've had to take myself.

I think it's backwards to base the mapping from NutchDocument to
SolrDocument on the fields in the former. There are several problems:

1) this requires Solr to support all Nutch fields, which might not be
the case (segment, for example). That is an unreasonable requirement.
2) you can map a Nutch field to at most 2 Solr fields (i.e. one via a
<field> tag and one via a <copy> tag), because the source attribute is
the Map key and therefore you can only have one entry per source
3) there is no support for any transformations, literals, etc., such as
those the Solr data import handler offers
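
To illustrate problem 2, here is a minimal sketch in Python (chosen for
brevity; the actual plugin is Java). The field names and the
(source, dest) pairs are hypothetical, and the dict only stands in for
the plugin's Map. It shows why keying by source allows a single
destination per Nutch field, while keying by destination does not have
that limit.

```python
# Hypothetical (source, dest) pairs, as they might be read from <field> entries.
field_entries = [
    ("content", "content"),
    ("content", "body_text"),  # a second destination for the same source
    ("title", "title"),
]

# Keyed by source: the later "content" entry silently replaces the earlier one.
by_source = {}
for source, dest in field_entries:
    by_source[source] = dest

# Keyed by destination: every entry survives, so one source can feed many fields.
by_dest = {}
for source, dest in field_entries:
    by_dest[dest] = source

print(by_source)
print(by_dest)
```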

For example, I've built an enterprise search tool that aggregates lots
of different data sources together and uses Nutch to crawl the intranet.
The schema doesn't match everything Nutch sends. I have some literals
that need to be set and I need transformations.

My approach was to reverse the building of the SolrDocument, and
populate the doc based on the Solr destination fields as defined in
solrindex-mapping.xml, i.e., it populates the doc based on what the
target Solr wants to receive, not just what Nutch wants to send.

The map of fields in solrindex-mapping.xml is now keyed by dest, i.e.
the Solr field name, not source. That way, I can map a source to
multiple destinations if I want. I further add a mapping type attribute
(defaults to just a simple copy from Nutch to Solr) that supports
literals and (shortly) transformations.
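
A rough sketch of the destination-keyed idea, again in Python rather
than the plugin's Java. The "type" and "value" attribute names, the
field names and both helper functions are assumptions made for
illustration only; they are not taken from the actual patch or from the
shipped solrindex-mapping.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping file: one <field> per Solr destination field.
MAPPING_XML = """
<mapping>
  <fields>
    <field dest="id" source="url"/>
    <field dest="body" source="content"/>
    <field dest="body_exact" source="content"/>
    <field dest="origin" type="literal" value="intranet-crawl"/>
  </fields>
</mapping>
"""

def load_mapping(xml_text):
    """Return {dest: (type, argument)}, keyed by the Solr field name."""
    mapping = {}
    for field in ET.fromstring(xml_text).iter("field"):
        kind = field.get("type", "copy")  # assumed default: plain copy
        arg = field.get("value") if kind == "literal" else field.get("source")
        mapping[field.get("dest")] = (kind, arg)
    return mapping

def build_solr_doc(nutch_doc, mapping):
    """Populate only the fields the target Solr schema asks for."""
    solr_doc = {}
    for dest, (kind, arg) in mapping.items():
        if kind == "literal":
            solr_doc[dest] = arg
        elif kind == "copy" and arg in nutch_doc:
            solr_doc[dest] = nutch_doc[arg]
    return solr_doc

nutch_doc = {
    "url": "http://intranet.example/page",
    "content": "hello",
    "segment": "20140114",  # not mapped, so it never reaches Solr
}
solr_doc = build_solr_doc(nutch_doc, load_mapping(MAPPING_XML))
print(solr_doc)
```

In this sketch the unmapped segment field is simply left out, one source
(content) feeds two destinations, and the literal is set without any
Nutch field behind it.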

The change is easy, works well and, I think, fits better with the Solr
paradigm. I've made this change in the 1.x plugin, but obviously it can
easily be ported to 2.x.

If you see some merit in this approach, I can open a JIRA and submit
the changes. I also have an apache.org account somewhere (from my
openejb days) and would be happy to actually help implement it if you'd
like. I think adding in transformations would be a further benefit.

Let me know.

Thanks,

Lajos
