Proposal for SolrIndexWriter

Lajos Tue, 14 Jan 2014 05:09:20 -0800

Hi all,

I've been working with Nutch/Solr integration for several enterprise search projects for clients (as well as my forthcoming Solr book). I think there are some real issues with the paradigm, and I'd like to propose a slightly modified approach which I've had to take myself.

I think it's backwards to drive the mapping from NutchDocument to SolrDocument by the fields in the former. There are several problems:

1) This requires Solr to support all Nutch fields, which might not be the case (segment, for example). That is an unreasonable requirement.
2) You can map a Nutch field to at most 2 Solr fields (i.e. one via a <field> tag and one via a <copy> tag), because the source attribute is the Map key and therefore you can only have one entry per source.
3) There is no support for any transformations, literals, etc., like there is in, say, Solr data import.
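To illustrate point 2, here is a minimal Java sketch of what happens when a mapping table is keyed by the source field. This is not the actual Nutch code; the class name and the Solr field names ("text", "body_txt") are made up for illustration. It only shows the general Map behaviour behind the limitation: a second entry for the same source replaces the first.

```java
import java.util.HashMap;
import java.util.Map;

public class SourceKeyedMappingDemo {

    // Hypothetical stand-in for a mapping table keyed by the Nutch (source) field name.
    static Map<String, String> buildSourceKeyedMap() {
        Map<String, String> bySource = new HashMap<>();
        bySource.put("content", "text");     // content -> text
        bySource.put("content", "body_txt"); // same key: replaces the entry above
        return bySource;
    }

    public static void main(String[] args) {
        Map<String, String> bySource = buildSourceKeyedMap();
        // Only one destination survives for the "content" source field.
        System.out.println(bySource.size() + " entry, content -> " + bySource.get("content"));
    }
}
```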

For example, I've built an enterprise search tool that aggregates lots of different data sources together and uses Nutch to crawl the intranet. The schema doesn't match everything Nutch sends. I have some literals that need to be set and I need transformations.

My approach was to reverse the building of the SolrDocument and populate the doc based on the Solr destination fields as defined in solrindex-mapping.xml. In other words, the doc is populated based on what the target Solr wants to receive, not just what Nutch wants to send.

The map of fields in solrindex-mapping.xml is now keyed by dest, i.e. the Solr field name, not source. That way, I can map a source to multiple destinations if I want. I have also added a mapping type attribute (it defaults to a simple copy from Nutch to Solr) that supports literals and (shortly) transformations.
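Roughly, the idea can be sketched in plain Java as follows. This is only an illustration under assumptions, not the actual plugin code: the class, the MappingType enum, the FieldMapping holder and the field names "body_txt" and "source_system" are all hypothetical, and ordinary Maps stand in for NutchDocument and SolrDocument so that the example is self-contained.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DestKeyedMappingSketch {

    // Hypothetical mapping types: COPY is the default, LITERAL sets a fixed value.
    enum MappingType { COPY, LITERAL }

    // Hypothetical holder: for COPY, value is the Nutch source field name;
    // for LITERAL, value is the literal to send to Solr.
    static final class FieldMapping {
        final MappingType type;
        final String value;

        FieldMapping(MappingType type, String value) {
            this.type = type;
            this.value = value;
        }
    }

    // Walk the Solr destination fields and pull in what each one needs.
    // Nutch fields that no destination asks for are never sent.
    static Map<String, Object> buildSolrFields(Map<String, FieldMapping> byDest,
                                               Map<String, Object> nutchFields) {
        Map<String, Object> solrFields = new LinkedHashMap<>();
        for (Map.Entry<String, FieldMapping> entry : byDest.entrySet()) {
            FieldMapping mapping = entry.getValue();
            if (mapping.type == MappingType.LITERAL) {
                solrFields.put(entry.getKey(), mapping.value);
            } else {
                Object sourceValue = nutchFields.get(mapping.value);
                if (sourceValue != null) {
                    solrFields.put(entry.getKey(), sourceValue);
                }
            }
        }
        return solrFields;
    }

    // Small worked example: one source feeding two destinations, plus a literal.
    static Map<String, Object> example() {
        Map<String, FieldMapping> byDest = new LinkedHashMap<>();
        byDest.put("text", new FieldMapping(MappingType.COPY, "content"));
        byDest.put("body_txt", new FieldMapping(MappingType.COPY, "content"));
        byDest.put("source_system", new FieldMapping(MappingType.LITERAL, "intranet"));

        Map<String, Object> nutchFields = new LinkedHashMap<>();
        nutchFields.put("content", "hello");
        nutchFields.put("segment", "20140114"); // not mapped, so it is not sent

        return buildSolrFields(byDest, nutchFields);
    }

    public static void main(String[] args) {
        System.out.println(example());
    }
}
```

Because the table is keyed by destination, the same source ("content") can appear in as many entries as needed, and a field such as segment is simply skipped unless some destination asks for it.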

The change is easy, works well and, I think, fits better with the Solr paradigm. I've done this change in the 1.x plugin, but obviously it can easily be ported to 2.x.

If you see some merit to this approach, I can open a JIRA and submit the changes. I also have an apache.org account somewhere (from my OpenEJB days) and would be happy to actually help implement it if you'd like. I think adding in transformations would be a further benefit.


Let me know.

Thanks,

Lajos
