Hi all, this is my first post on this list, so bear with me if this is not the right place, the right topic, etc.
I'm currently migrating a Solr 3.x system to SolrCloud. It uses ExternalFileField for the common "popularity" ranking. I've tried to get ExternalFileField to work in SolrCloud, but it is quite a problem now that the data dir is not directly accessible (it's "in the cloud"). Moreover, while external files served the purpose well, generating millions of KV pairs for each update is not "cloud scale". After asking on the users mailing list for advice, I started coding and came up with a prototype of a possible replacement that I'd like to contribute. It's currently "working", but far from having been heavily tested under the various scenarios of a ZooKeeper-based system.

It is currently implemented as a plugin. Basically it's an UpdateProcessor, to be placed in the update chain after the DistributedUpdateProcessor and before the RunUpdateProcessor, as usual. This processor looks at the add/update request, searches for a specific field (the "popularity" field, for example), and delegates its persistence to a system other than the Lucene index. My current stupid implementation simply caches it in a concurrent sorted map which is "dumped" to a file upon commit. It's possible, however, to plug in implementations backed by any embedded KV store (JDBM, for example) to get much faster commit times and to avoid loading everything in RAM at startup.

This makes it possible to send updates about the "popularity" field to SolrCloud (or Solr without the cloud) as normal document updates. Soft commit (for NRT) and rollback are already supported. The fields persisted in the alternative system are removed from the document, so they should never reach Lucene at all. If an update consists only of these specific fields, it is not propagated to the Lucene index, so that (AFAIU) no reindexing or index growth takes place (not even a Lucene commit, if the commit involves only updates containing only these fields).
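For reference, the chain placement described above would look roughly like this in solrconfig.xml (the factory class name `my.pkg.ExternalFieldUpdateProcessorFactory` and the `fieldName` parameter are hypothetical, just to illustrate the ordering; only the solr.* factories are real):

```xml
<updateRequestProcessorChain name="popularity-chain">
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <!-- hypothetical factory: persists the named field outside the Lucene index -->
  <processor class="my.pkg.ExternalFieldUpdateProcessorFactory">
    <str name="fieldName">popularity</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```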
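To make the idea concrete, here is a minimal, self-contained sketch of the commit-dump store described above (class and method names are my own, not from the prototype, and real Solr/transaction-log integration is omitted): pending updates are buffered in a concurrent sorted map, a commit merges them into the committed view and dumps it to a file, and a rollback discards the pending changes.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical sketch of the "concurrent sorted map dumped on commit" store.
public class PopularityStore {
    // Committed view, loaded from the dump file at startup.
    private final ConcurrentSkipListMap<String, Float> committed = new ConcurrentSkipListMap<>();
    // Uncommitted updates buffered since the last commit.
    private final ConcurrentSkipListMap<String, Float> pending = new ConcurrentSkipListMap<>();
    private final Path file;

    public PopularityStore(Path file) throws IOException {
        this.file = file;
        if (Files.exists(file)) {
            for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                int eq = line.indexOf('=');
                committed.put(line.substring(0, eq), Float.parseFloat(line.substring(eq + 1)));
            }
        }
    }

    // Called by the update processor when it strips the field from a document.
    public void put(String docId, float value) {
        pending.put(docId, value);
    }

    // Readers see committed data only; NRT visibility would need a soft-commit hook.
    public Float get(String docId) {
        return committed.get(docId);
    }

    // Merge pending updates and dump the whole map to the file.
    public synchronized void commit() throws IOException {
        committed.putAll(pending);
        pending.clear();
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Float> e : committed.entrySet()) {
            sb.append(e.getKey()).append('=').append(e.getValue()).append('\n');
        }
        // Write to a temp file, then move into place, so a crash mid-dump
        // never leaves a truncated file behind.
        Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
        Files.write(tmp, sb.toString().getBytes(StandardCharsets.UTF_8));
        Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
    }

    // Discard everything buffered since the last commit.
    public synchronized void rollback() {
        pending.clear();
    }
}
```

An embedded KV store (JDBM or similar) would replace the full-file dump with incremental writes, which is where the faster commits and lower startup RAM come from.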
Then a specific field type (code mostly copy-pasted from ExternalFileField :) ) is able to use the same store instance to get an array of float values for a specific Reader and use it in a ValueSource for all the common uses. Since the processor is placed after the DistributedUpdateProcessor, it is (AFAIU) sharded and replicated the same way the Lucene indexes are (I know sharding works for sure; replication is not yet fully tested), so each core has its index and its "external files".

Now my questions are:
1) Do you think this is interesting, right, wrong, dangerous, etc.?
2) Do you see any error in my reasoning about update handlers?
3) How does creation/synchronization of a new replica happen? (i.e., where would I have to plug in to replicate the "external files" as well?)
4) I'd like to contribute this code, if you think it's worth it, but it needs to be worked on a bit before being submitted as a "ready to apply" patch. Could we use an Apache Lab or a sandbox space to cooperate on this, if someone is willing to help?

Let me know,
Simone
