Hi all, this is my first post on this list, so bear with me if this is not the right place, the right topic, etc.
I'm currently migrating a Solr 3.x system to SolrCloud. It uses ExternalFileField for the common "popularity" ranking. I've tried to get ExternalFileField to work in SolrCloud, but it is quite a problem now that the data dir is not directly accessible (it's "in the cloud"). Moreover, while external files served the purpose well, generating millions of KV pairs for each update is not "cloud scale". After asking on the users mailing list for advice, I started coding and came up with a prototype of a possible replacement that I'd like to contribute. It's currently "working", but far from having been heavily tested under the various scenarios of a ZooKeeper-based system.

It is currently implemented as a plugin. Basically it's an UpdateProcessor, to be placed in the update chain after the DistributedUpdateProcessor and before the RunUpdateProcessor, as usual. This processor looks at the add/update request, searches for a specific field (the "popularity" field, for example), and delegates its persistence to a system other than the Lucene index. My current stupid implementation simply caches it in a concurrent sorted map which is "dumped" to a file upon commit. It's possible, however, to plug in implementations backed by any embedded KV store (JDBM, for example) to get much faster commit times and to avoid loading everything in RAM at startup.

This makes it possible to send updates about the "popularity" field to SolrCloud (or Solr without the cloud) as normal document updates. Soft commit (for NRT) and rollback are already supported. The fields persisted in the alternative system are removed from the document, so they should never reach Lucene at all. If an update consists only of these specific fields, it is not propagated to the Lucene index, so that (AFAIU) no reindexing or index growth takes place (not even a Lucene commit, if the commit involves only updates containing only these fields).
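For reference, the chain placement described above would look roughly like this in solrconfig.xml (the factory class name `my.pkg.ExternalFieldUpdateProcessorFactory` and the `fieldName` parameter are hypothetical, just to illustrate the ordering; only the solr.* factories are real):

```xml
<updateRequestProcessorChain name="popularity-chain">
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <!-- hypothetical factory: persists the named field outside the Lucene index -->
  <processor class="my.pkg.ExternalFieldUpdateProcessorFactory">
    <str name="fieldName">popularity</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```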
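To make the idea concrete, here is a minimal, self-contained sketch of the commit-dump store described above (class and method names are my own, not from the prototype, and real Solr/transaction-log integration is omitted): pending updates are buffered in a concurrent sorted map, a commit merges them into the committed view and dumps it to a file, and a rollback discards the pending changes.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical sketch of the "concurrent sorted map dumped on commit" store.
public class PopularityStore {
    // Committed view, loaded from the dump file at startup.
    private final ConcurrentSkipListMap<String, Float> committed = new ConcurrentSkipListMap<>();
    // Uncommitted updates buffered since the last commit.
    private final ConcurrentSkipListMap<String, Float> pending = new ConcurrentSkipListMap<>();
    private final Path file;

    public PopularityStore(Path file) throws IOException {
        this.file = file;
        if (Files.exists(file)) {
            for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                int eq = line.indexOf('=');
                committed.put(line.substring(0, eq), Float.parseFloat(line.substring(eq + 1)));
            }
        }
    }

    // Called by the update processor when it strips the field from a document.
    public void put(String docId, float value) {
        pending.put(docId, value);
    }

    // Readers see committed data only; NRT visibility would need a soft-commit hook.
    public Float get(String docId) {
        return committed.get(docId);
    }

    // Merge pending updates and dump the whole map to the file.
    public synchronized void commit() throws IOException {
        committed.putAll(pending);
        pending.clear();
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Float> e : committed.entrySet()) {
            sb.append(e.getKey()).append('=').append(e.getValue()).append('\n');
        }
        // Write to a temp file, then move into place, so a crash mid-dump
        // never leaves a truncated file behind.
        Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
        Files.write(tmp, sb.toString().getBytes(StandardCharsets.UTF_8));
        Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
    }

    // Discard everything buffered since the last commit.
    public synchronized void rollback() {
        pending.clear();
    }
}
```

An embedded KV store (JDBM or similar) would replace the full-file dump with incremental writes, which is where the faster commits and lower startup RAM come from.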
Then a specific field type (code mostly copy-pasted from ExternalFileField :) ) is able to use the same store instance to get an array of float values for a specific Reader and use it in a ValueSource for all the common uses. Since the processor is placed after the DistributedUpdateProcessor, it is (AFAIU) sharded and replicated the same way the Lucene indexes are (I know sharding works for sure; replication is not yet fully tested), so each core has its index and its "external files".

Now my questions are:
1) Do you think this is interesting, right, wrong, dangerous, etc.?
2) Do you see any error in my reasoning about update handlers?
3) How does creation/synchronization of a new replica happen? (i.e., where would I have to plug in to replicate the "external files" as well?)
4) I'd like to contribute this code, if you think it's worth it, but it needs to be worked on a bit before being submitted as a "ready to apply" patch. Could we use an Apache Lab or a sandbox space to cooperate on this, if someone is willing to help?

Let me know,
Simone
