Re: Distributed Indexing

Upayavira Fri, 28 Jan 2011 05:03:25 -0800

Another point that will need some thought, and one I have heard alluded
to, is error handling.


Currently, as I understand it, if you post 500 documents to Solr, and
one has an error, the whole batch will fail.

Leaving aside whether that is the best behaviour, it is a behaviour that
will be impossible to mimic in a distributed indexing scenario without
effectively implementing distributed transactions.

I guess the simplest approach would be to find a way to report back to
the user that documents for these shards succeeded, documents for these
shards failed, and here's the error. The issue here is that when Solr
returns an error, it doesn't return error XML; it returns, for example,
a Tomcat error page with a stack trace (i.e. HTML). Perhaps all we can
do here is to embed that HTML as CDATA in the XML that the distributed
request handler returns to its client.
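
A minimal sketch of what such a per-shard report could look like. The
class name ShardReport, the element and attribute names, and the
"null means success" convention are all assumptions for illustration;
none of them exist in Solr. The one detail worth noting is that a CDATA
section cannot contain "]]>", so the raw HTML has to be split across two
sections wherever that sequence occurs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds the XML a distributed handler might return.
class ShardReport {
    // Wrap raw text in CDATA; "]]>" would end the section early, so split it.
    static String cdata(String raw) {
        return "<![CDATA[" + raw.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    // Escape a value for use inside a double-quoted XML attribute.
    static String attr(String raw) {
        return raw.replace("&", "&amp;").replace("<", "&lt;")
                  .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // results maps shard address -> raw error body, or null if the shard succeeded.
    static String toXml(Map<String, String> results) {
        StringBuilder sb = new StringBuilder("<response>\n");
        for (Map.Entry<String, String> e : results.entrySet()) {
            sb.append("  <shard name=\"").append(attr(e.getKey())).append("\"");
            if (e.getValue() == null) {
                sb.append(" status=\"ok\"/>\n");
            } else {
                sb.append(" status=\"failed\">")
                  .append(cdata(e.getValue()))
                  .append("</shard>\n");
            }
        }
        return sb.append("</response>\n").toString();
    }
}

class ShardReportDemo {
    public static void main(String[] args) {
        Map<String, String> results = new LinkedHashMap<>();
        results.put("localhost:8983/solr", null);
        results.put("localhost:7574/solr", "<html><body>stack trace here</body></html>");
        System.out.print(ShardReport.toXml(results));
    }
}
```

The client can then tell which shards need their documents reposted
without having to parse the embedded HTML at all.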

Then, in the worst case, the client could fix the error and repost
everything. All documents would be re-indexed across all shards, but in
the long run there's no big issue with that.

Upayavira

On Fri, 28 Jan 2011 12:24 +0000, "Upayavira" <u...@odoko.co.uk> wrote:
> Hi Soheb,
> 
> On Wed, 26 Jan 2011 16:29 +0000, "Soheb Mahmood"
> <soheb.luc...@gmail.com> wrote:
> 
> > We are going to implement distributed indexing for Solr - without the
> > use of SolrCloud (so it can be easily up-scaled). We have a deadline by
> > February to get this done, so we need to get cracking ;) 
> 
> :-)
>  
> > So far, we've had a look at the solr classes and thought about
> > distributed indexing on Solr, and we have come up with these ideas:
> > 
> > 1. We plan to modify SimplePostTool to accommodate posting to specific
> > shards. We are going to add an optional system property to allow the
> > user to specify a list of shards to index to Solr.
> > Example of this being "java
> > -Durl=http://localhost:7574/solr/collection1/update
> > -Dshards=localhost:8983/solr,localhost:7574/solr -jar post.jar <list of
> > XML files>"
> 
> As Yonik says, the SimplePostTool is really for testing. The shard
> information must be contained within the URL, and processed by an
> UpdateRequestHandler (called DistributedUpdateRequestHandler?). That
> way, you can embed that data into the solrconfig.xml file as an
> invariant or a default, or later it can be derived from Zookeeper in
> SolrCloud.
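
A rough sketch of how a handler might resolve the shard list from those
three sources. The class ShardParam and its methods are hypothetical;
the only behaviour assumed is the usual request-handler convention that
an invariant overrides whatever the request says, and a default applies
only when the request says nothing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: resolves a comma-separated "shards" value.
class ShardParam {
    // Split "host:port/path,host:port/path" into trimmed, non-empty entries.
    static List<String> parse(String value) {
        List<String> shards = new ArrayList<>();
        if (value == null) {
            return shards;
        }
        for (String part : value.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                shards.add(trimmed);
            }
        }
        return shards;
    }

    // Precedence: invariant, then the request parameter, then the default.
    static List<String> resolve(String invariant, String request, String dflt) {
        List<String> fixed = parse(invariant);
        if (!fixed.isEmpty()) {
            return fixed;
        }
        List<String> fromRequest = parse(request);
        return fromRequest.isEmpty() ? parse(dflt) : fromRequest;
    }
}

class ShardParamDemo {
    public static void main(String[] args) {
        // No invariant configured, so the request value is used.
        System.out.println(ShardParam.resolve(
            null, "localhost:8983/solr,localhost:7574/solr", null));
    }
}
```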
> 
> > We also plan to modify server request processing to handle distributed
> > indexing. We are looking at CommonsHttpSolrServer.java for ways to
> > accomplish this.
> > 
> > With all these changes, we realise that we are only modifying the Java
> > version, and that other languages need to be updated to accommodate our
> > changes (e.g. perl). We were wondering if there was a simple way of
> > applying these changes we wrote in Java across all the other languages.
> 
> If you add this support to Solr itself, it is then the responsibility of
> each client library to worry about supporting it.
> 
> You should only be focussing on the Solr DistributedUpdateHandler code
> rather than on any client libraries (other than the code you use as your
> test harness).
> 
> > 2. We are going to make an interface to handle distributed writing. We
> > plan for it to sit between the Solr server and the shards - if no shards
> > are specified, then the post.jar tool will work exactly the same way it
> > does now. However, if the user specifies shards for post.jar, then we
> > want a class that has extended our interface to kick into action. 
> 
> The interface you need will be a ShardPolicy or some such. You will hand
> it a document and either a shard count or a list of shards, and it will
> tell you which shard that document should go to. This interface will
> then allow for pluggable shard policies, whether a simple modulo on the
> document ID (for deterministic indexing) or a simple round-robin (for
> random indexing).
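
One way such an interface and the two policies might look, as a sketch
only. The name ShardPolicy comes from the message; the method signature
and both implementation classes are assumptions. Document IDs are
treated as strings here, so the "modulo on the document ID" is taken
over the ID's hash code rather than over a numeric ID.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical interface: picks a shard index for a document.
interface ShardPolicy {
    // Returns an index in the range [0, shardCount).
    int shardFor(String docId, int shardCount);
}

// Deterministic: the same ID maps to the same shard while shardCount is fixed.
class ModuloShardPolicy implements ShardPolicy {
    @Override
    public int shardFor(String docId, int shardCount) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(docId.hashCode(), shardCount);
    }
}

// Ignores the ID and hands out shards in turn.
class RoundRobinShardPolicy implements ShardPolicy {
    private final AtomicInteger next = new AtomicInteger();

    @Override
    public int shardFor(String docId, int shardCount) {
        return Math.floorMod(next.getAndIncrement(), shardCount);
    }
}

class ShardPolicyDemo {
    public static void main(String[] args) {
        ShardPolicy policy = new ModuloShardPolicy();
        System.out.println("doc-42 goes to shard " + policy.shardFor("doc-42", 3));
    }
}
```

Note that with the modulo policy, changing the number of shards changes
where most existing IDs map to, so updates to an existing document are
only routed to the same shard while the shard count stays the same.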
> 
> You'll then need to split the documents you've gathered from the post
> request to the UpdateRequestHandler, and forward them to whichever
> shards the ShardPolicy suggested.
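
The splitting step could be sketched roughly as below. BatchSplitter is
a hypothetical name, documents are represented by their IDs only, and
the policy is passed in as a plain function from ID to shard index to
keep the example self-contained. Forwarding each batch to its shard over
HTTP is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToIntFunction;

// Hypothetical helper: groups the documents of one post by target shard.
class BatchSplitter {
    // chooser must return an index into the shards list for each document ID.
    static Map<String, List<String>> split(List<String> docIds,
                                           List<String> shards,
                                           ToIntFunction<String> chooser) {
        Map<String, List<String>> batches = new LinkedHashMap<>();
        for (String shard : shards) {
            batches.put(shard, new ArrayList<>());
        }
        for (String id : docIds) {
            String shard = shards.get(chooser.applyAsInt(id));
            batches.get(shard).add(id);
        }
        return batches;
    }
}

class BatchSplitterDemo {
    public static void main(String[] args) {
        List<String> shards = Arrays.asList("localhost:8983/solr", "localhost:7574/solr");
        List<String> docs = Arrays.asList("doc-1", "doc-2", "doc-3");
        // Hash-modulo chooser, standing in for whatever policy is configured.
        Map<String, List<String>> batches = BatchSplitter.split(
            docs, shards, id -> Math.floorMod(id.hashCode(), shards.size()));
        batches.forEach((shard, ids) -> System.out.println(shard + " <- " + ids));
    }
}
```

Each entry in the resulting map is then one update request to one shard,
which is also the natural unit for reporting success or failure.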
> 
> > 3. We plan to test our results by acceptance testing (we run Solr and
> > see if it works ourselves) and writing a test class.
> 
> Sounds great.
> 
> Upayavira
> --- 
> Enterprise Search Consultant at Sourcesense UK, 
> Making Sense of Open Source
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
> 
> 
--- 
Enterprise Search Consultant at Sourcesense UK, 
Making Sense of Open Source

