On Tue, 01 Feb 2011 00:26 +0000, "William Mayor"
<m...@williammayor.co.uk> wrote:
> Hi Guys
> 
> I've had a go at creating the ShardDistributionPolicy interface and a
> few implementations. I've created a patch
> (https://issues.apache.org/jira/browse/SOLR-2341) let me know what
> needs doing.


> Currently I assume that the documents passed to the policy will be
> represented by some kind of identifier, and that one needs only to
> match the ID to a shard. This is better (I think) than reading the
> document from the POST and figuring out some kind of unique identifier
> ourselves?

Your code looks fine to me, except that it should take in a SolrDocument
object (or a list of them) rather than strings. Then, for your hash-based
version, you can take a hash of the "id" field.
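
For illustration, the hash version might look something like this (just a
sketch - the interface and method names here are mine, not necessarily
what's in your patch; on the update path the documents would be
SolrInputDocument instances):

  import org.apache.solr.common.SolrInputDocument;

  // One possible shape for the interface: map a document to a shard index.
  public interface ShardDistributionPolicy {
    /** Return the shard (0 .. numShards-1) that should receive this document. */
    int shardFor(SolrInputDocument doc, int numShards);
  }

  // Hash-based implementation (in its own file): hash the "id" field, mod shard count.
  public class HashShardDistributionPolicy implements ShardDistributionPolicy {
    public int shardFor(SolrInputDocument doc, int numShards) {
      Object id = doc.getFieldValue("id");
      if (id == null) {
        throw new IllegalArgumentException("document has no id field");
      }
      // mask off the sign bit so the result is never negative
      return (id.toString().hashCode() & Integer.MAX_VALUE) % numShards;
    }
  }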

> A question we've had about this is who decides what policy to use, and
> where do they specify it? I'm inclined to think that the user (the person
> POSTing data) does not mind what policy is used, but the administrator
> might. This leads me to think that the policy should be set in the Solr
> config file. My colleagues disagree that the user will not mind, and
> would rather see the policy specified in the URL. We've noticed that
> request handlers can be specified in both, so perhaps we should adopt
> that idea instead (and as a kind of compromise :) ).

To stick with Solr conventions, you would specify the
ShardDistributionPolicy in solrconfig.xml, within the configuration of
your DistributedUpdateRequestHandler. In that sense it is hidden from
your users and managed by the administrator.

That said, with this approach an administrator could still expose
multiple policies, by defining multiple DistributedUpdateRequestHandler
instances in solrconfig.xml, each with a different URL.

To give you an example, but for search rather than indexing:

  <requestHandler name="/dismax" class="solr.SearchHandler"
  default="true">
    <!-- default values for query parameters -->
     <lst name="defaults">
       <str name="defType">dismax</str>
     </lst>
  </requestHandler>

This will configure requests to http://localhost:8983/solr/dismax?q=blah
to be handled by the dismax query parser.

More relevant to you:

  <requestHandler name="/distrib" class="solr.SearchHandler"
  default="true">
    <!-- default values for query parameters -->
     <lst name="defaults">
       <str
       name="shards">http://localhost:8983/solr,http://localhost:7983/solr</str>
     </lst>
  </requestHandler>

This would, by default, distribute all queries sent to
http://localhost:8983/solr/distrib?q=blah across the two Solr instances
listed in the shards parameter.

For now, I'd suggest seeing if you can add a
distributionPolicyClass="org.apache.solr.blah" parameter to define the
class that this update request handler is going to use.
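
Something roughly like this might work for picking it up in the handler (a
sketch only - I'm assuming the handler extends RequestHandlerBase, and that
the class name arrives as an init arg named "distributionPolicyClass";
whether it ends up as an attribute or a nested <str> is a detail to settle):

  // In the DistributedUpdateRequestHandler (sketch; names are illustrative).
  // NamedList is org.apache.solr.common.util.NamedList.
  private ShardDistributionPolicy policy;

  public void init(NamedList args) {
    super.init(args);
    String className = (String) args.get("distributionPolicyClass");
    try {
      policy = (ShardDistributionPolicy) Class.forName(className).newInstance();
    } catch (Exception e) {
      throw new RuntimeException("Could not load distribution policy: " + className, e);
    }
  }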

To everyone else who got this far - please chip in if you see better
ways of doing this.

Upayavira

> All the best
> 
> William
> 
> On Sat, Jan 29, 2011 at 11:56 PM, Upayavira <u...@odoko.co.uk> wrote:
> > Lance,
> >
> > Firstly, we're proposing a ShardDistributionPolicy interface for which
> > there is a default (mod of the doc ID) but other implementations are
> > possible. Another easy implementation would be a randomised or round
> > robin one.
> >
> > As to threading, the first task would be to put all of the source
> > documents into "buckets", one bucket per shard, using the above
> > ShardDistributionPolicy to assign documents to buckets/shards. Then all
> > of the documents in a "bucket" could be sent to the relevant shard for
> > indexing (which would be nothing more than a normal HTTP post (or solrj
> > call?)).
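> >
> > Something like the following, perhaps (a rough sketch only - the policy
> > method name is illustrative, and error handling is omitted):
> >
> >   Map<Integer, List<SolrInputDocument>> buckets =
> >       new HashMap<Integer, List<SolrInputDocument>>();
> >   for (SolrInputDocument doc : docs) {
> >     int shard = policy.shardFor(doc, shardUrls.size()); // illustrative method
> >     List<SolrInputDocument> bucket = buckets.get(shard);
> >     if (bucket == null) {
> >       bucket = new ArrayList<SolrInputDocument>();
> >       buckets.put(shard, bucket);
> >     }
> >     bucket.add(doc);
> >   }
> >   // each bucket is then sent to its shard - a plain HTTP POST of the docs,
> >   // or a SolrJ SolrServer.add(bucket) call against that shard's URL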
> >
> > As to whether this would be single threaded or multithreaded, I would
> > guess we would aim to do it the same as the distributed search code
> > (which I have not yet reviewed). However, it could presumably be
> > single-threaded, but use asynchronous HTTP.
> >
> > Regards, Upayavira
> >
> > On Sat, 29 Jan 2011 15:09 -0800, "Lance Norskog" <goks...@gmail.com>
> > wrote:
> >> I would suggest that a DistributedUpdateRequestHandler run
> >> single-threaded, doing only one document at a time. If I want more
> >> than one, I run it twice or N times with my own program.
> >>
> >> Also, this should have a policy object which decides exactly how
> >> documents are distributed. There are different techniques for
> >> different use cases.
> >>
> >> Lance
> >>
> >> On Sat, Jan 29, 2011 at 12:34 PM, Soheb Mahmood <soheb.luc...@gmail.com>
> >> wrote:
> >> > Hello Yonik,
> >> >
> >> > On Thu, 2011-01-27 at 08:01 -0500, Yonik Seeley wrote:
> >> >> Making it easy for clients I think is key... one should be able to
> >> >> update any node in the solr cluster and have solr take care of the
> >> >> hard part about updating all relevant shards.  This will most likely
> >> >> involve an update processor.  This approach allows all existing update
> >> >> methods (including things like CSV file upload) to still work
> >> >> correctly.
> >> >>
> >> >> Also post.jar is really just for testing... a command-line replacement
> >> >> for "curl" for those who may not have it.  It's not really a
> >> >> recommended way for updating Solr servers in production.
> >> >
> >> > OK, I've abandoned the post.jar tool idea in favour of a
> >> > DistributedUpdateRequestProcessor class (I've been looking into other
> >> > classes like UpdateRequestProcessor, RunUpdateRequestProcessor,
> >> > SignatureUpdateProcessorFactory, and SolrQueryRequest to see how they
> >> > are used and what data they store - which is why it's taken me some
> >> > time to respond).
> >> >
> >> > My big question now is: is it necessary to have a factory class for
> >> > DistributedUpdateRequestProcessor? I've seen this pattern in lots of
> >> > places, from RunUpdateProcessorFactory (where the factory class is only
> >> > a few lines of code) to SignatureUpdateProcessorFactory. At first I
> >> > thought it would be a good design idea to include one (in a generic
> >> > sense), but on reflection the DistributedUpdateRequestHandler would only
> >> > be running once, taking in all the requests, so it seems somewhat
> >> > pointless to write one.
> >> >
> >> > That is my "burning" question for now. I have a few more questions, but
> >> > I'm sure that as I look further into the code I'll either have more or
> >> > find that they've already been answered.
> >> >
> >> > Many thanks!
> >> >
> >> > Soheb Mahmood
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> Lance Norskog
> >> goks...@gmail.com
> >>
> > ---
> > Enterprise Search Consultant at Sourcesense UK,
> > Making Sense of Open Source
> >
> >
> >
> 
> 
--- 
Enterprise Search Consultant at Sourcesense UK, 
Making Sense of Open Source


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
