Are you looking at Solr? It already has a lot of the infrastructure you'd otherwise be building yourself on top of Lucene, including replication, distributed searching, etc. Yes, there's a learning curve for something new, but your Lucene experience will help you a LOT with that. It also supports sharding, which is what you'll certainly have to do to handle your billion+ documents. Don't re-invent the wheel!!
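To give a feel for how little application code a sharded search needs, here's a rough sketch using SolrJ (more on SolrJ below). This is just an illustration, not code from this thread: the host names and shard layout are made up, and the API shown is the SolrJ of the Solr 1.4 era (CommonsHttpSolrServer was renamed HttpSolrServer in later releases).

// Hypothetical sketch: index one document, then run a query that Solr
// fans out across three shards and merges for you.
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class ShardedSearchSketch {
    public static void main(String[] args) throws Exception {
        // Any shard can act as the coordinator for a distributed request.
        SolrServer server = new CommonsHttpSolrServer("http://shard1:8983/solr");

        // Index a document on this shard (in practice you'd hash docs across shards).
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("text", "hello distributed world");
        server.add(doc);
        server.commit();

        // Distributed query: Solr sends the request to every listed shard
        // and merges the results, so the application code stays the same
        // no matter how many shards you end up with.
        SolrQuery query = new SolrQuery("text:hello");
        query.set("shards", "shard1:8983/solr,shard2:8983/solr,shard3:8983/solr");
        QueryResponse response = server.query(query);
        System.out.println("hits: " + response.getResults().getNumFound());
    }
}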
In conjunction, see SolrJ, which gives you a Java interface to Solr and may come in handy.

Start here: http://wiki.apache.org/solr/

Best
Erick

On Mon, Nov 22, 2010 at 1:46 AM, Luca Rondanini <luca.rondan...@gmail.com> wrote:
> Hi David, thanks for your answer. It really helped a lot! So, you have an
> index with more than 2 billion documents. This is pretty much the answer I
> was searching for: Lucene alone is able to manage such a big index.
>
> What kind of problems do you have with the parallel searchers? I'm going to
> build my index in the next couple of weeks; if you want, we can compare our
> data.
>
> Thanks again,
> Luca
>
>
> On Sun, Nov 21, 2010 at 6:22 PM, David Fertig <dfer...@cymfony.com> wrote:
>
> > Actually I've been bitten by a still-unresolved issue with the parallel
> > searchers and recommend a MultiReader instead.
> > We have a couple billion docs in our archives as well. Breaking them up
> > by day worked well for us, but you'll need to do something.
> >
> > -----Original Message-----
> > From: Luca Rondanini [mailto:luca.rondan...@gmail.com]
> > Sent: Sunday, November 21, 2010 8:13 PM
> > To: java-user@lucene.apache.org; yo...@lucidimagination.com
> > Subject: Re: best practice: 1.4 billions documents
> >
> > Thank you both!
> >
> > Johannes, Katta seems interesting, but I will need to solve the problem
> > of "hot" updates to the index.
> >
> > Yonik, I see your point - so your suggestion would be to build an
> > architecture based on ParallelMultiSearcher?
> >
> >
> > On Sun, Nov 21, 2010 at 3:48 PM, Yonik Seeley <yo...@lucidimagination.com> wrote:
> >
> > > On Sun, Nov 21, 2010 at 6:33 PM, Luca Rondanini
> > > <luca.rondan...@gmail.com> wrote:
> > > > Hi everybody,
> > > >
> > > > I really need some good advice! I need to index in Lucene something
> > > > like 1.4 billion documents. I have experience with Lucene, but I've
> > > > never worked with such a big number of documents. Also, this is just
> > > > the number of docs at "start-up": they are going to grow, and fast.
> > > >
> > > > I don't have to tell you that I need the system to be fast and to
> > > > support real-time updates to the documents.
> > > >
> > > > The first solution that came to my mind was to use ParallelMultiSearcher,
> > > > splitting the index into many "sub-indexes" (how many docs per index?
> > > > 100,000?), but I don't have experience with it and I don't know how
> > > > well it will scale as the number of documents grows!
> > > >
> > > > A more solid solution seems to be some kind of integration with Hadoop,
> > > > but I didn't find much about Lucene and Hadoop integration.
> > > >
> > > > Any idea? Which direction should I go (pure Lucene or Hadoop)?
> > >
> > > There seems to be a common misconception about Hadoop regarding search.
> > > Map-reduce as implemented in Hadoop is really for batch-oriented jobs
> > > only (or those types of jobs where you don't need a quick response
> > > time). It's definitely not for normal queries (unless you have unusual
> > > requirements).
> > >
> > > -Yonik
> > > http://www.lucidimagination.com
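For reference, David's suggestion in the quoted thread (one index per day, searched through a single MultiReader rather than a ParallelMultiSearcher) boils down to something like the sketch below. Again, this is only an illustration and not code from the thread: the paths, field name, and number of daily indexes are made up, and the API is the Lucene 3.x style that was current at the time.

// Hypothetical sketch: open one reader per daily index and search them
// all through a single MultiReader.
import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class DailyIndexSearchSketch {
    public static void main(String[] args) throws Exception {
        // One sub-index per day; a new day just means one more reader in the list.
        String[] days = { "/indexes/2010-11-20", "/indexes/2010-11-21", "/indexes/2010-11-22" };
        IndexReader[] readers = new IndexReader[days.length];
        for (int i = 0; i < days.length; i++) {
            readers[i] = IndexReader.open(FSDirectory.open(new File(days[i])));
        }

        // MultiReader presents the daily indexes as one logical index,
        // so scoring and paging behave as with a single reader.
        MultiReader multi = new MultiReader(readers);
        IndexSearcher searcher = new IndexSearcher(multi);

        TopDocs hits = searcher.search(new TermQuery(new Term("text", "hello")), 10);
        System.out.println("total hits: " + hits.totalHits);

        searcher.close();
        multi.close(); // closes the sub-readers as well, by default
    }
}

Adding a new day is just one more reader in the array, and old daily indexes can be merged or retired independently as they age.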