Are you looking at Solr? It already has a lot of the infrastructure you'd otherwise be building yourself on top of Lucene, including replication, distributed searching, etc. Yes, there's a learning curve for something new, but your Lucene experience will help you a LOT with that. It also supports sharding, which is what you'll certainly have to do to handle your billion+ documents. Don't re-invent the wheel!!
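To give a feel for how little application code a sharded search needs, here's a rough sketch using SolrJ (more on SolrJ below). This is just an illustration, not code from this thread: the host names and shard layout are made up, and the API shown is the SolrJ of the Solr 1.4 era (CommonsHttpSolrServer was renamed HttpSolrServer in later releases).

// Hypothetical sketch: index one document, then run a query that Solr
// fans out across three shards and merges for you.
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class ShardedSearchSketch {
    public static void main(String[] args) throws Exception {
        // Any shard can act as the coordinator for a distributed request.
        SolrServer server = new CommonsHttpSolrServer("http://shard1:8983/solr");

        // Index a document on this shard (in practice you'd hash docs across shards).
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("text", "hello distributed world");
        server.add(doc);
        server.commit();

        // Distributed query: Solr sends the request to every listed shard
        // and merges the results, so the application code stays the same
        // no matter how many shards you end up with.
        SolrQuery query = new SolrQuery("text:hello");
        query.set("shards", "shard1:8983/solr,shard2:8983/solr,shard3:8983/solr");
        QueryResponse response = server.query(query);
        System.out.println("hits: " + response.getResults().getNumFound());
    }
}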
In conjunction, see SolrJ, which gives you a Java interface to Solr and may come in handy.

Start here: http://wiki.apache.org/solr/

Best
Erick

On Mon, Nov 22, 2010 at 1:46 AM, Luca Rondanini <luca.rondan...@gmail.com> wrote:
> Hi David, thanks for your answer. It really helped a lot! So, you have an
> index with more than 2 billion documents. This is pretty much the answer I
> was searching for: Lucene alone is able to manage such a big index.
>
> What kind of problems do you have with the parallel searchers? I'm going to
> build my index in the next couple of weeks; if you want, we can compare our
> data.
>
> Thanks again,
> Luca
>
>
> On Sun, Nov 21, 2010 at 6:22 PM, David Fertig <dfer...@cymfony.com> wrote:
>
> > Actually I've been bitten by a still-unresolved issue with the parallel
> > searchers and recommend a MultiReader instead.
> > We have a couple billion docs in our archives as well. Breaking them up
> > by day worked well for us, but you'll need to do something.
> >
> > -----Original Message-----
> > From: Luca Rondanini [mailto:luca.rondan...@gmail.com]
> > Sent: Sunday, November 21, 2010 8:13 PM
> > To: java-user@lucene.apache.org; yo...@lucidimagination.com
> > Subject: Re: best practice: 1.4 billions documents
> >
> > Thank you both!
> >
> > Johannes, Katta seems interesting, but I will need to solve the problem
> > of "hot" updates to the index.
> >
> > Yonik, I see your point - so your suggestion would be to build an
> > architecture based on ParallelMultiSearcher?
> >
> >
> > On Sun, Nov 21, 2010 at 3:48 PM, Yonik Seeley <yo...@lucidimagination.com> wrote:
> >
> > > On Sun, Nov 21, 2010 at 6:33 PM, Luca Rondanini
> > > <luca.rondan...@gmail.com> wrote:
> > > > Hi everybody,
> > > >
> > > > I really need some good advice! I need to index in Lucene something
> > > > like 1.4 billion documents. I have experience with Lucene, but I've
> > > > never worked with such a big number of documents. Also, this is just
> > > > the number of docs at "start-up": they are going to grow, and fast.
> > > >
> > > > I don't have to tell you that I need the system to be fast and to
> > > > support real-time updates to the documents.
> > > >
> > > > The first solution that came to my mind was to use ParallelMultiSearcher,
> > > > splitting the index into many "sub-indexes" (how many docs per index?
> > > > 100,000?), but I don't have experience with it and I don't know how
> > > > well it will scale as the number of documents grows!
> > > >
> > > > A more solid solution seems to be some kind of integration with Hadoop,
> > > > but I didn't find much about Lucene and Hadoop integration.
> > > >
> > > > Any idea? Which direction should I go (pure Lucene or Hadoop)?
> > >
> > > There seems to be a common misconception about Hadoop regarding search.
> > > Map-reduce as implemented in Hadoop is really for batch-oriented jobs
> > > only (or those types of jobs where you don't need a quick response
> > > time). It's definitely not for normal queries (unless you have unusual
> > > requirements).
> > >
> > > -Yonik
> > > http://www.lucidimagination.com
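For reference, David's suggestion in the quoted thread (one index per day, searched through a single MultiReader rather than a ParallelMultiSearcher) boils down to something like the sketch below. Again, this is only an illustration and not code from the thread: the paths, field name, and number of daily indexes are made up, and the API is the Lucene 3.x style that was current at the time.

// Hypothetical sketch: open one reader per daily index and search them
// all through a single MultiReader.
import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class DailyIndexSearchSketch {
    public static void main(String[] args) throws Exception {
        // One sub-index per day; a new day just means one more reader in the list.
        String[] days = { "/indexes/2010-11-20", "/indexes/2010-11-21", "/indexes/2010-11-22" };
        IndexReader[] readers = new IndexReader[days.length];
        for (int i = 0; i < days.length; i++) {
            readers[i] = IndexReader.open(FSDirectory.open(new File(days[i])));
        }

        // MultiReader presents the daily indexes as one logical index,
        // so scoring and paging behave as with a single reader.
        MultiReader multi = new MultiReader(readers);
        IndexSearcher searcher = new IndexSearcher(multi);

        TopDocs hits = searcher.search(new TermQuery(new Term("text", "hello")), 10);
        System.out.println("total hits: " + hits.totalHits);

        searcher.close();
        multi.close(); // closes the sub-readers as well, by default
    }
}

Adding a new day is just one more reader in the array, and old daily indexes can be merged or retired independently as they age.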