Hi,

I have a slightly unusual set of requirements and am looking for advice. I 
have searched the archives and found some relevant posts, but they are fairly 
old and not an exact match, so I thought I would give this a try.

We will eventually have about 50TB of raw, non-searchable data and 25TB of 
search attributes to handle in Lucene, across about 1.25 trillion documents. 
The app is write once, read many. There are many document types involved that 
need to be searchable separately or together, with some common attributes but 
also unique ones per type. I plan on using a JCR implementation (ModeShape) 
that uses Lucene under the covers. The data itself is not searchable, only the 
attributes. I plan to hook the JCR repo up to OpenStack Object Storage on 
commodity hardware, eventually with 5 machines, each with 24 x 2TB drives. 
That should allow for redundancy (3 copies), although I suppose we would add 
bigger drives as we go on.
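
To make the mixed document types a bit more concrete, this is roughly the 
kind of attribute record per document I'm picturing (field names are invented 
examples, and I'm only guessing at the Lucene 3.x API since I haven't written 
any Lucene code yet):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.NumericField;

    // Common attributes on every document, plus whatever is unique to the
    // type, and a "docType" field so types can be searched separately or
    // together.
    Document doc = new Document();
    doc.add(new Field("docType", "invoice",
            Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("customerId", "C-12345",
            Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new NumericField("createdDate", Field.Store.YES, true)
            .setLongValue(20110615L));
    // A type-specific attribute, only present on "invoice" documents:
    doc.add(new Field("invoiceNumber", "INV-0042",
            Field.Store.YES, Field.Index.NOT_ANALYZED));
    // Pointer back to the raw, non-searchable blob in object storage
    // (stored, but not indexed):
    doc.add(new Field("rawObjectUrl", "swift://some-container/some-object",
            Field.Store.YES, Field.Index.NO));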

Since there is so much data to index (not outrageous for these days, but a 
bit chunky), I was assuming that the Lucene indexes would go on the object 
storage too, to handle availability and other infrastructure issues. Most of 
the searches would be date-constrained, so I thought the indexes could be 
sharded by date.
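
By "sharded by date" I mean something like one index per month, so that a 
date-constrained search only has to open the shards that overlap the range. 
Roughly this (again just a sketch, with guessed-at paths and API):

    import java.io.File;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.NumericRangeQuery;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    // Only open the monthly shards that the query's date range touches.
    IndexReader may  = IndexReader.open(
            FSDirectory.open(new File("/indexes/idx-2011-05")));
    IndexReader june = IndexReader.open(
            FSDirectory.open(new File("/indexes/idx-2011-06")));
    IndexSearcher searcher = new IndexSearcher(new MultiReader(may, june));

    // Then constrain by type and by the date attribute within those shards.
    BooleanQuery query = new BooleanQuery();
    query.add(new TermQuery(new Term("docType", "invoice")),
            BooleanClause.Occur.MUST);
    query.add(NumericRangeQuery.newLongRange("createdDate",
            20110515L, 20110615L, true, true), BooleanClause.Occur.MUST);
    TopDocs hits = searcher.search(query, 10);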

There would also be a local disk index, built in near real time on the JCR 
hardware, that could be regularly merged into the main indexes on the object 
storage, I suppose.
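
That merge step is the part I'm least sure about. What I'm imagining is 
something along the lines of IndexWriter.addIndexes (sketch only; I realise a 
plain FSDirectory wouldn't really work against object storage, so pretend the 
main shard directory points at wherever that shard actually lives):

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    // Periodically fold the local near-real-time index into the main shard.
    Directory localIndex = FSDirectory.open(new File("/local/nrt-index"));
    Directory mainShard  = FSDirectory.open(new File("/main/idx-2011-06"));

    IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_36,
            new StandardAnalyzer(Version.LUCENE_36));
    IndexWriter writer = new IndexWriter(mainShard, cfg);
    writer.addIndexes(localIndex);  // copies the local segments across
    writer.commit();
    writer.close();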

Does that make sense, and would it work? Sorry, but this is just theoretical at 
the moment and I'm not experienced in Lucene, as you can no doubt tell.

I came across a piece about Hadoop and distributed Solr, 
http://blog.mgm-tp.com/2010/09/hadoop-log-management-part4/, and I'm now 
wondering whether that would be a better approach. Or are there any other 
suggestions?

Many Thanks,
The Captn
