Re: Indexing 100Gb of readonly numeric data

Erick Erickson Wed, 15 Feb 2012 13:48:58 -0800

Actually, your index may well end up larger than your source data,
assuming you're going to both store and index everything.


There's also the "deep paging" issue, see:
https://issues.apache.org/jira/browse/SOLR-1726
which comes into play if you expect to return a lot of rows.
Solr really doesn't have the "cursor" concept as RDBMSs do.
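
To illustrate why deep paging gets expensive, here is a minimal Python
sketch. It is not Solr code: the function names and the toy scores are
made up for illustration. To return rows start..start+rows, the engine
has to collect and rank the top (start + rows) hits, so the work grows
with the offset. The second function shows the cursor-style alternative
of resuming after the last sort key already returned; it only
illustrates the concept.

```python
import heapq

def page_by_offset(scored_docs, start, rows):
    """Offset paging: rank the top (start + rows) hits, then drop the first `start`."""
    # scored_docs: iterable of (score, doc_id) tuples; larger tuples rank first.
    top = heapq.nlargest(start + rows, scored_docs)  # queue size grows with `start`
    return top[start:start + rows]

def page_after(scored_docs, last_key, rows):
    """Cursor-style paging: keep only hits that sort after the last one returned."""
    if last_key is not None:
        scored_docs = (d for d in scored_docs if d < last_key)
    return heapq.nlargest(rows, scored_docs)  # queue size stays at `rows`

# Toy data: 10,000 hits with made-up scores.
docs = [(float(i % 97), i) for i in range(10_000)]
```

Both functions return the same rows for a given page, but the first one
keeps (start + rows) candidates in memory per request while the second
never keeps more than `rows`.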

My gut feeling is that Solr is primarily a *text* search engine, and
this feels like something better suited to an RDBMS. That said,
I'm quite sure you can make Solr/Lucene do the tricks you want
if you're really RDBMS-averse <G>....
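
As a rough sketch of what the query from the original message could
look like in Solr's query syntax, the Python below builds the request
URL. The assumptions here are mine: Col1 and Col3 are indexed as
numeric fields, and the host and path are a hypothetical local
install. Curly braces make both ends of a range exclusive; square
brackets would make them inclusive.

```python
from urllib.parse import urlencode

def exclusive_range(field, low, high):
    """Solr/Lucene range syntax: {a TO b} excludes both endpoints."""
    return f"{field}:{{{low!r} TO {high!r}}}"

# Mirrors: WHERE (Col1 > 1.2E2 AND Col1 < 1.8E2) OR (Col3 == 0)
query = f"({exclusive_range('Col1', 1.2e2, 1.8e2)}) OR Col3:0"

# Hypothetical host and path; "rows" caps how many matches one request returns.
params = urlencode({"q": query, "rows": 100})
url = f"http://localhost:8983/solr/select?{params}"
```

The `rows` parameter is where the deep paging concern shows up: asking
for all matching rows of a broad range in one go is exactly the case
to test before committing to this route.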

And at that size, you may well have to deal with sharding the
index (you'd have to test).

I guess my "bottom line" is that, even with data that size, you could
get Solr up and running, index the data, and see how it behaves
within a few days.

Best
Erick

On Wed, Feb 15, 2012 at 1:04 PM, Pedro Ferreira
<psilvaferre...@gmail.com> wrote:
> Hi guys,
>
> I hope I'm sending this to the right place.
>
> I have this possible idea in mind (still fuzzy, but enough to describe
> this), and I was wondering if Lucene or Solr could help in this. I've
> implemented a Lucene index on custom enterprise data before and have
> it running on Azure as well, so I know the basics of it.
>
> For this idea, these are the premises:
>
> - about 100Gb of data
> - data is expected to be in one gigantic table. Conceptually, it is like
> a spreadsheet table: rows are objects and columns are properties.
> - values are mostly floating point numbers, and I expect them to be,
> let's say, unique, or almost randomly distributed (1.89868776E+50,
> 1.434E-12)
> - The data is readonly. It will never change.
>
> Now I need to query this data based mostly on range queries on the
> columns. Something like:
>
> "SELECT * FROM Table WHERE (Col1 > 1.2E2 AND Col1 < 1.8E2) OR (Col3 == 0)"
>
> which is basically "give me all the rows that satisfy these criteria".
>
> I believe this could be easily done with a standard RDBMS, but I would
> like to avoid that route.
>
> So, is this something doable with Lucene or Solr? And if so, how much
> can be done with a stock, out of the box Lucene implementation?
>
> While thinking about this, and assuming this could work well with
> Lucene, I had 2 major questions:
>
> - Won't I get an index that will be pretty much the same size as the
> data source? I would have to index all columns from all rows, and as
> there is not much "repetition" in the data source, wouldn't the index
> almost mirror the data source?
>
> - If the data source is readonly, should I be creating the index once,
> offline, and then replicate it to the search servers?
>
> Or am I just being crazy and making a monster of a small problem? :)
>
> Thanks
> --
> Pedro Ferreira
>
> mobile: 00 44 7712 557303
> skype: pedrosilvaferreira
> email: psilvaferre...@gmail.com
> linkedin: http://uk.linkedin.com/in/pedrosilvaferreira
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
