Actually, your index may well end up larger than your source, assuming you're going to be both storing and indexing everything.
There's also the "deep paging" issue (see https://issues.apache.org/jira/browse/SOLR-1726), which comes into play if you expect to return a lot of rows. Solr really doesn't have the "cursor" concept that RDBMSs do.

My gut feeling is that Solr is primarily a *text* search engine, and this feels like something more suited to an RDBMS. That said, I'm quite sure you can make Solr/Lucene do the tricks you want if you're really RDBMS-averse <G>. And at that size, you may well have to deal with sharding the index (you'd have to test).

I guess my "bottom line" is that you could get Solr up and running, index the data, and see for yourself within a few days how it behaves with data that size.

Best
Erick

On Wed, Feb 15, 2012 at 1:04 PM, Pedro Ferreira <psilvaferre...@gmail.com> wrote:
> Hi guys,
>
> I hope I'm sending this to the right place.
>
> I have this possible idea in mind (still fuzzy, but enough to describe
> this), and I was wondering if Lucene or Solr could help in this. I've
> implemented a Lucene index on custom enterprise data before and have
> it running on Azure as well, so I know the basics of it.
>
> For this idea, these are the premises:
>
> - about 100Gb of data
> - data is expected to be in one gigantic table. Conceptually, it is like
>   a spreadsheet table: rows are objects and columns are properties.
> - values are mostly floating point numbers, and I expect them to be,
>   let's say, unique, or almost randomly distributed (1.89868776E+50,
>   1.434E-12)
> - The data is read-only. It will never change.
>
> Now I need to query this data based mostly on range queries on the
> columns. Something like:
>
> "SELECT * FROM Table WHERE (Col1 > 1.2E2 AND Col1 < 1.8E2) OR (Col3 == 0)"
>
> which is basically "give me all the rows that satisfy these criteria".
>
> I believe this could be easily done with a standard RDBMS, but I would
> like to avoid that route.
>
> So, is this something doable with Lucene or Solr? And if so, how much
> can be done with a stock, out of the box Lucene implementation?
>
> While thinking about this, and assuming this could work well with
> Lucene, I had 2 major questions:
>
> - Won't I get an index that will be pretty much the same size as the
>   data source? I would have to index all columns from all rows, and as
>   there is not much "repetition" in the data source, wouldn't the index
>   almost mirror the data source?
>
> - If the data source is read-only, should I be creating the index once,
>   offline, and then replicate it to the search servers?
>
> Or am I just being crazy and making a monster of a small problem? :)
>
> Thanks
> --
> Pedro Ferreira
>
> mobile: 00 44 7712 557303
> skype: pedrosilvaferreira
> email: psilvaferre...@gmail.com
> linkedin: http://uk.linkedin.com/in/pedrosilvaferreira
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
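
For what it's worth, the query shape in the quoted message (an open range on one column, OR an equality on another) can be sketched in a few lines. This is only a toy illustration in plain Python of range lookups over a per-column sorted index; it is not how Lucene or Solr store numeric fields internally, and the helper names (build_column_index, open_range, equals) and the sample rows are made up for the example.

```python
from bisect import bisect_left, bisect_right


def build_column_index(rows, col):
    """Sort one column's values and keep the matching row ids alongside."""
    pairs = sorted((row[col], row_id) for row_id, row in enumerate(rows))
    values = [value for value, _ in pairs]
    ids = [row_id for _, row_id in pairs]
    return values, ids


def open_range(index, low, high):
    """Row ids whose value v satisfies low < v < high (both bounds exclusive)."""
    values, ids = index
    start = bisect_right(values, low)   # skip everything <= low
    end = bisect_left(values, high)     # stop before anything >= high
    return set(ids[start:end])


def equals(index, value):
    """Row ids whose value is exactly `value`."""
    values, ids = index
    return set(ids[bisect_left(values, value):bisect_right(values, value)])


# Hypothetical sample data: rows are objects, columns are properties.
rows = [
    {"Col1": 150.0, "Col3": 7.5},            # inside the Col1 range
    {"Col1": 1.434e-12, "Col3": 0.0},        # Col3 == 0
    {"Col1": 1.89868776e50, "Col3": 3.2},    # matches neither clause
    {"Col1": 120.0, "Col3": 1.0},            # on the excluded lower bound
    {"Col1": 179.9, "Col3": 0.0},            # matches both clauses
]

col1_index = build_column_index(rows, "Col1")
col3_index = build_column_index(rows, "Col3")

# (Col1 > 1.2E2 AND Col1 < 1.8E2) OR (Col3 == 0)
matches = sorted(open_range(col1_index, 1.2e2, 1.8e2) | equals(col3_index, 0.0))
print(matches)
```

Note that each column's index holds one entry per row, which is why nearly unique values give you an index roughly as large as the data it covers, before you even count stored fields.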