Thanks Erick,

Yes, the limitations you pointed out confirm my first impression. Even if it is doable with Solr or Lucene, I would have to dig deep into it to get the most out of it.
About my RDBMS issues... there are 2 reasons:

First, I'm interested in this whole cloud craziness. I love working with Azure and want to try a different approach. In this case, I was thinking of storing the data in Data Tables and having several indexers.

Second, while 100 GB is fine for a SQL server, if it grows to 200 or 300 GB it becomes too expensive for a small open source project. On the other hand, Data Tables in Azure are much more affordable. Still expensive, but on another scale.

On Wed, Feb 15, 2012 at 9:48 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> Actually, you might well have your index be larger than your source, assuming
> you're going to be both storing and indexing everything.
>
> There's also the "deep paging" issue, see:
> https://issues.apache.org/jira/browse/SOLR-1726
> which comes into play if you expect to return a lot of rows.
> Solr really doesn't have the "cursor" concept as RDBMSs do.
>
> My gut feeling is that Solr is a *text* search engine primarily and
> this feels like something more suited to an RDBMS. That said,
> I'm quite sure you can make Solr/Lucene do the tricks you want
> if you're really RDBMS-averse <G>....
>
> And at that size, you may well have to deal with sharding the
> index (you'd have to test).
>
> I guess my "bottom line" is that you could get Solr up and running,
> index the data and just see in a few days with data that size.
>
> Best
> Erick
>
> On Wed, Feb 15, 2012 at 1:04 PM, Pedro Ferreira
> <psilvaferre...@gmail.com> wrote:
>> Hi guys,
>>
>> I hope I'm sending this to the right place.
>>
>> I have this possible idea in mind (still fuzzy, but enough to describe
>> this), and I was wondering if Lucene or Solr could help in this. I've
>> implemented a Lucene index on custom enterprise data before and have
>> it running on Azure as well, so I know the basics of it.
>>
>> For this idea, these are the premises:
>>
>> - about 100 GB of data
>> - data is expected to be in one gigantic table. Conceptually, it is like
>> a spreadsheet table: rows are objects and columns are properties.
>> - values are mostly floating point numbers, and I expect them to be,
>> let's say, unique, or almost randomly distributed (1.89868776E+50,
>> 1.434E-12)
>> - The data is read-only. It will never change.
>>
>> Now I need to query this data based mostly on range queries on the
>> columns. Something like:
>>
>> "SELECT * FROM Table WHERE (Col1 > 1.2E2 AND Col1 < 1.8E2) OR (Col3 = 0)"
>>
>> which is basically "give me all the rows that satisfy these criteria".
>>
>> I believe this could be easily done with a standard RDBMS, but I would
>> like to avoid that route.
>>
>> So, is this something doable with Lucene or Solr? And if so, how much
>> can be done with a stock, out of the box Lucene implementation?
>>
>> While thinking about this, and assuming this could work well with
>> Lucene, I had 2 major questions:
>>
>> - Won't I get an index that will be pretty much the same size as the
>> data source? I would have to index all columns from all rows, and as
>> there is not much "repetition" in the data source, wouldn't the index
>> almost mirror the data source?
>>
>> - If the data source is read-only, should I be creating the index once,
>> offline, and then replicate it to the search servers?
>>
>> Or am I just being crazy and making a monster of a small problem? :)
>>
>> Thanks
>> --
>> Pedro Ferreira
>>
>> mobile: 00 44 7712 557303
>> skype: pedrosilvaferreira
>> email: psilvaferre...@gmail.com
>> linkedin: http://uk.linkedin.com/in/pedrosilvaferreira
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>

--
Pedro Ferreira