Hello to all of you!
I'm using Lucene to index millions of relatively small documents. In fact, I'm indexing logs from a transaction-based application. Each document represents what happened during one 'transaction'. Each transaction is composed of 5-6 main 'states', which are themselves composed of a couple of 'events'. The document structure is something like this:

  State1.event1.some_key=value
  State1.event1.another_key=another_value
  [...]
  State1.event4.another_key=yet another_value
  State2.event1.a_third_key=bla bla bla
  State3.event1. ...

All in all, each document has between 10 and 250 fields. I can't fit this in a db because the nature of these 'transactions' is quite dynamic and I can't think of a [simple/maintainable] database schema. That's why Lucene is so wonderful for this particular project. I have a super generic set of classes that enables me to generate any kind of report I want. Really, it's wonderful.

Now, as you can imagine, indexing 'logs' means indexing really repetitive information. Some of the document fields contain values like 'OK', 'failed', ... Others have more 'unique' values, but all in all there is huge redundancy between all these documents. Since I'm indexing about 20 million documents per month, the size of the indices is ~35 gigs per month (that's the lower bound). I have no choice but to 'store' each field value (as well as indexing/tokenizing it) because I'll need to retrieve the values in order to create various reports. Also, I have a backlog of ~2 years of logs to index!

All this to ask:

1- Has someone out there already written an extension to Lucene so that the 'stored' string for each document/field is in fact stored in a centralized repository? Meaning, only an 'index' is actually stored in the document and the real data is put somewhere else.
2- If not, how hard would it be to write such an extension? Which classes would need to be modified? FSDirectory? Document?
3- Any ideas on how else I could do this?

I'm fully open to discussion!
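To make question 1 a bit more concrete, here is a minimal sketch of the kind of centralized repository I mean, assuming a plain in-memory string-to-id dictionary. ValueDictionary, idFor and valueFor are hypothetical names for illustration only; none of this is part of Lucene, and a real version would have to be persisted next to the index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical helper (not a Lucene class): maps each distinct field value
 * to a small integer id so that a document only has to store the id.
 */
final class ValueDictionary {
    // value -> id, used when writing documents
    private final Map<String, Integer> idsByValue = new HashMap<>();
    // id -> value, used when reading stored ids back for reports
    private final List<String> valuesById = new ArrayList<>();

    /** Returns the id for a value, assigning the next free id the first time it is seen. */
    int idFor(String value) {
        Integer id = idsByValue.get(value);
        if (id == null) {
            id = valuesById.size();
            valuesById.add(value);
            idsByValue.put(value, id);
        }
        return id;
    }

    /** Returns the original string for an id previously returned by idFor. */
    String valueFor(int id) {
        return valuesById.get(id);
    }

    /** Number of distinct values currently held. */
    int size() {
        return valuesById.size();
    }
}
```

With something like this, a repetitive value such as 'OK' would be kept once in the dictionary, and each document would carry only its short id as the stored value while the original text is still indexed as usual. Whether that can be hooked in cleanly is exactly what questions 1 and 2 are about.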
Thanks for your help!

Jp

_____________________________________________
JEAN-PHILIPPE ROBICHAUD
Speech Scientist
Professional Services
NUANCE COMMUNICATIONS, INC.
1500 University, suite 935
Montreal, Quebec H3A 3S7
514 904 7800 Office
514 843 6872 Fax
<http://www.nuance.com/> NUANCE.COM
The experience speaks for itself (tm)