Hello to all of you!

 

I'm using Lucene to index millions of relatively small documents.  In fact,
I'm indexing logs from a transaction-based application.  Each document
represents what happened during one 'transaction'.  Each of them is
composed of 5-6 main 'states', which are themselves composed of a couple of
'events'.  The document structure is something like this:

 

State1.event1.some_key=value

State1.event1.another_key=another_value

[...]

State1.event4.another_key=yet another_value

 

State2.event1.a_third_key=bla bla bla

State3.event1. ...
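
To make the naming scheme concrete, here is a minimal sketch in plain Java (no Lucene calls) of how one transaction flattens into dotted field names. The class TransactionFields and its methods are hypothetical names used only for illustration; the real code differs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TransactionFields {
    /** Builds a dotted field name such as "State1.event1.some_key". */
    static String fieldName(String state, String event, String key) {
        return state + "." + event + "." + key;
    }

    /** Flattens one sample transaction into field name -> value pairs, in insertion order. */
    static Map<String, String> sampleTransaction() {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put(fieldName("State1", "event1", "some_key"), "value");
        fields.put(fieldName("State1", "event1", "another_key"), "another_value");
        fields.put(fieldName("State2", "event1", "a_third_key"), "bla bla bla");
        return fields;
    }

    public static void main(String[] args) {
        // Each entry would become one stored and indexed field of the document.
        for (Map.Entry<String, String> e : sampleTransaction().entrySet()) {
            System.out.println(e.getKey() + "=" + e.getValue());
        }
    }
}
```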

 

All in all, each document has between 10 and 250 fields.  I can't fit this
in a db because the nature of these 'transactions' is quite dynamic and I
can't think of a [simple/maintainable] database schema.  That's why Lucene
is so wonderful for this particular project. I have a very generic set of
classes that enables me to generate any kind of report I want.  Really, it's
wonderful.

 

Now as you can imagine, indexing 'logs' means indexing really repetitive
information.  Some of the document fields contain values like 'OK' or
'failed'; others have more 'unique' values, but all in all there is huge
redundancy across these documents.  Since I'm indexing about 20
million documents per month, the size of the indices is ~35 gigs per month
(that's the lower bound).  I have no choice but to 'store' each field's value
(as well as indexing/tokenizing it) because I'll need to retrieve the values
in order to create various reports.  Also, I have a backlog of ~2 years of
logs to index!
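
To put rough numbers on this, here is a back-of-the-envelope sketch. It assumes decimal gigabytes and that the backlog months look like the current ones (20 million documents and ~35 gigs each), which is only an approximation. The class and method names are hypothetical.

```java
public class IndexSizeEstimate {
    /** Average index bytes consumed per document. */
    static long bytesPerDocument(long indexBytesPerMonth, long docsPerMonth) {
        return indexBytesPerMonth / docsPerMonth;
    }

    /** Total index size in gigabytes for a backlog of the given length. */
    static long backlogGigabytes(long gigabytesPerMonth, int months) {
        return gigabytesPerMonth * months;
    }

    public static void main(String[] args) {
        long perDoc = bytesPerDocument(35_000_000_000L, 20_000_000L);
        long backlog = backlogGigabytes(35, 24);
        System.out.println("bytes per document: " + perDoc);   // 1750
        System.out.println("2-year backlog in GB: " + backlog); // 840
    }
}
```

So under these assumptions each document costs roughly 1.75 KB of index, and the backlog alone would be on the order of 840 gigs.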

 

All this to ask: 

1-       Has anyone out there already written an extension to
Lucene so that the 'stored' string for each document/field is in fact stored
in a centralized repository? Meaning, only a reference (an 'index') is
actually stored in the document and the real data is put somewhere else.

2-       If not, how hard would it be to write such an extension?  Which
classes would need to be modified?  FSDirectory? Document?

3-       Any ideas on how else I could do this?  I'm fully open to
discussion!
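
To make question 1 concrete, here is a minimal sketch of the kind of centralized repository I have in mind. It is plain Java, not Lucene API: the class ValueDictionary and its methods are hypothetical, everything lives in memory, and persistence and concurrency are left out. The idea would be to store only the small id in each document and resolve it back to the string at report time.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ValueDictionary {
    private final Map<String, Integer> idsByValue = new HashMap<>();
    private final List<String> valuesById = new ArrayList<>();

    /** Returns the id for a value, assigning the next free id the first time it is seen. */
    public int idFor(String value) {
        Integer id = idsByValue.get(value);
        if (id == null) {
            id = valuesById.size();
            idsByValue.put(value, id);
            valuesById.add(value);
        }
        return id;
    }

    /** Resolves an id back to the original string. */
    public String valueFor(int id) {
        return valuesById.get(id);
    }

    /** Number of distinct values seen so far. */
    public int size() {
        return valuesById.size();
    }

    public static void main(String[] args) {
        ValueDictionary dict = new ValueDictionary();
        int first = dict.idFor("OK");
        int second = dict.idFor("failed");
        int again = dict.idFor("OK"); // repeated value reuses the first id
        System.out.println(first + " " + second + " " + again + " distinct=" + dict.size());
    }
}
```

With values as repetitive as 'OK' and 'failed', each distinct string would be kept once, however many documents refer to it. Whether this actually beats what Lucene already does with stored fields is exactly what I'm unsure about.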

 

Thanks for your help!

 

Jp

 

_____________________________________________

 

JEAN-PHILIPPE ROBICHAUD

Speech Scientist Professional Services

 

NUANCE COMMUNICATIONS, INC.

1500 University, suite 935

Montreal, Quebec  H3A 3S7

 

 

514 904 7800  Office

514 843 6872  Fax

 <http://www.nuance.com/> NUANCE.COM

 

The experience speaks for itself (tm)

 
