[Sorry for the long delay in my answer; we are having some issues with our mail server...]
Thanks for your comment. Yes, it would make sense if the log files were not so big. In fact, I'm only indexing a subset of the log information. Because I store the information in Lucene, it is easier and faster to retrieve. For example, generating a report by reading the logs themselves takes ~18 hours. Converting the logs to XML and indexing them takes ~12 hours, and each report then takes <20 minutes to generate. Since we often generate (different) reports, the Lucene approach is much faster.

[Note that I wrote both classes, to convert to XML and to index the logs. The log2xml step is quite useful because the XML really has all the log information, and compressing the XML with xmlppm reduces the size of the logs by 97%. This way I can archive the log.xml without wasting much space.]

As another comparison point: the logs take ~100 GB per week, while the indices take ~35 GB per month! This design is already much more efficient than the first approach, but I'm trying to make it better still. I really do think that this dictionary/catalog approach could benefit other Lucene users. I'm not against the idea of doing it myself; I just need some pointers and guidelines from all the gurus out there!

Thanks for all your help!

Jp

-----Original Message-----
From: Doron Cohen [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, October 24, 2006 1:50 AM
To: java-user@lucene.apache.org
Subject: Re: "Catalog" backend for document stored fields?

> I'm indexing logs from a transaction-based application.
> ...
> millions of documents per month, the size of the indices is ~35 gigs per
> month (that's the lower bound). I have no choice but to 'store' each field's
> values (as well as indexing/tokenizing them) because I'll need to retrieve
> them in order to create various reports. Also, I have a backlog of ~2 years
> of logs to index!
> ...
> 1- has anyone out there already written an extension to Lucene so that the
> 'stored' string for each document/field is in fact stored in a centralized
> repository? Meaning, only an 'index' is actually stored in the document and
> the real data is put somewhere else.

Do you gain anything from storing the document fields within Lucene? If not, especially since the log files are kept somewhere anyway, you could make all 'content' fields unstored (reducing index size) and add a stored, non-indexed ID field. It can also be a POINTER field - e.g. <log file name + start offset + length>. At search time, for found documents you can retrieve this ID/POINTER field and then fetch the document from the (original) log file. Makes sense?
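For concreteness, here is how Doron's pointer idea might look against the Lucene 2.0 API of the day. This is only a sketch: the field names ("content", "pointer"), the "|" separator, and the index path are invented for illustration, and the format assumes "|" never appears in a log file name.

    import java.io.RandomAccessFile;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;

    public class PointerFieldSketch {

        // Index time: make the log line searchable but unstored, and
        // store only a pointer back into the original log file.
        static void indexLogLine(IndexWriter writer, String logFile,
                                 long offset, int length, String line)
                throws Exception {
            Document doc = new Document();
            doc.add(new Field("content", line,
                              Field.Store.NO, Field.Index.TOKENIZED));
            // Assumes "|" never occurs in log file names.
            doc.add(new Field("pointer", logFile + "|" + offset + "|" + length,
                              Field.Store.YES, Field.Index.NO));
            writer.addDocument(doc);
        }

        // Search time: read back the pointer and fetch the original
        // text from the log file itself.
        static String fetchOriginal(String pointer) throws Exception {
            String[] p = pointer.split("\\|");
            RandomAccessFile log = new RandomAccessFile(p[0], "r");
            try {
                log.seek(Long.parseLong(p[1]));
                byte[] buf = new byte[Integer.parseInt(p[2])];
                log.readFully(buf);
                return new String(buf, "UTF-8");
            } finally {
                log.close();
            }
        }

        public static void main(String[] args) throws Exception {
            IndexSearcher searcher = new IndexSearcher("/path/to/index");
            QueryParser parser = new QueryParser("content", new StandardAnalyzer());
            Hits hits = searcher.search(parser.parse("timeout"));
            for (int i = 0; i < hits.length(); i++) {
                System.out.println(fetchOriginal(hits.doc(i).get("pointer")));
            }
            searcher.close();
        }
    }

One trade-off to keep in mind: the reports would then depend on the original log files staying available at their recorded paths, and byte offsets would not resolve directly into the xmlppm-compressed log.xml archives without decompressing them first.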
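And the centralized dictionary/catalog idea from the subject line, sketched as a hypothetical helper. Nothing here is an existing Lucene extension; the class, the "_key" field suffix, and the in-memory maps are all made up for illustration. The idea is to keep each distinct field value once in a shared catalog and store only a small integer key per document:

    import java.util.ArrayList;
    import java.util.HashMap;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // Hypothetical in-memory catalog; a real one would have to be
    // persisted alongside the index.
    public class FieldCatalog {
        private final HashMap keys = new HashMap();       // value -> Integer key
        private final ArrayList values = new ArrayList(); // key -> value

        public synchronized int intern(String value) {
            Integer key = (Integer) keys.get(value);
            if (key == null) {
                key = new Integer(values.size());
                keys.put(value, key);
                values.add(value);
            }
            return key.intValue();
        }

        public synchronized String lookup(int key) {
            return (String) values.get(key);
        }

        // Index time: index the real value, store only the catalog key.
        public void addCataloged(Document doc, String name, String value) {
            doc.add(new Field(name, value,
                              Field.Store.NO, Field.Index.UNTOKENIZED));
            doc.add(new Field(name + "_key", String.valueOf(intern(value)),
                              Field.Store.YES, Field.Index.NO));
        }

        // Retrieval time: map the stored key back to the shared value.
        public String getCataloged(Document doc, String name) {
            return lookup(Integer.parseInt(doc.get(name + "_key")));
        }
    }

This would pay off when field values repeat heavily across documents, which is typical of transaction logs (status codes, transaction types, host names), since each distinct value is then stored once instead of once per document.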