Otis,

Not sure I understand. Could you elaborate?
Note that the content is not stored in the index itself - hence my
confusion about your suggestion.

Thanks,
Alex.

On Mon, Oct 17, 2011 at 4:12 PM, Otis Gospodnetic
<otis_gospodne...@yahoo.com> wrote:
> Alex,
>
> You could try compressing the content field - that might help a bit.
>
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>
>>________________________________
>>From: Alex Shneyderman <a.shneyder...@gmail.com>
>>To: general@lucene.apache.org
>>Sent: Thursday, October 13, 2011 7:21 PM
>>Subject: Suggestions or best practices for indexing the logs
>>
>>Hello, everybody!
>>
>>I am trying to introduce faster searches in our application, which
>>sifts through logs, and Lucene seems to be the tool to use here. One
>>peculiarity of the problem is that there are only a few files, but
>>they contain many log statements. I avoid storing the text in the
>>index itself. Given all this, I set up indexing as follows: I iterate
>>over each log file and, for each statement in it, index the
>>statement's content as a separate document.
>>
>>Here is the Java code that adds the fields:
>>
>>    NumericField startOffset = new NumericField("so", Field.Store.YES, false);
>>    startOffset.setLongValue(statement.getStartOffset());
>>    doc.add(startOffset);
>>
>>    NumericField endOffset = new NumericField("eo", Field.Store.YES, false);
>>    endOffset.setLongValue(statement.getEndOffset());
>>    doc.add(endOffset);
>>
>>    NumericField timestampField = new NumericField("ts", Field.Store.YES, true);
>>    timestampField.setLongValue(statement.getStatementTime().getTime());
>>    doc.add(timestampField);
>>
>>    doc.add(new Field("fn", fileTagName, Field.Store.YES, Field.Index.NO));
>>    doc.add(new Field("ct", statement.getContent(), Field.Store.NO,
>>            Field.Index.ANALYZED, Field.TermVector.NO));
>>
>>With this scheme I am getting the following results (index size vs. log size):
>>
>>The size of the logs is 385 MB:
>>(00:13:08) /var/tmp/logs > du -ms /var/tmp/logs
>>385     /var/tmp/logs
>>
>>The size of the index is 143 MB:
>>(00:41:26) /var/tmp/index > du -ms /var/tmp/index
>>143     /var/tmp/index
>>
>>Is 143 MB / 385 MB a normal ratio? It seems a bit high - I would
>>expect the index to be roughly 1/5 to 1/7 of the log size. Is there
>>anything I can do to move it toward that ratio? A word histogram
>>obviously matters here, so this is the top of the output of the
>>histogram script I ran on the logs:
>>
>>Total number of words: 26935271
>>Number of different words: 551981
>>The most common words are:
>>as      3395203
>>10      797708
>>13      797662
>>2011    795595
>>at      787365
>>timer   746790
>>...
>>
>>Could anyone suggest a better way to organize the index for my logs?
>>By better I mean more compact - or is this as good as it gets? I tried
>>optimizing and got a 2 MB improvement (the index went from 145 MB to
>>143 MB).
>>
>>Could anyone point me to an article that deals with indexing logs? Any
>>help, suggestions and pointers are greatly appreciated.
>>
>>Thanks for any and all help, and cheers,
>>Alex.
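
(For reference, a minimal, self-contained sketch of the indexing loop described
in the quoted post, using the Lucene 3.x API that was current at the time. The
Statement interface and the fileTagName parameter are placeholders for the
poster's own log-parsing code, which is not shown in the thread; only the field
additions come from the original message.)

    import java.io.File;
    import java.util.Date;
    import java.util.List;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.NumericField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class LogIndexer {

        /** Placeholder for the poster's log-statement abstraction (not shown in the thread). */
        public interface Statement {
            long getStartOffset();
            long getEndOffset();
            Date getStatementTime();
            String getContent();
        }

        public void indexStatements(File indexDir, String fileTagName,
                                    List<Statement> statements) throws Exception {
            IndexWriterConfig config = new IndexWriterConfig(
                    Version.LUCENE_34, new StandardAnalyzer(Version.LUCENE_34));
            IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir), config);
            try {
                for (Statement statement : statements) {
                    Document doc = new Document();

                    // Offsets are stored but not indexed; presumably they are used to
                    // seek back into the original log file when a hit is displayed.
                    NumericField startOffset = new NumericField("so", Field.Store.YES, false);
                    startOffset.setLongValue(statement.getStartOffset());
                    doc.add(startOffset);

                    NumericField endOffset = new NumericField("eo", Field.Store.YES, false);
                    endOffset.setLongValue(statement.getEndOffset());
                    doc.add(endOffset);

                    // The timestamp is stored and indexed, so it can be range-queried.
                    NumericField timestamp = new NumericField("ts", Field.Store.YES, true);
                    timestamp.setLongValue(statement.getStatementTime().getTime());
                    doc.add(timestamp);

                    // The file tag is stored only; the statement text is indexed only
                    // (analyzed, not stored, no term vectors), as in the original post.
                    doc.add(new Field("fn", fileTagName, Field.Store.YES, Field.Index.NO));
                    doc.add(new Field("ct", statement.getContent(), Field.Store.NO,
                            Field.Index.ANALYZED, Field.TermVector.NO));

                    writer.addDocument(doc);
                }
            } finally {
                writer.close();
            }
        }
    }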