Alex,

You could try compressing the content field - that might help a bit.
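For example, something along these lines (just a rough sketch, not tested; it
assumes you want to keep a stored copy of the raw statement text, and the
"ct_z" field name is made up) - Lucene 3.x has CompressionTools for deflating
a string into a stored binary field:

    import org.apache.lucene.document.CompressionTools;
    import org.apache.lucene.document.Field;

    // Keep the analyzed, unstored "ct" field for searching exactly as you
    // have it now, and add a second field that stores the content compressed:
    byte[] compressed = CompressionTools.compressString(statement.getContent());
    doc.add(new Field("ct_z", compressed));   // binary fields are always stored

    // At search time, after fetching the hit's Document:
    // String content = CompressionTools.decompressString(hit.getBinaryValue("ct_z"));

Keep in mind this only shrinks stored data; since your "ct" field is Store.NO,
it won't do anything for the size of the inverted index itself.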
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

>________________________________
>From: Alex Shneyderman <a.shneyder...@gmail.com>
>To: general@lucene.apache.org
>Sent: Thursday, October 13, 2011 7:21 PM
>Subject: Suggestions or best practices for indexing the logs
>
>Hello, everybody!
>
>I am trying to introduce faster searches to our application that sifts
>through the logs, and Lucene seems to be the tool to use here. One
>peculiarity of the problem is that there are only a few files, but they
>contain many log statements. I avoid storing the text in the index
>itself. Given all this, I set up indexing as follows:
>
>I iterate over a log file and, for each statement in it, index the
>statement's content.
>
>Here is the Java code that adds the fields:
>
>    NumericField startOffset = new NumericField("so", Field.Store.YES, false);
>    startOffset.setLongValue(statement.getStartOffset());
>    doc.add(startOffset);
>
>    NumericField endOffset = new NumericField("eo", Field.Store.YES, false);
>    endOffset.setLongValue(statement.getEndOffset());
>    doc.add(endOffset);
>
>    NumericField timestampField = new NumericField("ts", Field.Store.YES, true);
>    timestampField.setLongValue(statement.getStatementTime().getTime());
>    doc.add(timestampField);
>
>    doc.add(new Field("fn", fileTagName, Field.Store.YES, Field.Index.NO));
>    doc.add(new Field("ct", statement.getContent(),
>            Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.NO));
>
>With this scheme I am getting the following results (index size vs. log files):
>
>The size of the logs is 385MB.
>(00:13:08) /var/tmp/logs > du -ms /var/tmp/logs
>385    /var/tmp/logs
>
>The size of the index is 143MB.
>(00:41:26) /var/tmp/index > du -ms /var/tmp/index
>143    /var/tmp/index
>
>Is 143MB / 385MB a normal ratio? It seems a bit too much (I would expect
>something like 1/5 - 1/7 for the index). Is there anything I can do to
>move this toward the desired ratio? Of course a word histogram would
>help here; this is the top of the output of the histogram script that I
>ran on the logs:
>
>Total number of words: 26935271
>Number of different words: 551981
>The most common words are:
>as 3395203
>10 797708
>13 797662
>2011 795595
>at 787365
>timer 746790
>...
>
>Could anyone suggest a better way to organize the index for my logs?
>By better I mean more compact. Or is this as good as it gets? I tried
>to optimize and got a 2MB improvement (the index went from 145MB to
>143MB).
>
>Could anyone point to an article that deals with indexing of logs? Any
>help, suggestions, and pointers are greatly appreciated.
>
>Thanks for any and all help, and cheers,
>Alex.