You should consider making each _line_ of the log file a (Lucene) document (assuming it is a one-entry-per-line log file).
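A rough sketch of what that could look like, reusing the setup from your snippet below and assuming Lucene 4.5 (the "lineno" field is just an illustration, and imports and error handling are omitted):

Directory directory = new MMapDirectory(indexPath);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_45);
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_45, analyzer);
IndexWriter iw = new IndexWriter(directory, iwc);

// Read the log line by line instead of handing the whole stream to one field.
BufferedReader reader = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8));
String line;
long lineno = 0;
while ((line = reader.readLine()) != null) {
    Document doc = new Document();
    doc.add(new StoredField("fileid", fileid));
    doc.add(new StoredField("pathname", pathname));
    doc.add(new LongField("lineno", lineno++, Field.Store.YES)); // hypothetical field, for locating hits
    doc.add(new TextField("content", line, Field.Store.NO));
    iw.addDocument(doc);
}
iw.close();

Offsets then restart with every line, so they never get anywhere near Integer.MAX_VALUE, and the stored fileid/pathname/lineno fields let you reconstruct where each hit came from.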
-Glen

On Fri, Feb 14, 2014 at 4:12 PM, John Cecere <john.cec...@oracle.com> wrote:
> I'm not sure in today's world I would call 2GB 'immense' or 'enormous'. At
> any rate, I don't have control over the size of the documents that go into
> my database. Sometimes my customer's log files end up really big. I'm
> willing to have huge indexes for these things.
>
> Wouldn't just changing from int to long for the offsets solve the problem ?
> I'm sure it would probably have to be changed in a lot of places, but why
> impose such a limitation ? Especially since it's using an InputStream and
> only dealing with a block of data at a time.
>
> I'll take a look at your suggestion.
>
> Thanks,
> John
>
> On 2/14/14 3:20 PM, Michael McCandless wrote:
>>
>> Hmm, why are you indexing such immense documents?
>>
>> In 3.x Lucene never sanity checked the offsets, so we would silently
>> index negative (int overflow'd) offsets into e.g. term vectors.
>>
>> But in 4.x, we now detect this and throw the exception you're seeing,
>> because it can lead to index corruption when you index the offsets
>> into the postings.
>>
>> If you really must index such enormous documents, maybe you could
>> create a custom tokenizer (derived from StandardTokenizer) that
>> "fixes" the offset before setting them? Or maybe just doesn't even
>> set them.
>>
>> Note that position can also overflow, if your documents get too large.
>>
>> Mike McCandless
>> http://blog.mikemccandless.com
>>
>> On Fri, Feb 14, 2014 at 1:36 PM, John Cecere <john.cec...@oracle.com> wrote:
>>>
>>> I'm having a problem with Lucene 4.5.1. Whenever I attempt to index a
>>> file > 2GB in size, it dies with the following exception:
>>>
>>> java.lang.IllegalArgumentException: startOffset must be non-negative, and
>>> endOffset must be >= startOffset,
>>> startOffset=-2147483648,endOffset=-2147483647
>>>
>>> Essentially, I'm doing this:
>>>
>>> Directory directory = new MMapDirectory(indexPath);
>>> Analyzer analyzer = new StandardAnalyzer();
>>> IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_45, analyzer);
>>> IndexWriter iw = new IndexWriter(directory, iwc);
>>>
>>> InputStream is = <my input stream>;
>>> InputStreamReader reader = new InputStreamReader(is);
>>>
>>> Document doc = new Document();
>>> doc.add(new StoredField("fileid", fileid));
>>> doc.add(new StoredField("pathname", pathname));
>>> doc.add(new TextField("content", reader));
>>>
>>> iw.addDocument(doc);
>>>
>>> It's the IndexWriter addDocument method that throws the exception. In
>>> looking at the Lucene source code, it appears that the offsets being used
>>> internally are int, which makes it somewhat obvious why this is happening.
>>>
>>> This issue never happened when I used Lucene 3.6.0. 3.6.0 was perfectly
>>> capable of handling a file over 2GB in this manner. What has changed and
>>> how do I get around this ? Is Lucene no longer capable of handling files
>>> this large, or is there some other way I should be doing this ?
>>>
>>> Here's the full stack trace sans my code:
>>>
>>> java.lang.IllegalArgumentException: startOffset must be non-negative, and
>>> endOffset must be >= startOffset,
>>> startOffset=-2147483648,endOffset=-2147483647
>>>     at org.apache.lucene.analysis.tokenattributes.OffsetAttributeImpl.setOffset(OffsetAttributeImpl.java:45)
>>>     at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:183)
>>>     at org.apache.lucene.analysis.standard.StandardFilter.incrementToken(StandardFilter.java:49)
>>>     at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
>>>     at org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:82)
>>>     at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:174)
>>>     at org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:248)
>>>     at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:254)
>>>     at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:446)
>>>     at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1551)
>>>     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1221)
>>>     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1202)
>>>
>>> Thanks,
>>> John
>>>
>>> --
>>> John Cecere
>>> Principal Engineer - Oracle Corporation
>>> 732-987-4317 / john.cec...@oracle.com
>
> --
> John Cecere
> Principal Engineer - Oracle Corporation
> 732-987-4317 / john.cec...@oracle.com
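For what it's worth, the negative values in the exception are exactly what the int overflow Mike describes produces: once the running character offset passes Integer.MAX_VALUE (roughly 2 billion characters), it wraps around. A tiny illustration:

int offset = Integer.MAX_VALUE; // 2147483647, the largest offset an int can hold
offset = offset + 1;            // wraps around
System.out.println(offset);     // prints -2147483648, the startOffset in the exception

So any single field whose text runs past 2,147,483,647 characters will produce these negative offsets.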