On Tue, Oct 14, 2014 at 1:29 AM, Trejkaz <trej...@trypticon.org> wrote:
> Bit of thread necromancy here, but I figured it was relevant because
> we get exactly the same error.

Wow, blast from the past ...

>> Is it possible you are indexing an absurdly enormous document...?
>
> We're seeing a case here where the document certainly could qualify as
> "absurdly enormous". The doc itself is 2GB in size and the
> tokenisation is per-character, not per-word, so the number of
> generated terms must be enormous. Probably enough to fill 2GB...
>
> So I'm wondering if there is more info somewhere on why this is (or
> was? We're still using 3.6.x) a limit and whether it can be detected
> up-front. Some large amount of indexing time (~30 minutes) could be
> avoided if we can detect that it would have failed ahead of time.

The limit is still there; it's because Lucene uses an int internally
to address its memory buffer.

It's probably easiest to set a limit on the maximum document size you
will index. Or, use LimitTokenCountFilter (available in newer releases)
to index only the first N tokens...

Mike McCandless

http://blog.mikemccandless.com
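As a rough, untested sketch of the LimitTokenCountFilter approach: the class
and package names below are the ones from the 3.x core analysis API (in 4.x
the filter lives in org.apache.lucene.analysis.miscellaneous), and the base
analyzer plus the 1,000,000-token cap are only placeholders to adjust for
your own setup.

    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LimitTokenCountFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.util.Version;

    // Caps every field at MAX_TOKENS tokens so a single enormous document
    // cannot exhaust the int-addressed indexing buffer.
    public class CappedAnalyzer extends Analyzer {
        private static final int MAX_TOKENS = 1000000; // placeholder cap
        private final Analyzer delegate = new StandardAnalyzer(Version.LUCENE_36);

        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            // LimitTokenCountFilter stops the stream after MAX_TOKENS tokens;
            // everything past the cap is silently dropped from the index.
            return new LimitTokenCountFilter(
                delegate.tokenStream(fieldName, reader), MAX_TOKENS);
        }
    }

If you'd rather not subclass Analyzer yourself, LimitTokenCountAnalyzer in
the same package does essentially this wrapping around an existing analyzer.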