Hi!

I'm trying to build an index over several text documents. Each document consists of tab-separated lines: a field name followed by its values, like this:

word<\t>w1<\t>w2<\t>...<\t>wn
pos<\t>pos1<\t>pos2_a:pos2_b:pos2_c<\t>...<\t>posn_a:posn_b
...

A colon-joined value such as pos2_a:pos2_b:pos2_c holds the alternative parses of the token at that position.
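To make the format concrete, here is a small standalone snippet with a made-up sample line showing how such a line splits into fields and parses (illustration only, not part of the indexing code):

public class FormatExample {
    public static void main(String[] args) {
        // made-up "pos" line: position 1 has one parse, position 2 has three
        String line = "pos\tpos1\tpos2_a:pos2_b:pos2_c";
        String[] fields = line.split("\t");
        String fieldName = fields[0]; // "pos"
        for (int position = 1; position < fields.length; ++position) {
            // colon-separated alternatives are the parses of the token at this position
            String[] parses = fields[position].split(":");
            System.out.println(fieldName + " @ position " + position + ": " + parses.length + " parse(s)");
        }
    }
}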
There are 5 documents totalling about 10 MB. While indexing, Java uses about 2 GB of RAM and finally throws an OOM error. Here is the per-token part of my indexing code:

String join_token = tok.nextToken();
// atomic tokens correspond to separate parses
String[] atomic_tokens = StringUtils.split(join_token, ':');
// marking each token with the parse number
for (int token_index = 0; token_index < atomic_tokens.length; ++token_index) {
    atomic_tokens[token_index] += String.format("|%d", token_index);
}
String join_token_with_payloads = StringUtils.join(atomic_tokens, " ");
TokenStream stream = new WhitespaceTokenizer(Version.LUCENE_41,    // <<<< the line where the leak appears
        new StringReader(join_token_with_payloads));
// all these parses belong to the same position in the document
stream = new PositionFilter(stream, 0);
stream = new DelimitedPayloadTokenFilter(stream, '|', new IntegerEncoder());
stream.addAttribute(OffsetAttribute.class);
stream.addAttribute(CharTermAttribute.class);
feature = new Field(name, join_token, attributeFieldType);
feature.setTokenStream(stream);
inDocument.add(feature);

What is wrong with this code from the memory point of view, and how can I do the indexing with as little data held in RAM as possible?

--
Best Regards,
Igor Shalyminov
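P.S. In case it helps to see the chain in isolation: below is a minimal standalone version of the analysis chain above that dumps the terms and payloads it produces. The input string is made up; only the chain itself matches the indexing code.

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilter;
import org.apache.lucene.analysis.payloads.IntegerEncoder;
import org.apache.lucene.analysis.position.PositionFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.Version;

public class StreamDump {
    public static void main(String[] args) throws Exception {
        // made-up input: three parses of one token, each suffixed with its parse number
        String join_token_with_payloads = "pos2_a|0 pos2_b|1 pos2_c|2";

        TokenStream stream = new WhitespaceTokenizer(Version.LUCENE_41,
                new StringReader(join_token_with_payloads));
        // all parses share one position; the parse number goes into the payload
        stream = new PositionFilter(stream, 0);
        stream = new DelimitedPayloadTokenFilter(stream, '|', new IntegerEncoder());

        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        PayloadAttribute payload = stream.addAttribute(PayloadAttribute.class);

        stream.reset();
        while (stream.incrementToken()) {
            System.out.println(term.toString() + " payload=" + payload.getPayload());
        }
        stream.end();
        stream.close();
    }
}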