If you are using http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#getText(o rg.pdfbox.pdmodel.PDDocument), you are going to get a large String and may need a 1G heap.
If, however, you are using http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#writeText (org.pdfbox.pdmodel.PDDocument,%20java.io.Writer) to go via a temporary file, you will not need so much RAM, but you need to use http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.html #Field(java.lang.String,%20java.io.Reader) to construct your Lucene field (rather than http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.html #Field(java.lang.String,%20java.lang.String,%20org.apache.lucene.document.Fi eld.Store,%20org.apache.lucene.document.Field.Index)). -----Original Message----- From: Suba Suresh [mailto:[EMAIL PROTECTED] Sent: 13 July 2006 14:55 To: java-user@lucene.apache.org Subject: Out of memory error I am indexing different document formats with lucene 1.9. One of the pdf file I am indexing is 300MG. Whenever the index writer hits that file it stops the indexing with "Out of Memory" exception. I am using the pdf box library to index. I have set the following merge factors in my code. writer.setMergeFactor(1000); writer.setMaxMergeDocs(9999999); writer.setMaxBufferedDocs(1000); writer.setMaxFieldLength(Integer.MAX_VALUE); I would like any help and suggestions. thanks, suba suresh. --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
smime.p7s
Description: S/MIME cryptographic signature