I use PDFBox version 1.1.0; I did find a workaround now. Just wondering which tools do you use to extract text from pdf? Thanks.
On Wed, Oct 13, 2010 at 11:36 AM, Fabiano Nunes <fabi...@nunes.me> wrote: > What version of PDFBox are you running? > PDFBox 0.72 does not work properly with some pdf documents. See more in > https://issues.apache.org/jira/browse/PDFBOX-361. > So, I wrote a extractor (a copy of the original, in fact) based on trunk > version (1.2.1, actually). Furthermore, this version is faster e more > efficient than 0.72, preventing from OutOfMemory. Nowadays, I don't use > PDFBox to extract text. I prefer poppler tools. > > On Wed, Oct 13, 2010 at 2:22 PM, Ching <zchin...@gmail.com> wrote: > > > Hi, > > > > Thank you for your suggestions. I found the reason which is that PDFBox > > seems having problem parsing large document (20MB), I have a few of them > > within those 2000 docs, those are the ones throwing OutOfMemory errors. > The > > app does exit, and JVM died. I am running on 32bit machine. > > > > -- Ching > > > > On Tue, Oct 12, 2010 at 9:42 PM, Anshum <ansh...@gmail.com> wrote: > > > > > Hi Ching, > > > Does the app exit or hang and stay there? as in does the JVM stay alive > > and > > > idle? > > > Also, can you make sure that its not the pdfbox? as in, try commenting > > the > > > indexwriter part and just read the pdfs, does that work fine. > > > Can you also post info on your environment? > > > Index Size? Lucene Version? Machine and JVM (32/64 bit)? > > > This most probably seems like a code level issue rather than lucene, > but > > I > > > may be wrong. > > > > > > -- > > > Anshum Gupta > > > http://ai-cafe.blogspot.com > > > > > > > > > On Wed, Oct 13, 2010 at 8:08 AM, Ching <zchin...@gmail.com> wrote: > > > > > > > Hi All, > > > > > > > > Can anyone help with this issue? I have about 2000 pdf files that I > use > > > > PDFBox to extract its text, then index them using for loop. The > > indexing > > > > stopped after the fdt file reaches at 7,061 KB in size. There is no > > > error, > > > > the indexing simply stopped. Thanks in advance for any help. > > > > > > > > Ching > > > > > > > > > >