Re: PDF / Word document parsers

Kelvin Tan Thu, 18 Apr 2002 23:05:20 -0700

Anita,

I've experienced a moderate amount of success using Etymon for PDF parsing.
It does consume quite alot of memory for larger PDF documents, but otherwise
it's ok. What difficulties are you facing?


For MS Word parsing, The Jakarta POI project is working something out, but
in the meanwhile I've managed to search MS Word documents by reading the
file and stripping out nonsense characters. It's a hack I think, but if I
increase the indexWriter's maxFieldLength to about a million, I can search
like 13-15MB word documents with ease.

Kelvin
----- Original Message -----
From: "Anita Srinivas" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Friday, April 19, 2002 2:13 PM
Subject: PDF / Word document parsers


Hi...

I have been looking for PDF and Word document parsers.  I have tried the
contributions page on the Lucene site as suggested by a Lucene User. The
PJEtymon does not have a Windows version.  The XPDF does not do the parsing
very well.

Can someone  suggest some better Word document or PDF parsers other than the
ones I mentioned here, .

Thanks

Anita Srinivas



--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>

Re: PDF / Word document parsers

Reply via email to