PDFTextStream's jar is < 400K. This may not be relevant if Bill is
only interested in open-source options, but I thought I'd put it out
there anyway.
Chas Emerick
PDFTextStream: fast PDF text extraction for Java apps and Lucene
http://snowtide.com/home/PDFTextStream/
On Oct 17, 2004, at 12:51 AM, Ben Litchfield wrote:
The latest PDFBox jar is 2179K, which, as you point out, is
significantly larger than the jar in Parsnips. The majority of that
space is used by cmap mapping files needed for proper text extraction,
so any classes that could be removed would only result in a minor size
reduction. I would think that the capability of indexing PDF documents
would outweigh the extra time for the download.
Ben
On Sat, 16 Oct 2004, Bill Tschumy wrote:
On Oct 16, 2004, at 9:47 PM, Ben Litchfield wrote:
types. It uses Lucene underneath. I'm thinking about extending it in
the direction that Google Desktop is going and automatically indexing
certain file types and directories in your system.
And of course supporting PDF documents, right?
Ben
http://www.pdfbox.org
Ahem... right... My next version will do a better job with PDF and
RTF files. I've looked at PDFBox, but the jar file is so big that I
hate to burden my users by incorporating it. Any chance of getting a
smaller version that just does the text extraction? Your jar file is
more than twice the size of my entire application including
documentation. I really would like to solve this problem.
--
Bill Tschumy
Otherwise -- Austin, TX
http://www.otherwise.com
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]