+1, as I'm usually in favour of using the Java library.
Tilman
On 18.02.2014 at 21:42, John Hewson wrote:
The streams used by BaseParser and PDFParser are sequential, so you can ignore
them.
Use of PushBackInputStream in the non-sequential parser seems a little odd.
We might want to think about getting rid of the classes in org.apache.pdfbox.io
and replacing them with classes from java.nio.channels. It looks like the PDFBox
classes pre-date NIO.
With NIO we could use memory-mapped files, which for large PDF files will
perform better than an InputStream.
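To make the memory-mapped idea concrete, here is a minimal sketch using only the standard java.nio API. The class and method names (MappedPdfReader, map, readAscii) are made up for illustration and are not part of PDFBox. It also assumes the file fits in a single mapping, which is limited to Integer.MAX_VALUE bytes.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not a PDFBox class.
final class MappedPdfReader {

    /** Maps the whole file read-only. Assumes the file is at most Integer.MAX_VALUE bytes. */
    static MappedByteBuffer map(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping remains valid after the channel is closed.
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }

    /** Reads 'length' bytes starting at an absolute offset, leaving the buffer position untouched. */
    static String readAscii(MappedByteBuffer buffer, int offset, int length) {
        byte[] bytes = new byte[length];
        for (int i = 0; i < length; i++) {
            bytes[i] = buffer.get(offset + i); // absolute get, no seek needed
        }
        return new String(bytes, StandardCharsets.US_ASCII);
    }
}
```

Because every read is an absolute get on the buffer, a parser built on this would not need a push-back stream to jump around in the file, for example to follow xref offsets.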
-- John
On 18 Feb 2014, at 03:53, Maruan Sahyoun <[email protected]> wrote:
Hi,
there are currently a number of different options to use as a base for a
potential new parser/lexer. The ones currently in use are
BaseParser:
import org.apache.pdfbox.io.PushBackInputStream;
import org.apache.pdfbox.io.RandomAccess;
PDFParser (additional):
import org.apache.pdfbox.io.RandomAccess;
NonSequentialParser:
import org.apache.pdfbox.io.PushBackInputStream;
import org.apache.pdfbox.io.RandomAccess;
import org.apache.pdfbox.io.RandomAccessBuffer;
import org.apache.pdfbox.io.RandomAccessBufferedFileInputStream;
There are some additional classes/interfaces in the io package, e.g.
RandomAccessBufferedFileInputStream, which implements RandomAccessRead.
Any preferences or ideas for consolidating this?
(Currently I’m using RandomAccessBufferedFileInputStream with some additional
implementations of RandomAccessRead to support reading from a byte array for
testing purposes.)
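As a rough illustration of what such a byte-array-backed implementation could look like: the interface below (SeekableSource) is a simplified stand-in invented for this sketch, not the actual RandomAccessRead signature, and ByteArraySource is likewise a hypothetical name.

```java
// Hypothetical, simplified stand-in for a random-access read interface.
interface SeekableSource {
    int read();                 // next byte as 0..255, or -1 at end of data
    void seek(long position);   // move to an absolute offset
    long getPosition();
    long length();
}

// In-memory implementation, handy for unit tests that should not touch the file system.
final class ByteArraySource implements SeekableSource {
    private final byte[] data;
    private int position;

    ByteArraySource(byte[] data) {
        this.data = data.clone(); // defensive copy so callers cannot mutate the source
    }

    @Override
    public int read() {
        return position < data.length ? data[position++] & 0xFF : -1;
    }

    @Override
    public void seek(long newPosition) {
        if (newPosition < 0 || newPosition > data.length) {
            throw new IllegalArgumentException("Position out of range: " + newPosition);
        }
        position = (int) newPosition;
    }

    @Override
    public long getPosition() {
        return position;
    }

    @Override
    public long length() {
        return data.length;
    }

    /** Looks at the next byte without consuming it, by reading and stepping back. */
    int peek() {
        int b = read();
        if (b != -1) {
            position--;
        }
        return b;
    }
}
```

With seek available, "unreading" is just stepping the position back, so a lexer on top of this would not need a separate push-back stream.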
BR
Maruan Sahyoun