[Tika Wiki] Update of "PDFBOX_2_X_NOTES" by TimothyAllison

Apache Wiki Thu, 16 Jul 2015 12:34:03 -0700

The "PDFBOX_2_X_NOTES" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/PDFBOX_2_X_NOTES

New page:
= Upgrading to PDFBox 2.0 =

PDFBox is on its way to releasing 2.0, which represents a major shift from the 
1.8.x branch.  This page documents some differences that Tika users should 
expect after the upgrade.  Progress is tracked in 
[[https://issues.apache.org/jira/browse/TIKA-1285|TIKA-1285]].

== NonseqParser ==
With 2.x, the older parser is gone and the NonSequential parser is the only 
parser available.  In 1.8.x, Tika users can choose whether to use the 
NonSequential parser via the config file.  This option will disappear in 2.x.

== Speed/Memory ==
This is still in a state of flux.  With some changes over the last few days, 
speed in 2.x appears to be equivalent to 1.8.x with the NonSequential parser.  
That said, the NonSequential parser is slightly slower than the older 1.8.x 
parser (TODO: benchmarks).

Memory use configuration is currently going through some upgrades. It looks 
like the clients will be able to set a threshold and PDFBox will choose when to 
buffer to disk to stay under the desired memory threshold.
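
To make the threshold idea concrete, here is a minimal sketch of a buffer that stays in memory until a byte limit would be exceeded and then moves to a temp file.  It is purely illustrative: the class name SpillingBuffer and all of its methods are hypothetical, it uses only the JDK, and it is not PDFBox code or the API that PDFBox 2.0 will expose.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical illustration of threshold-based buffering; not PDFBox code. */
final class SpillingBuffer extends OutputStream {
    private final long maxMemoryBytes;
    private ByteArrayOutputStream memory = new ByteArrayOutputStream();
    private OutputStream disk;   // non-null once the data has moved to disk
    private Path spillFile;
    private long size;

    SpillingBuffer(long maxMemoryBytes) {
        this.maxMemoryBytes = maxMemoryBytes;
    }

    @Override
    public void write(int b) throws IOException {
        write(new byte[] {(byte) b}, 0, 1);
    }

    @Override
    public void write(byte[] data, int off, int len) throws IOException {
        // Spill before the write that would push us over the threshold.
        if (disk == null && size + len > maxMemoryBytes) {
            spillToDisk();
        }
        if (disk != null) {
            disk.write(data, off, len);
        } else {
            memory.write(data, off, len);
        }
        size += len;
    }

    private void spillToDisk() throws IOException {
        spillFile = Files.createTempFile("spill", ".tmp");
        disk = Files.newOutputStream(spillFile);
        memory.writeTo(disk);   // copy what was buffered so far
        memory = null;          // release the in-memory copy
    }

    boolean isOnDisk() {
        return disk != null;
    }

    long size() {
        return size;
    }

    @Override
    public void close() throws IOException {
        if (disk != null) {
            disk.close();
            Files.deleteIfExists(spillFile);
        }
    }
}
```

The caller only picks the limit, e.g. new SpillingBuffer(64L * 1024 * 1024); the decision of when to go to disk is made inside the buffer, which is the division of labor described above.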

== Character Encodings ==
 * I've noticed a handful of cases where ligatures in 1.8 are "spelled out" in 
2.0 -- e.g. "identi[fi]cation" in 1.8 has become "identification" in 2.0 (at 
least in 003403.pdf from govdocs1).
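
When diffing extracted text between 1.8 and 2.0, differences like this can be filtered out by normalizing both sides first.  A minimal sketch, assuming the 1.8 output contains the Unicode ligature character U+FB01 where "[fi]" is shown above; the class and method names are hypothetical, and only the JDK's java.text.Normalizer is used:

```java
import java.text.Normalizer;

/** Hypothetical helper for comparing text extracted by 1.8 and 2.0. */
final class LigatureCompare {
    private LigatureCompare() {}

    /** Expands compatibility characters such as U+FB01 (fi ligature) into plain letters. */
    static String expandLigatures(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    /** True if the two strings differ only in characters that NFKC folds together. */
    static boolean sameAfterExpansion(String a, String b) {
        return expandLigatures(a).equals(expandLigatures(b));
    }
}
```

For example, LigatureCompare.sameAfterExpansion("identi\uFB01cation", "identification") returns true.  Note that NFKC also folds other compatibility characters (full-width forms, superscripts, etc.), so it is only suitable for comparison, not for rewriting the extracted text.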
