Dear Wiki user, You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.
The "PDFBOX_2_X_NOTES" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/PDFBOX_2_X_NOTES

New page:

= Upgrading to PDFBox 2.0 =

PDFBox is on its way to releasing 2.0. This version represents a major shift from the 1.8.x branch. We'll document some expected differences from a user/consumer perspective on the upgrade. The issue to track progress is [[https://issues.apache.org/jira/browse/TIKA-1285|TIKA-1285]].

== NonseqParser ==

With 2.x, the older parser is gone, and the NonSequential parser is the main/only parser available. In 1.8.x, users of Tika can configure the use of the NonSequential parser via the config file. This choice will disappear in 2.x.

== Speed/Memory ==

This is still in a state of flux. With some changes over the last few days, the speed of 2.x appears to be equivalent to that of 1.8.x with the non-sequential parser. That said, the non-sequential parser is slightly slower than the older parser (TODO: benchmarks).

Memory use configuration is currently going through some upgrades. It looks like clients will be able to set a threshold, and PDFBox will choose when to buffer to disk to stay under the desired memory threshold.

== Character Encodings ==

 * I've noticed a handful of cases where ligatures in 1.8 are "spelled out" in 2.0, e.g. "identi[fi]cation" in 1.8 has become "identification" in 2.0 (at least in 003403.pdf from govdocs1).
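
For anyone comparing extracted text between 1.8.x and 2.0, the ligature difference above can be neutralized before diffing. The following is only a sketch, not a description of what PDFBox does internally: it assumes the 1.8.x output contains the Unicode ligature code point (e.g. U+FB01 for "fi"), and it uses the JDK's `java.text.Normalizer` with NFKC to expand it. The class and method names are hypothetical. Note that NFKC also folds other compatibility characters (full-width forms, superscripts, etc.), so it may hide differences you do care about.

```java
import java.text.Normalizer;

// Hypothetical helper for comparing 1.8.x and 2.0 text output.
final class LigatureNormalizer {

    // Expands compatibility characters such as the "fi" ligature (U+FB01)
    // into their plain-letter equivalents using Unicode NFKC normalization.
    static String expandLigatures(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Assumed shape of the 1.8.x output: a single ligature code point.
        String from18 = "identi\uFB01cation";
        // Shape of the 2.0 output reported above: ligature spelled out.
        String from20 = "identification";

        // After normalization the two strings compare equal.
        System.out.println(expandLigatures(from18).equals(from20)); // prints "true"
    }
}
```

Normalizing both sides before comparison keeps the diff focused on real extraction differences rather than on how ligatures are represented.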
