[Tika Wiki] Update of "PDFBOX_2_X_NOTES" by TimothyAllison

Apache Wiki Fri, 17 Jul 2015 06:58:23 -0700


The "PDFBOX_2_X_NOTES" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/PDFBOX_2_X_NOTES?action=diff&rev1=1&rev2=2

  == Character Encodings ==
   * I've noticed a handful of cases where ligatures in 1.8 are "spelled out" 
in 2.0 -- e.g. "identi[fi]cation" in 1.8 has become "identification" in 2.0 (at 
least in 003403.pdf from govdocs1).
  
+ == TIFF Extraction ==
+ PDFBox 2.0 no longer extracts TIFFs unless consumers add supplementary 
libraries, which are not Apache-license friendly, to the classpath.  For now, 
with straight Tika+PDFBox, if "extractInlineImages" is set to true and a TIFF 
is encountered, a zero-byte InputStream is sent to the embedded (TIFF) 
parser, which in turn throws an exception.  With the standard 
AutoDetectParser(), this embedded-document exception is caught, swallowed and 
ignored.  The RecursiveParserWrapper also catches these exceptions, but it 
lets users see how many TIFFs they aren't getting and which files contain 
TIFFs.
+ 
+ To get a sense of the external libraries you'll need to add, take a look at 
this [[http://svn.apache.org/repos/asf/pdfbox/trunk/tools/pom.xml|pom]].
+
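As a rough illustration of the tallying described in the TIFF Extraction note, the sketch below counts embedded-document failures per container file. It does not call Tika: each document is modelled as a plain map of metadata keys to values, one map per container or embedded document, which is only an approximation of what a recursive wrapper hands back. The class name, the method names, the sample file names and the metadata key string are all assumptions made for this example; check the constant your Tika version actually exposes. The count covers every embedded failure, not only TIFFs, so treat it as an upper bound on missing TIFFs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EmbeddedFailureTally {

    // ASSUMPTION: key under which a swallowed embedded-document exception is
    // recorded. Verify against the Tika version in use.
    static final String EMBEDDED_EXCEPTION_KEY = "X-TIKA:EXCEPTION:embedded_exception";

    /** Counts the metadata maps that carry an embedded-exception entry. */
    static int countEmbeddedFailures(List<Map<String, String>> metadataList) {
        int failures = 0;
        for (Map<String, String> metadata : metadataList) {
            if (metadata.containsKey(EMBEDDED_EXCEPTION_KEY)) {
                failures++;
            }
        }
        return failures;
    }

    /** Maps each container file to its failure count, skipping files with none. */
    static Map<String, Integer> filesWithFailures(
            Map<String, List<Map<String, String>>> metadataPerFile) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<Map<String, String>>> e : metadataPerFile.entrySet()) {
            int failures = countEmbeddedFailures(e.getValue());
            if (failures > 0) {
                result.put(e.getKey(), failures);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical data: one container document plus one failed embedded image.
        Map<String, String> container = new HashMap<>();
        container.put("Content-Type", "application/pdf");
        Map<String, String> failedImage = new HashMap<>();
        failedImage.put(EMBEDDED_EXCEPTION_KEY, "java.io.IOException: zero-byte stream");

        List<Map<String, String>> withTiff = new ArrayList<>();
        withTiff.add(container);
        withTiff.add(failedImage);

        Map<String, List<Map<String, String>>> perFile = new LinkedHashMap<>();
        perFile.put("with-tiff.pdf", withTiff);
        perFile.put("text-only.pdf", new ArrayList<>());

        System.out.println(filesWithFailures(perFile));
    }
}
```

In real use the per-file lists would come from the wrapper's own metadata output rather than hand-built maps; the tallying logic stays the same.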
