[Tika Wiki] Update of "ComparisonTikaAndPDFToText201811" by TimothyAllison

Apache Wiki Mon, 26 Nov 2018 15:05:19 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.


The "ComparisonTikaAndPDFToText201811" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/ComparisonTikaAndPDFToText201811?action=diff&rev1=8&rev2=9

   2. Are there systematic areas for improvement in PDFBox for ''hi'' (-8.5%), 
''he'' (-5.1%) and the Arabic-script languages ''ar'' (-18%), ''fa'' (-8%) and 
''ur'' (-74%)?
   3. Are there systematic areas for improvement in pdftotext for the CJK 
languages ''ja'' (4%), ''ko'' (3%), ''zh-cn'' (5%) and ''zh-tw'' (0.8%)?
  
+ Most importantly, we need to determine whether any of the above areas of 
inquiry stem from faults in tika-eval that should be fixed.
+ 
  = Overall improvements to this process =
   * The wrapper around pdftotext should have caught the exceptions that 
pdftotext wrote to stderr and stored them, as we do with exceptions from Tika.
   * Tika currently includes the file's 'title' metadata in the extracted 
content.  This gives the misleading impression that some content was extracted 
from the file when, in fact, only the title was extracted from the XMP or 
other metadata.  Next time, we should use a content handler that includes only 
the extracted text.
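
As a rough illustration of the first point above, a wrapper could capture stderr next to the extracted text so that both can be stored per file. This is only a minimal sketch in Python, not the wrapper actually used for this comparison: the names `ExtractResult` and `run_extractor`, the timeout value, and the choice to record a timeout as a stderr message are all assumptions made for the example.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class ExtractResult:
    """Hypothetical record of one extractor run (not a tika-eval type)."""
    text: str        # whatever the tool wrote to stdout
    stderr: str      # warnings/errors the tool wrote to stderr
    returncode: int  # process exit code, or -1 if we gave up on a timeout


def run_extractor(cmd, timeout=120):
    """Run an external extractor and keep its stderr alongside its output.

    `cmd` is the full argument list; nothing here is specific to pdftotext.
    """
    try:
        proc = subprocess.run(
            cmd,
            capture_output=True,   # collect both stdout and stderr
            text=True,             # decode bytes to str
            errors="replace",      # do not fail on undecodable bytes
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # Assumption: a timeout is stored like any other failure message.
        return ExtractResult("", f"timeout after {timeout}s", -1)
    return ExtractResult(proc.stdout, proc.stderr, proc.returncode)
```

With pdftotext this might be called as `run_extractor(["pdftotext", "-enc", "UTF-8", "sample.pdf", "-"])`, where `sample.pdf` is a placeholder and the trailing `-` asks pdftotext to write the text to stdout. The `stderr` field could then be stored in the same place as the stack traces recorded for Tika.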
