Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.

The "ComparisonTikaAndPDFToText201811" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/ComparisonTikaAndPDFToText201811?action=diff&rev1=8&rev2=9

  2. Are there systematic areas for improvement in PDFBox for ''hi'' (-8.5%), ''he'' (-5.1%) and Arabic script languages: ''ar'' (-18%), ''fa'' (-8%), ''ur'' (-74%)?
  3. Are there systematic areas for improvement in pdftotext for CJK languages: ''ja'' (4%), ''ko'' (3%), ''zh-cn'' (5%), ''zh-tw'' (0.8%)?
+ Most importantly, we need to determine whether any of the above areas for inquiry are based on faults in tika-eval that should be fixed.
+ 
  = Overall improvements to this process =
   * The wrapper around pdftotext should have "caught" the error written to stderr and stored it, as we do with exceptions from Tika.
   * Tika currently includes the file's 'title' metadata in the content of the file. This gives the misleading impression that some content was extracted from the file when, in fact, only the title was extracted from the XMP or other metadata. Next time, we should use a content handler that only includes the extracted text.
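
The first bullet above (capturing what pdftotext writes to stderr and storing it alongside the extracted text) could be sketched roughly as follows. This is only an illustration, not the wrapper that was actually used for the comparison: `ExtractResult`, `run_extractor`, and the 60-second default timeout are hypothetical names and values chosen for the example.

```python
import subprocess
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ExtractResult:
    """Outcome of one external extraction run (hypothetical record type)."""
    text: str                 # whatever the tool wrote to stdout
    error: Optional[str]      # stderr output or a synthetic message; None if clean
    exit_code: Optional[int]  # None if the process never started or timed out


def run_extractor(cmd: List[str], timeout_s: float = 60.0) -> ExtractResult:
    """Run an external extractor and keep stderr instead of discarding it."""
    try:
        proc = subprocess.run(
            cmd,
            capture_output=True,  # collect stdout and stderr separately
            text=True,
            errors="replace",     # do not fail on undecodable bytes
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return ExtractResult("", f"timeout after {timeout_s}s", None)
    except OSError as e:
        # e.g. the binary is not installed or not executable
        return ExtractResult("", f"could not start process: {e}", None)

    stderr = proc.stderr.strip()
    if proc.returncode != 0 or stderr:
        # Record the message even when the exit code is 0, because a tool
        # may print warnings to stderr and still produce (partial) text.
        error = stderr or f"exit code {proc.returncode}"
    else:
        error = None
    return ExtractResult(proc.stdout, error, proc.returncode)
```

Assuming pdftotext is on the PATH and `input.pdf` is a placeholder file name, a call might look like `run_extractor(["pdftotext", "-enc", "UTF-8", "input.pdf", "-"])`, where the trailing `-` asks pdftotext to write the text to stdout. Storing `error` next to `text` would let the pdftotext failures be counted in the same way as the exceptions recorded for Tika.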