[ https://issues.apache.org/jira/browse/TIKA-4438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17983630#comment-17983630 ]
Tilman Hausherr commented on TIKA-4438:
---------------------------------------

There are two differences, but these might be related to POI itself:

govdocs1/676/676323.ppt:
19.819.8: 1 | cost: 1 | weight: 1 => 19.8: 2 | weightcost: 1
but this is because it's diagonal
(Also govdocs1/009/009393.ppt, govdocs1/009/009392.ppt, govdocs1/011/011867.ppt)

govdocs1/756/756943.ppt:
exam: 1 | straight: 1 => e: 1 | ht: 1 | straig: 1 | xam: 1
I suspect this happens on page 84. But the words "straight" and "exam" are on different lines? Maybe this is related to some weird font characteristics like we sometimes have in PDFBox.
Also
commoncrawl3/HI/HIY63WATHM6GVAA6XDPE43NYDMMHGBDD
commoncrawl3/35/35LR4EHASRFE3GQRIGAGKFGTPQCOGXCI
govdocs1/541/541210.tmp

However, most of these files are better.

> Prepare for 3.2.1 release
> -------------------------
>
>                 Key: TIKA-4438
>                 URL: https://issues.apache.org/jira/browse/TIKA-4438
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: tika-3.2.1-reports.tgz
>
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)