[ 
https://issues.apache.org/jira/browse/TIKA-4438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17983630#comment-17983630
 ] 

Tilman Hausherr commented on TIKA-4438:
---------------------------------------

There are two differences, but these might be related to POI itself:

govdocs1/676/676323.ppt: 19.819.8: 1 | cost: 1 | weight: 1 => 19.8: 2 | weightcost: 1
but this is because the text is diagonal

(Also govdocs1/009/009393.ppt, govdocs1/009/009392.ppt, govdocs1/011/011867.ppt)


govdocs1/756/756943.ppt: exam: 1 | straight: 1 => e: 1 | ht: 1 | straig: 1 | xam: 1

I suspect this happens on page 84. But the words straight and exam are on 
different lines? Maybe this is related to some weird font characteristics like 
we sometimes have in PDFBox.
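
For anyone reading the "=>" notation above: the left side lists tokens that only appear in the old extract, the right side tokens that only appear in the new one. A minimal Python sketch of how such a diff can arise when word boundaries move is below. The tokenizer, the function names and the sample strings are made up for illustration; this is not the code that produced the reports.

```python
from collections import Counter
import re


def token_counts(text):
    # Hypothetical tokenizer: lowercase runs of letters, digits and dots.
    return Counter(re.findall(r"[a-z0-9.]+", text.lower()))


def count_diff(before, after):
    # Counter subtraction keeps only positive counts, so each side holds
    # the tokens that are more frequent in one extract than in the other.
    a, b = token_counts(before), token_counts(after)
    return dict(a - b), dict(b - a)


# Made-up extracts in which only the whitespace positions differ.
print(count_diff("weight cost 19.819.8", "weightcost 19.8 19.8"))
print(count_diff("straight exam", "straig ht e xam"))
```

In both cases the characters are the same and only the places where the text is split into words change, which fits the suspicion that this comes from the layout or font handling rather than from missing content.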


Also
commoncrawl3/HI/HIY63WATHM6GVAA6XDPE43NYDMMHGBDD
commoncrawl3/35/35LR4EHASRFE3GQRIGAGKFGTPQCOGXCI
govdocs1/541/541210.tmp

However, most of these files are better.

> Prepare for 3.2.1 release
> -------------------------
>
>                 Key: TIKA-4438
>                 URL: https://issues.apache.org/jira/browse/TIKA-4438
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: tika-3.2.1-reports.tgz
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
