[jira] [Updated] (TIKA-4231) Parsing Arabic PDF is returning bad data

Aamir (Jira) Fri, 29 Mar 2024 12:06:49 -0700


     [ 
https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Aamir updated TIKA-4231:
------------------------
    Description: 
Attached is a PDF containing Arabic text. 
When parsed with Tika version 2.6.0 or 2.9.1, it produces gibberish 
characters. 

The generated text file, which contains the parsed output, is also attached. 

Most other Arabic PDFs parse fine, but this one produces the garbled output. 

  was:
Attached is a PDF containing Arabic text. 
When parsed with Tika version 2.6.0, it produces gibberish characters. 

The generated text file, which contains the parsed output, is also attached. 

Most other Arabic PDFs parse fine, but this one produces the garbled output. 


> Parsing Arabic PDF is returning bad data
> ----------------------------------------
>
>                 Key: TIKA-4231
>                 URL: https://issues.apache.org/jira/browse/TIKA-4231
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 2.6.0, 2.9.1
>         Environment: Java 18, with the Maven dependency 
> tika-parsers-standard-package 
> (https://mvnrepository.com/artifact/org.apache.tika/tika-parsers/2.6.0)
>  
>            Reporter: Aamir
>            Priority: Major
>         Attachments: arabic.pdf, arabic.txt
>
>
> Attached is a PDF containing Arabic text. 
> When parsed with Tika version 2.6.0 or 2.9.1, it produces gibberish 
> characters. 
> The generated text file, which contains the parsed output, is also attached. 
> Most other Arabic PDFs parse fine, but this one produces the garbled output. 
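
Since the report says most Arabic PDFs parse fine and only some come out as
gibberish, one rough way to flag the bad ones in a batch is to measure how
much of the extracted text is actually Arabic script. The sketch below is only
an illustration and is not part of Tika: the class name ArabicTextCheck, both
method names, and the threshold parameter are hypothetical, and it assumes the
extracted text is already available as a Java String.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical helper (not a Tika API): estimates how much of an extracted
// text is Arabic script, as a rough signal that extraction went wrong.
final class ArabicTextCheck {

    // Fraction of letters whose Unicode script is ARABIC.
    // Returns 0.0 when the text contains no letters at all.
    static double arabicLetterRatio(String text) {
        int letters = 0;
        int arabic = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // skip digits, punctuation, whitespace
            }
            letters++;
            if (UnicodeScript.of(cp) == UnicodeScript.ARABIC) {
                arabic++;
            }
        }
        return letters == 0 ? 0.0 : (double) arabic / letters;
    }

    // minRatio is an assumed, caller-chosen threshold; a suitable value
    // depends on how much non-Arabic text the documents normally contain.
    static boolean looksGarbled(String text, double minRatio) {
        return arabicLetterRatio(text) < minRatio;
    }

    public static void main(String[] args) {
        String sample = "\u0645\u0631\u062D\u0628\u0627 abc"; // Arabic word plus Latin letters
        System.out.println(arabicLetterRatio(sample));
    }
}
```

A low ratio only suggests a problem; it does not identify the cause, and a
document that legitimately mixes scripts would need a lower threshold.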



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
