[jira] [Commented] (TIKA-3952) Content mismatch

2023-01-11 Thread Tika User (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17665545#comment-17665545
 ] 

Tika User commented on TIKA-3952:
-

Got it. Thanks.

> Content mismatch
> ----------------
>
> Key: TIKA-3952
> URL: https://issues.apache.org/jira/browse/TIKA-3952
> Project: Tika
> Issue Type: Bug
> Affects Versions: 2.6.0
> Reporter: Tika User
> Priority: Major
> Attachments: download.pdf
>
>
> While extracting the content of the attached file, we are seeing the content
> mismatches below.
> Native file content  : 95 (1972); Erznoznik v. City of Jacksonville
> Content we got from Tika : 95 (1972); Er{*}e{*}noznik v. City of Jacksonville
>  
> Native file content   : 438 U.S.\n726
> Content we got from Tika : 438 {*}U-S{*}.\n726
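
Both mismatches reported above are single-character substitutions. A minimal sketch, using only Python's standard `difflib`, of how such differences could be located programmatically; the function name `char_diffs` is hypothetical and not part of Tika, and the Jira bold markup ({*}...{*}) is assumed to have been stripped from the strings first:

```python
from difflib import SequenceMatcher


def char_diffs(expected: str, actual: str) -> list[tuple[str, str]]:
    """Return (expected_fragment, actual_fragment) pairs where the strings differ."""
    matcher = SequenceMatcher(None, expected, actual, autojunk=False)
    return [
        (expected[i1:i2], actual[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only replaced, inserted, or deleted fragments
    ]


# The two pairs reported in this issue, with the bold markup removed.
pairs = [
    ("Erznoznik v. City of Jacksonville", "Erenoznik v. City of Jacksonville"),
    ("438 U.S.", "438 U-S."),
]
for expected, actual in pairs:
    print(char_diffs(expected, actual))
```

Substitutions of visually similar characters, as seen here, are the kind of difference this sketch would surface.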



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3952) Content mismatch

2023-01-10 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656740#comment-17656740
 ] 

Tilman Hausherr commented on TIKA-3952:
---

This online OCR page has the same error: https://ocr.space/ (use engine3)



[jira] [Commented] (TIKA-3952) Content mismatch

2023-01-10 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656736#comment-17656736
 ] 

Tilman Hausherr commented on TIKA-3952:
---

You are doing OCR or it's the wrong file. The attached file does not have any 
text, only a bitmap.



[jira] [Commented] (TIKA-3952) Content mismatch

2023-01-09 Thread Tika User (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656062#comment-17656062
 ] 

Tika User commented on TIKA-3952:
-

We are not doing any OCR for this. It is a simple native file, and we are
getting all the metadata related to that document.



[jira] [Commented] (TIKA-3952) Content mismatch

2023-01-09 Thread Tika User (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656063#comment-17656063
 ] 

Tika User commented on TIKA-3952:
-

FYI, I attached the PDF file for your reference.



[jira] [Commented] (TIKA-3952) Content mismatch

2023-01-09 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656060#comment-17656060
 ] 

Nick Burch commented on TIKA-3952:
--

Is the PDF a scan? Are you doing OCR?



[jira] [Commented] (TIKA-3952) Content mismatch

2023-01-09 Thread Tika User (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656059#comment-17656059
 ] 

Tika User commented on TIKA-3952:
-

[~nick] I ran this command:



java -jar pdfbox-app.2.0.27.jar ExtractText problematicPDF.pdf

The txt file was created in the same location, but it doesn't have any content
in it.
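
An empty ExtractText result like the one described above is consistent with a PDF that has no text layer, for example a page stored only as an image. A minimal sketch of automating that check in Python; the helper names are hypothetical, and the assumption that the output file is UTF-8 text is illustrative rather than taken from the PDFBox documentation:

```python
from pathlib import Path


def is_blank(text: str) -> bool:
    """True if the text contains nothing but whitespace."""
    return not text.strip()


def extracted_file_is_blank(txt_path: str) -> bool:
    """Check the .txt file written by a text-extraction run (path is caller-supplied)."""
    # Assumption: the file is UTF-8; undecodable bytes are replaced instead of raising.
    content = Path(txt_path).read_text(encoding="utf-8", errors="replace")
    return is_blank(content)
```

If the helper returns True for the generated txt file, any text reported for the same PDF elsewhere most likely came from OCR rather than from the PDF itself.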



[jira] [Commented] (TIKA-3952) Content mismatch

2023-01-09 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656049#comment-17656049
 ] 

Nick Burch commented on TIKA-3952:
--

Can you try following the steps in 
[https://cwiki.apache.org/confluence/display/TIKA/Troubleshooting+Tika#TroubleshootingTika-PDFTextProblems]
 ?
