[GitHub] [tika] THausherr merged pull request #888: Bump aws.version from 1.12.380 to 1.12.381

2023-01-09 Thread GitBox


THausherr merged PR #888:
URL: https://github.com/apache/tika/pull/888


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [tika] dependabot[bot] opened a new pull request, #888: Bump aws.version from 1.12.380 to 1.12.381

2023-01-09 Thread GitBox


dependabot[bot] opened a new pull request, #888:
URL: https://github.com/apache/tika/pull/888

   Bumps `aws.version` from 1.12.380 to 1.12.381.
   Updates `aws-java-sdk-s3` from 1.12.380 to 1.12.381
   
   Changelog
   Sourced from aws-java-sdk-s3's changelog:
   https://github.com/aws/aws-sdk-java/blob/master/CHANGELOG.md
   
   1.12.381 2023-01-09
   AWS Network Firewall
   
   
   Features
   
   Network Firewall now supports the Suricata rule action reject, in 
addition to the actions pass, drop, and alert.
   
   
   
   AWS Resource Access Manager
   
   
   Features
   
   Enabled FIPS aws-us-gov endpoints in SDK.
   
   
   
   Amazon Elastic Container Registry Public
   
   
   Features
   
   This release for Amazon ECR Public makes several changes to bring the SDK 
into sync with the API.
   
   
   
   Amazon Kendra Intelligent Ranking
   
   
   Features
   
   Introducing Amazon Kendra Intelligent Ranking, a new set of Kendra APIs 
that leverages Kendra semantic ranking capabilities to improve the quality of 
search results from other search services (e.g. OpenSearch, Elasticsearch, 
Solr).
   
   
   
   Amazon WorkSpaces Web
   
   
   Features
   
   This release adds support for a new portal authentication type: AWS IAM 
Identity Center (successor to AWS Single Sign-On).
   
   
   
   
   
   
   Commits
   
   facc645 AWS SDK for Java 1.12.381
   (https://github.com/aws/aws-sdk-java/commit/facc64566cc3e9c59ad472ccbe03da14cb1f0115)
   8095b21 Update GitHub version number to 1.12.381-SNAPSHOT
   (https://github.com/aws/aws-sdk-java/commit/8095b21a9a6228a60bf264bf2e2be5a3994319ba)
   See full diff in compare view:
   https://github.com/aws/aws-sdk-java/compare/1.12.380...1.12.381
   
   
   
   
   Updates `aws-java-sdk-transcribe` from 1.12.380 to 1.12.381
   
   
   
   
   Dependabot will resolve any conflicts with this PR as long as you don't 
alter it yourself. You can also trigger a rebase manually by commenting 
`@dependabot rebase`.
   
   
   ---
   
   
   Dependabot commands and options
   
   
   You can trigger Dependabot actions by commenting on this PR:
   - `@dependabot rebase` will rebase this PR
   - `@dependabot recreate` will recreate this PR, overwriting any edits that 
have been made to it
   - `@dependabot merge` will merge this PR after your CI passes on it
   - `@dependabot squash and merge` will squash and merge this PR after your CI 
passes on it
   - `@dependabot cancel merge` will cancel a previously requested merge and 
block automerging
   - `@dependabot reopen` will reopen this PR if it is closed
   - `@dependabot close` will close this PR and stop Dependabot recreating it. 
You can achieve the same result by closing it manually
   - `@dependabot ignore this major version` will close this PR and stop 
Dependabot creating any more for this major version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this minor version` will close this PR and stop 
Dependabot creating any more for this minor version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this dependency` will close this PR and stop 
Dependabot creating any more for this dependency (unless you reopen the PR or 
upgrade to it yourself)
   
   
   



[jira] [Commented] (TIKA-3952) Content mismatch

2023-01-09 Thread Tika User (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656062#comment-17656062
 ] 

Tika User commented on TIKA-3952:
-

We are not doing any OCR for this. It is a simple native file, and we are 
getting all metadata related to that document.

> Content mismatch 
> -
>
> Key: TIKA-3952
> URL: https://issues.apache.org/jira/browse/TIKA-3952
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Tika User
>Priority: Major
> Attachments: download.pdf
>
>
> While extracting the content of the attached file, we are seeing the content 
> mismatch below.
> Native file content  : 95 (1972); Erznoznik v. City of Jacksonville
> Content we got from Tika : 95 (1972); Er{*}e{*}noznik v. City of Jacksonville
>  
> Native file content   : 438 U.S.\n726
> Content we got from Tika : 438 {*}U-S{*}.\n726



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3952) Content mismatch

2023-01-09 Thread Tika User (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656063#comment-17656063
 ] 

Tika User commented on TIKA-3952:
-

FYI. I attached PDF file for your reference.






[jira] [Commented] (TIKA-3952) Content mismatch

2023-01-09 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656060#comment-17656060
 ] 

Nick Burch commented on TIKA-3952:
--

Is the PDF a scan? Are you doing OCR?






[jira] [Commented] (TIKA-3952) Content mismatch

2023-01-09 Thread Tika User (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656059#comment-17656059
 ] 

Tika User commented on TIKA-3952:
-

[~nick] I ran this command :



java -jar pdfbox-app.2.0.27.jar ExtractText problematicPDF.pdf

The txt file was created in the same location, but it doesn't have any content 
in it.






[jira] [Commented] (TIKA-3952) Content mismatch

2023-01-09 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656049#comment-17656049
 ] 

Nick Burch commented on TIKA-3952:
--

Can you try following the steps in 
[https://cwiki.apache.org/confluence/display/TIKA/Troubleshooting+Tika#TroubleshootingTika-PDFTextProblems]
 ?






[jira] [Created] (TIKA-3952) Content mismatch

2023-01-09 Thread Tika User (Jira)
Tika User created TIKA-3952:
---

 Summary: Content mismatch 
 Key: TIKA-3952
 URL: https://issues.apache.org/jira/browse/TIKA-3952
 Project: Tika
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Tika User
 Attachments: download.pdf

While extracting the content of the attached file, we are seeing the content mismatch below.



Native file content  : 95 (1972); Erznoznik v. City of Jacksonville

Content we got from Tika : 95 (1972); Er{*}e{*}noznik v. City of Jacksonville

 

Native file content   : 438 U.S.\n726

Content we got from Tika : 438 {*}U-S{*}.\n726
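
The two mismatches reported above can be pinned down character by character. The following is a minimal sketch, not part of the original report and not something Tika or PDFBox provides: it uses Python's standard `difflib` module, the helper name `diff_spans` is hypothetical, and the sample strings are copied from the report with the Jira bold markers (`{*}`) removed.

```python
import difflib


def diff_spans(expected: str, actual: str):
    """Return (tag, expected_piece, actual_piece) for every non-matching span."""
    # autojunk=False keeps the matcher from treating frequent characters as junk.
    matcher = difflib.SequenceMatcher(None, expected, actual, autojunk=False)
    return [
        (tag, expected[i1:i2], actual[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# (native text, text reported as extracted) pairs taken from the issue description.
pairs = [
    ("95 (1972); Erznoznik v. City of Jacksonville",
     "95 (1972); Erenoznik v. City of Jacksonville"),
    ("438 U.S.\n726", "438 U-S.\n726"),
]

for native, extracted in pairs:
    for tag, want, got in diff_spans(native, extracted):
        print(f"{tag}: expected {want!r}, got {got!r}")
```

Run over the full native and extracted text rather than two short samples, the same comparison would list every differing span, which may help show whether the substitutions follow a pattern.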


