date:20240328

[PR] Bump aws.version from 1.12.689 to 1.12.690 [tika]

2024-03-28 Thread via GitHub



dependabot[bot] opened a new pull request, #1700:
URL: https://github.com/apache/tika/pull/1700

   Bumps `aws.version` from 1.12.689 to 1.12.690.
   Updates `com.amazonaws:aws-java-sdk-s3` from 1.12.689 to 1.12.690
   
   Changelog
   Sourced from [com.amazonaws:aws-java-sdk-s3's changelog](https://github.com/aws/aws-sdk-java/blob/master/CHANGELOG.md).
   
   1.12.690 2024-03-28
   AWS Compute Optimizer
   
   
   Features
   
   This release enables AWS Compute Optimizer to analyze and generate 
recommendations with a new customization preference, Memory Utilization.
   
   
   
   Amazon CodeCatalyst
   
   
   Features
   
   This release adds support for understanding pending changes to 
subscriptions by including two new response parameters for the GetSubscription 
API for Amazon CodeCatalyst.
   
   
   
   Amazon Elastic Compute Cloud
   
   
   Features
   
   Amazon EC2 C7gd, M7gd and R7gd metal instances with up to 3.8 TB of 
local NVMe-based SSD block-level storage deliver up to 45% better real-time NVMe 
storage performance than comparable Graviton2-based instances.
   
   
   
   Amazon Elastic Kubernetes Service
   
   
   Features
   
   Add multiple customer error codes to handle customer-caused failures when 
managing EKS node groups.
   
   
   
   Amazon GuardDuty
   
   
   Features
   
   Add EC2 support for GuardDuty Runtime Monitoring auto management.
   
   
   
   Amazon Neptune Graph
   
   
   Features
   
   Update ImportTaskCancelled waiter to evaluate task state correctly and 
minor documentation changes.
   
   
   
   Amazon QuickSight
   
   
   Features
   
   Amazon QuickSight: Adds support for setting up VPC Endpoint restrictions 
for accessing QuickSight Website.
   
   
   
   CloudWatch Observability Access Manager
   
   
   Features
   
   This release adds support for sharing AWS::InternetMonitor::Monitor 
resources.
   
   
   
   
   
   
   Commits
   
   - [8158c91](https://github.com/aws/aws-sdk-java/commit/8158c919c956717dabf2a6ae7cc1d26b592488ac) AWS SDK for Java 1.12.690
   - [1b3444a](https://github.com/aws/aws-sdk-java/commit/1b3444aa78f4579c4083bd4b3858322bc343a906) Update GitHub version number to 1.12.690-SNAPSHOT
   - See full diff in [compare view](https://github.com/aws/aws-sdk-java/compare/1.12.689...1.12.690)
   
   
   
   
   Updates `com.amazonaws:aws-java-sdk-transcribe` from 1.12.689 to 1.12.690
   
   
   Dependabot will resolve any conflicts with this PR as long as you don't 
alter it yourself. You can also trigger a rebase manually by commenting 
`@dependabot rebase`.
   
   [//]: # (dependabot-automerge-start)
   [//]: # (dependabot-automerge-end)
   
   ---
   
   
   Dependabot commands and options
   
   
   You can trigger Dependabot actions by commenting on this PR:
   - `@dependabot rebase` will rebase this PR
   - `@dependabot recreate` will recreate this PR, overwriting any edits

[PR] Bump commons-io:commons-io from 2.15.1 to 2.16.0 [tika]

2024-03-28 Thread via GitHub



dependabot[bot] opened a new pull request, #1701:
URL: https://github.com/apache/tika/pull/1701

   Bumps commons-io:commons-io from 2.15.1 to 2.16.0.
   
   
   [![Dependabot compatibility 
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=commons-io:commons-io=maven=2.15.1=2.16.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
   
   Dependabot will resolve any conflicts with this PR as long as you don't 
alter it yourself. You can also trigger a rebase manually by commenting 
`@dependabot rebase`.
   
   [//]: # (dependabot-automerge-start)
   [//]: # (dependabot-automerge-end)
   
   ---
   
   
   Dependabot commands and options
   
   
   You can trigger Dependabot actions by commenting on this PR:
   - `@dependabot rebase` will rebase this PR
   - `@dependabot recreate` will recreate this PR, overwriting any edits that 
have been made to it
   - `@dependabot merge` will merge this PR after your CI passes on it
   - `@dependabot squash and merge` will squash and merge this PR after your CI 
passes on it
   - `@dependabot cancel merge` will cancel a previously requested merge and 
block automerging
   - `@dependabot reopen` will reopen this PR if it is closed
   - `@dependabot close` will close this PR and stop Dependabot recreating it. 
You can achieve the same result by closing it manually
   - `@dependabot show <dependency name> ignore conditions` will show all of 
the ignore conditions of the specified dependency
   - `@dependabot ignore this major version` will close this PR and stop 
Dependabot creating any more for this major version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this minor version` will close this PR and stop 
Dependabot creating any more for this minor version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this dependency` will close this PR and stop 
Dependabot creating any more for this dependency (unless you reopen the PR or 
upgrade to it yourself)
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Re: [PR] Bump aws.version from 1.12.689 to 1.12.690 [tika]

2024-03-28 Thread via GitHub



THausherr merged PR #1700:
URL: https://github.com/apache/tika/pull/1700



[jira] [Commented] (TIKA-4207) PipesParser should have option to extract raw bytes of embedded files

2024-03-28 Thread Hudson (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-4207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831840#comment-17831840
 ] 

Hudson commented on TIKA-4207:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1580 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1580/])
TIKA-4207: Add handling of embedded bytes to tika-pipes (#1699) (github: 
[https://github.com/apache/tika/commit/4fe7312330c430f357012f8d0ff886a0fb344783])
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/WMFParser.java
* (add) 
tika-pipes/tika-async-cli/src/test/resources/configs/TIKA-4207-emitter.xml
* (edit) 
tika-core/src/main/java/org/apache/tika/parser/RecursiveParserWrapper.java
* (add) 
tika-core/src/main/java/org/apache/tika/extractor/EmbeddedDocumentByteStoreExtractorFactory.java
* (edit) 
tika-core/src/main/java/org/apache/tika/extractor/ParsingEmbeddedDocumentExtractor.java
* (edit) tika-core/src/test/java/org/apache/tika/pipes/PipesServerTest.java
* (add) tika-app/src/test/java/org/apache/tika/cli/TikaCLIAsyncTest.java
* (edit) tika-pipes/tika-pipes-iterators/pom.xml
* (edit) tika-pipes/tika-async-cli/pom.xml
* (add) tika-pipes/tika-pipes-iterators/tika-pipes-iterator-json/pom.xml
* (edit) 
tika-core/src/main/java/org/apache/tika/metadata/TikaCoreProperties.java
* (add) tika-core/src/test/resources/org/apache/tika/pipes/TIKA-4207.xml
* (add) 
tika-core/src/main/java/org/apache/tika/pipes/extractor/EmbeddedDocumentBytesConfig.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/configs/tika-config-no-names.xml
* (delete) 
tika-core/src/test/java/org/apache/tika/pipes/async/AsyncProcessorTest.java
* (add) 
tika-core/src/main/java/org/apache/tika/extractor/EmbeddedDocumentBytesHandler.java
* (add) 
tika-core/src/test/java/org/apache/tika/pipes/async/AsyncChaosMonkeyTest.java
* (add) 
tika-pipes/tika-pipes-iterators/tika-pipes-iterator-json/src/test/resources/test-documents/test.json
* (edit) 
tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/AsyncResource.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/configs/tika-config-with-names.xml
* (add) 
tika-core/src/main/java/org/apache/tika/pipes/extractor/EmittingEmbeddedDocumentBytesHandler.java
* (delete) tika-pipes/tika-async-cli/src/test/resources/tika-config-broken.xml
* (add) 
tika-pipes/tika-async-cli/src/test/resources/configs/tika-config-broken.xml
* (add) 
tika-pipes/tika-pipes-iterators/tika-pipes-iterator-json/src/test/java/org/apache/tika/pipes/pipesiterator/json/TestJsonPipesIterator.java
* (add) 
tika-core/src/main/java/org/apache/tika/extractor/BasicEmbeddedDocumentBytesHandler.java
* (add) 
tika-pipes/tika-pipes-iterators/tika-pipes-iterator-json/src/main/java/org/apache/tika/pipes/pipesiterator/json/JsonPipesIterator.java
* (edit) 
tika-serialization/src/main/java/org/apache/tika/metadata/serialization/JsonFetchEmitTuple.java
* (edit) tika-core/src/main/java/org/apache/tika/pipes/PipesServer.java
* (edit) 
tika-pipes/tika-async-cli/src/test/java/org/apache/tika/async/cli/TikaAsyncCLITest.java
* (edit) 
tika-serialization/src/test/java/org/apache/tika/metadata/serialization/JsonFetchEmitTupleTest.java
* (add) 
tika-pipes/tika-pipes-iterators/tika-pipes-iterator-json/src/test/resources/test-documents/test-with-embedded-bytes.json
* (add) tika-core/src/main/java/org/apache/tika/extractor/RUnpackExtractor.java
* (add) 
tika-core/src/test/java/org/apache/tika/parser/AutoDetectParserConfigTest.java
* (edit) tika-core/src/main/java/org/apache/tika/pipes/FetchEmitTuple.java
* (edit) tika-core/src/main/java/org/apache/tika/parser/AutoDetectParser.java
* (edit) 
tika-server/tika-server-standard/src/test/java/org/apache/tika/server/standard/TikaPipesTest.java
* (add) 
tika-core/src/main/java/org/apache/tika/extractor/RUnpackExtractorFactory.java
* (add) 
tika-core/src/test/resources/org/apache/tika/pipes/TIKA-4207-limit-bytes.xml
* (add) 
tika-core/src/main/java/org/apache/tika/extractor/BasicEmbeddedBytesSelector.java
* (add) 
tika-core/src/main/java/org/apache/tika/extractor/AbstractEmbeddedDocumentBytesHandler.java
* (add) 
tika-core/src/main/java/org/apache/tika/extractor/EmbeddedBytesSelector.java
* (edit) tika-core/src/main/java/org/apache/tika/io/BoundedInputStream.java
* (edit) 
tika-core/src/main/java/org/apache/tika/parser/AutoDetectParserConfig.java
* (edit) tika-core/src/test/java/org/apache/tika/parser/mock/MockParser.java
* (edit) tika-app/src/test/java/org/apache/tika/cli/TikaCLITest.java
* (add) 
tika-pipes/tika-async-cli/src/test/java/org/apache/tika/async/cli/AsyncProcessorTest.java
* (add) 
tika-pipes/tika-async-cli/src/test/resources/test-documents/basic_embedded.xml
* (add)

[jira] [Commented] (TIKA-4228) Tika parser crashes JVM when it gets metadata and embedded objects from pdf

2024-03-28 Thread Xiaohong Yang (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-4228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831835#comment-17831835
 ] 

Xiaohong Yang commented on TIKA-4228:
-

It is not multithreaded. I will try to get the exit value of the process (if 
possible).  I will also check if there is a core dump on the machine.
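
One way to capture the exit value mentioned above is to launch the parse in a child JVM and wait for it. The following is a minimal sketch, not code from this thread: the class name `ExitCodeProbe`, its methods, and the command passed to it are all hypothetical. It assumes Linux, where `Process.waitFor()` typically reports 128 plus the signal number when the child was killed by a signal, so 134 would suggest SIGABRT and 139 would suggest SIGSEGV.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical helper, not part of Tika: runs a command in a child process
// and interprets its exit value.
public class ExitCodeProbe {

    /** Starts the command, forwards its stdout/stderr, and returns the exit value. */
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }

    /**
     * Gives a rough reading of an exit value. Assumption: on Linux a value
     * above 128 usually means the process was terminated by signal (value - 128).
     */
    static String describe(int exitValue) {
        if (exitValue == 0) {
            return "clean exit";
        }
        if (exitValue > 128) {
            return "likely killed by signal " + (exitValue - 128);
        }
        return "exited with code " + exitValue;
    }
}
```

The command list would be whatever starts the sample program, for example `java` followed by the classpath and the `ProcessPdf` main class. Running it in a loop and logging `describe(run(...))` for each attempt may show whether the intermittent failures share one exit value.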

> Tika parser crashes JVM when it gets metadata and embedded objects from pdf
> ---------------------------------------------------------------------------
>
>              Key: TIKA-4228
>              URL: https://issues.apache.org/jira/browse/TIKA-4228
>          Project: Tika
>       Issue Type: Bug
> Affects Versions: 2.9.0
>         Reporter: Xiaohong Yang
>         Priority: Major
>      Attachments: tika-config-and-sample-file.zip
>
>
> [^tika-config-and-sample-file.zip]
>  
> We use org.apache.tika.parser.AutoDetectParser to get metadata and embedded 
> objects from pdf documents.  And we found out that it crashes the program (or 
> the JVM) when it gets metadata and embedded files from the sample pdf file.
>  
> Following is the sample code and attached is the tika-config.xml and the 
> sample pdf file. Note that the sample file crashes the JVM in 1 out of 10 
> runs in our production environment.  Sometimes it happens when it gets 
> metadata and sometimes it happens when it extracts embedded files (the 
> chances are about 50/50).
>  
> The operating system is Ubuntu 20.04. Java version is 21.  Tika version is 
> 2.9.0 and POI version is 5.2.3.   
>  
>  
> import org.apache.pdfbox.io.IOUtils;
> import org.apache.poi.poifs.filesystem.DirectoryEntry;
> import org.apache.poi.poifs.filesystem.DocumentEntry;
> import org.apache.poi.poifs.filesystem.DocumentInputStream;
> import org.apache.poi.poifs.filesystem.POIFSFileSystem;
> import org.apache.tika.config.TikaConfig;
> import org.apache.tika.detect.Detector;
> import org.apache.tika.extractor.EmbeddedDocumentExtractor;
> import org.apache.tika.io.FilenameUtils;
> import org.apache.tika.io.TikaInputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.metadata.TikaCoreProperties;
> import org.apache.tika.mime.MediaType;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.BodyContentHandler;
> import org.xml.sax.ContentHandler;
> import org.xml.sax.SAXException;
> import org.xml.sax.helpers.DefaultHandler;
>  
> import java.io.*;
> import java.net.URL;
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.nio.file.Paths;
>  
> public class ProcessPdf {
>     private final Path inputFile = new File("/home/ubuntu/testdirs/testdir_pdf/sample.pdf").toPath();
>     private final Path outputDir = new File("/home/ubuntu/testdirs/testdir_pdf/tika_output/").toPath();
>  
>     private Parser parser;
>     private ParseContext context;
>  
>     public static void main(String args[]) {
>         try {
>             System.out.println("Start");
>             ProcessPdf processPdf = new ProcessPdf();
>             System.out.println("Get metadata");
>             processPdf.getMataData();
>             System.out.println("Extract embedded files");
>             processPdf.extract();
>             System.out.println("End");
>         } catch (Exception ex) {
>             ex.printStackTrace();
>         }
>     }
>  
>     public ProcessPdf() {
>     }
>  
>     public void getMataData() throws Exception {
>         BodyContentHandler handler = new BodyContentHandler(-1);
>  
>         Metadata metadata = new Metadata();
>         try (FileInputStream inputData = new FileInputStream(inputFile.toString())) {
>             TikaConfig config = new TikaConfig("/home/ubuntu/testdirs/testdir_pdf/tika-config.xml");
>             Parser autoDetectParser = new AutoDetectParser(config);
>             ParseContext context = new ParseContext();
>             context.set(TikaConfig.class, config);
>             autoDetectParser.parse(inputData, handler, metadata, context);
>         }
>  
>         String content = handler.toString();
>     }
>  
>     public void extract() throws Exception {
>         TikaConfig config = new TikaConfig("/home/ubuntu/testdirs/testdir_pdf/tika-config.xml");
>         ProcessPdf.FileEmbeddedDocumentExtractor fileEmbeddedDocumentExtractor = new ProcessPdf.FileEmbeddedDocumentExtractor();
>  
>         parser = new AutoDetectParser(config);
>         context = new ParseContext();
>         context.set(Parser.class, parser);
>         context.set(TikaConfig.class, config);
>         context.set(EmbeddedDocumentExtractor.class, fileEmbeddedDocumentExtractor);
>  
>         URL url = inputFile.toUri().toURL();
>         Metadata metadata = new Metadata();
>         try (InputStream input = TikaInputStream.get(url, metadata)) {
>             ContentHandler

[jira] [Resolved] (TIKA-4207) PipesParser should have option to extract raw bytes of embedded files

2024-03-28 Thread Tim Allison (Jira)



 [ 
https://issues.apache.org/jira/browse/TIKA-4207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4207.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

> PipesParser should have option to extract raw bytes of embedded files
> ---------------------------------------------------------------------
>
>         Key: TIKA-4207
>         URL: https://issues.apache.org/jira/browse/TIKA-4207
>     Project: Tika
>  Issue Type: New Feature
>    Reporter: Tim Allison
>    Priority: Major
>     Fix For: 3.0.0
>
>
> There are many use cases where text+metadata are important, but users also 
> need the raw bytes from embedded files.
> Let's make it possible to extract the usual rmeta content _and_ the raw 
> bytes. This is a preliminary step that will offer more customization options 
> than the proposal in TIKA-3703.
> This is targeted to 3.x.
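
To make the "raw bytes" part of the request concrete, here is a minimal standard-library sketch of writing one embedded file's bytes to an output directory. It is not the TIKA-4207 implementation and does not use Tika's API: the class `EmbeddedBytesSketch`, the method `writeCapped`, the `embedded-<id>.bin` naming, and the byte cap are all assumptions made for illustration. The cap is included only because the commit's file list mentions a limit-bytes test config.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration only; names and file layout are not Tika's.
public class EmbeddedBytesSketch {

    /**
     * Copies at most maxBytes from the embedded file's stream into
     * outputDir/embedded-<embeddedId>.bin and returns the number of bytes written.
     */
    static long writeCapped(InputStream in, Path outputDir, int embeddedId, long maxBytes)
            throws IOException {
        Files.createDirectories(outputDir);
        Path target = outputDir.resolve("embedded-" + embeddedId + ".bin");
        byte[] buffer = new byte[8192];
        long written = 0;
        try (OutputStream out = Files.newOutputStream(target)) {
            while (written < maxBytes) {
                // Never request more than the remaining allowance.
                int toRead = (int) Math.min(buffer.length, maxBytes - written);
                int n = in.read(buffer, 0, toRead);
                if (n < 0) {
                    break; // end of the embedded stream
                }
                out.write(buffer, 0, n);
                written += n;
            }
        }
        return written;
    }
}
```

A caller would invoke this once per embedded document, alongside the normal text and metadata extraction, so that both the parsed content and the original bytes are available afterwards.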



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (TIKA-4207) PipesParser should have option to extract raw bytes of embedded files

2024-03-28 Thread Tim Allison (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-4207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831815#comment-17831815
 ] 

Tim Allison commented on TIKA-4207:
---

There are some areas for simplification, but I think this is good enough to go 
for now.


[jira] [Commented] (TIKA-4207) PipesParser should have option to extract raw bytes of embedded files

2024-03-28 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-4207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831785#comment-17831785
 ] 

ASF GitHub Bot commented on TIKA-4207:
--

tballison merged PR #1699:
URL: https://github.com/apache/tika/pull/1699





Re: [PR] TIKA-4207: Add handling of embedded bytes to tika-pipes [tika]

2024-03-28 Thread via GitHub



tballison merged PR #1699:
URL: https://github.com/apache/tika/pull/1699



[jira] [Updated] (TIKA-4230) Optimized code ComparableVersion

2024-03-28 Thread zhao tao (Jira)



 [ 
https://issues.apache.org/jira/browse/TIKA-4230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhao tao updated TIKA-4230:
---
Issue Type: Improvement  (was: Bug)

> Optimized code ComparableVersion
> 
>
> Key: TIKA-4230
> URL: https://issues.apache.org/jira/browse/TIKA-4230
> Project: Tika
>  Issue Type: Improvement
>Reporter: zhao tao
>Priority: Minor
>





[jira] [Created] (TIKA-4230) Optimized code ComparableVersion

2024-03-28 Thread zhao tao (Jira)

zhao tao created TIKA-4230:
--

 Summary: Optimized code ComparableVersion
 Key: TIKA-4230
 URL: https://issues.apache.org/jira/browse/TIKA-4230
 Project: Tika
  Issue Type: Bug
Reporter: zhao tao







[jira] [Commented] (TIKA-4207) PipesParser should have option to extract raw bytes of embedded files

2024-03-28 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-4207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831732#comment-17831732
 ] 

ASF GitHub Bot commented on TIKA-4207:
--

tballison opened a new pull request, #1699:
URL: https://github.com/apache/tika/pull/1699

   
   
   Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the [Tika issue tracker](https://issues.apache.org/jira/projects/TIKA) which describes the problem or the improvement. We cannot accept pull requests without an issue because the change wouldn't be listed in the release notes.
   * the issue ID (`TIKA-`)
     - is referenced in the title of the pull request
     - and placed in front of your commit messages surrounded by square brackets (`[TIKA-] Issue or pull request title`)
   * commits are squashed into a single one (or a few commits for larger changes)
   * Tika is successfully built and unit tests pass by running `mvn clean test`
   * there should be no conflicts when merging the pull request branch into the *recent* `main` branch. If there are conflicts, please try to rebase the pull request branch on top of a freshly pulled `main` branch
   * if you add a new module that downstream users will depend upon, add it to the relevant group in `tika-bom/pom.xml`.
   
   We will be able to integrate your pull request faster if these conditions are met. If you have any questions about how to fix your problem or about using Tika in general, please sign up for the [Tika mailing list](http://tika.apache.org/mail-lists.html). Thanks!
   





[PR] TIKA-4207: Add handling of embedded bytes to tika-pipes [tika]

2024-03-28 Thread via GitHub



tballison opened a new pull request, #1699:
URL: https://github.com/apache/tika/pull/1699

   
   

[jira] [Commented] (TIKA-4229) add microsoft graph fetcher

2024-03-28 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-4229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831680#comment-17831680
 ] 

ASF GitHub Bot commented on TIKA-4229:
--

nddipiazza opened a new pull request, #1698:
URL: https://github.com/apache/tika/pull/1698

   initial attempt to add microsoft graph fetcher
   
   
   
   




> add microsoft graph fetcher
> ---
>
> Key: TIKA-4229
> URL: https://issues.apache.org/jira/browse/TIKA-4229
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-pipes
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> add a tika pipes fetcher capable of fetching files from MS graph api




[PR] TIKA-4229 [tika]

2024-03-28 Thread via GitHub



nddipiazza opened a new pull request, #1698:
URL: https://github.com/apache/tika/pull/1698

   initial attempt to add microsoft graph fetcher
   
   
   

[jira] [Created] (TIKA-4229) add microsoft graph fetcher

2024-03-28 Thread Nicholas DiPiazza (Jira)

Nicholas DiPiazza created TIKA-4229:
---

 Summary: add microsoft graph fetcher
 Key: TIKA-4229
 URL: https://issues.apache.org/jira/browse/TIKA-4229
 Project: Tika
  Issue Type: New Feature
  Components: tika-pipes
Reporter: Nicholas DiPiazza


add a tika pipes fetcher capable of fetching files from MS graph api




Re: [PR] Bump com.github.luben:zstd-jni from 1.5.5-11 to 1.5.6-1 [tika]

2024-03-28 Thread via GitHub



THausherr merged PR #1697:
URL: https://github.com/apache/tika/pull/1697


