(tika) branch dependabot/maven/io.netty-netty-bom-4.1.108.Final created (now fd23e6c27)
This is an automated email from the ASF dual-hosted git repository.

github-bot pushed a change to branch dependabot/maven/io.netty-netty-bom-4.1.108.Final
in repository https://gitbox.apache.org/repos/asf/tika.git

      at fd23e6c27 Bump io.netty:netty-bom from 4.1.107.Final to 4.1.108.Final

No new revisions were added by this update.
(tika) branch dependabot/maven/com.google.cloud-google-cloud-storage-2.36.1 created (now 8e27e31a6)
This is an automated email from the ASF dual-hosted git repository.

github-bot pushed a change to branch dependabot/maven/com.google.cloud-google-cloud-storage-2.36.1
in repository https://gitbox.apache.org/repos/asf/tika.git

      at 8e27e31a6 Bump com.google.cloud:google-cloud-storage from 2.36.0 to 2.36.1

No new revisions were added by this update.
(tika) branch dependabot/maven/aws.version-1.12.685 created (now a01e3edb4)
This is an automated email from the ASF dual-hosted git repository.

github-bot pushed a change to branch dependabot/maven/aws.version-1.12.685
in repository https://gitbox.apache.org/repos/asf/tika.git

      at a01e3edb4 Bump aws.version from 1.12.684 to 1.12.685

No new revisions were added by this update.
(tika) branch TIKA-4207 updated: TIKA-4207 -- refactor to use inputstreams instead of byte arrays. add max bytes extracted
This is an automated email from the ASF dual-hosted git repository.

tallison pushed a commit to branch TIKA-4207
in repository https://gitbox.apache.org/repos/asf/tika.git

The following commit(s) were added to refs/heads/TIKA-4207 by this push:
     new 59608e69b TIKA-4207 -- refactor to use inputstreams instead of byte arrays. add max bytes extracted
59608e69b is described below

commit 59608e69bdaeb8a8151e1e9f27b1ef7c3030288b
Author: tallison
AuthorDate: Thu Mar 21 17:19:37 2024 -0400

    TIKA-4207 -- refactor to use inputstreams instead of byte arrays. add max bytes extracted
---
 .../AbstractEmbeddedDocumentByteStore.java | 3 +-
 .../extractor/BasicEmbeddedDocumentByteStore.java | 16 ++--
 .../tika/extractor/EmbeddedDocumentByteStore.java | 5 +-
 .../tika/extractor/EmbeddedDocumentUtil.java | 2 +-
 .../ParsingEmbeddedDocumentExtractor.java | 40 +++--
 .../ParsingEmbeddedDocumentExtractorFactory.java | 22 -
 .../org/apache/tika/io/BoundedInputStream.java | 4 +
 .../java/org/apache/tika/pipes/PipesServer.java | 5 +-
 .../extractor/EmbeddedDocumentBytesConfig.java | 6 +-
 .../extractor/EmbeddedDocumentEmitterStore.java | 9 +-
 .../org/apache/tika/pipes/PipesServerTest.java | 58 -
 .../apache/tika/pipes/TIKA-4207-limit-bytes.xml | 34
 .../parser/microsoft/pst/OutlookPSTParserTest.java | 2 +-
 .../apache/tika/parser/pdf/PDFRenderingTest.java | 2 +-
 .../apache/tika/server/standard/TikaPipesTest.java | 97 +-
 15 files changed, 270 insertions(+), 35 deletions(-)

diff --git a/tika-core/src/main/java/org/apache/tika/extractor/AbstractEmbeddedDocumentByteStore.java b/tika-core/src/main/java/org/apache/tika/extractor/AbstractEmbeddedDocumentByteStore.java
index 214c2ab4e..15b26451a 100644
--- a/tika-core/src/main/java/org/apache/tika/extractor/AbstractEmbeddedDocumentByteStore.java
+++ b/tika-core/src/main/java/org/apache/tika/extractor/AbstractEmbeddedDocumentByteStore.java
@@ -17,6 +17,7 @@
 package org.apache.tika.extractor;
 
 import java.io.IOException;
+import java.io.InputStream;
 import java.util.ArrayList;
 import java.util.List;
 import java.util.Locale;
@@ -57,7 +58,7 @@ public abstract class AbstractEmbeddedDocumentByteStore implements EmbeddedDocum
     }
 
     @Override
-    public void add(int id, Metadata metadata, byte[] bytes) throws IOException {
+    public void add(int id, Metadata metadata, InputStream bytes) throws IOException {
         ids.add(id);
     }
 
diff --git a/tika-core/src/main/java/org/apache/tika/extractor/BasicEmbeddedDocumentByteStore.java b/tika-core/src/main/java/org/apache/tika/extractor/BasicEmbeddedDocumentByteStore.java
index b41285eb0..d3aeb4507 100644
--- a/tika-core/src/main/java/org/apache/tika/extractor/BasicEmbeddedDocumentByteStore.java
+++ b/tika-core/src/main/java/org/apache/tika/extractor/BasicEmbeddedDocumentByteStore.java
@@ -17,9 +17,13 @@
 package org.apache.tika.extractor;
 
 import java.io.IOException;
+import java.io.InputStream;
 import java.util.HashMap;
 import java.util.Map;
 
+import org.apache.commons.io.IOUtils;
+import org.apache.commons.io.input.UnsynchronizedBufferedInputStream;
+
 import org.apache.tika.metadata.Metadata;
 import org.apache.tika.pipes.extractor.EmbeddedDocumentBytesConfig;
 
@@ -30,13 +34,15 @@ public class BasicEmbeddedDocumentByteStore extends AbstractEmbeddedDocumentByte
     }
     //this won't scale, but let's start fully in memory for now;
     Map<Integer, byte[]> docBytes = new HashMap<>();
-    public void add(int id, Metadata metadata, byte[] bytes) throws IOException {
-        super.add(id, metadata, bytes);
-        docBytes.put(id, bytes);
+    @Override
+    public void add(int id, Metadata metadata, InputStream is) throws IOException {
+        super.add(id, metadata, is);
+        docBytes.put(id, IOUtils.toByteArray(is));
     }
 
-    public byte[] getDocument(int id) {
-        return docBytes.get(id);
+    @Override
+    public InputStream getDocument(int id) throws IOException {
+        return new UnsynchronizedBufferedInputStream.Builder().setByteArray(docBytes.get(id)).get();
     }
 
     @Override
diff --git a/tika-core/src/main/java/org/apache/tika/extractor/EmbeddedDocumentByteStore.java b/tika-core/src/main/java/org/apache/tika/extractor/EmbeddedDocumentByteStore.java
index ad1bb81f3..8e1e8e325 100644
--- a/tika-core/src/main/java/org/apache/tika/extractor/EmbeddedDocumentByteStore.java
+++ b/tika-core/src/main/java/org/apache/tika/extractor/EmbeddedDocumentByteStore.java
@@ -18,15 +18,16 @@ package org.apache.tika.extractor;
 
 import java.io.Closeable;
 import java.io.IOException;
+import java.io.InputStream;
 import java.util.List;
 
 import org.apache.tika.metadata.Metadata;
 
 public interface EmbeddedDocumentByteStore extends Closeable {
     //we need metadata for the emitter store...can we get away without it?
-    void add(int id, Metadata metadata, byte[] bytes)
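Note on the commit above: it switches the byte store from byte[] to InputStream and mentions a cap on extracted bytes. The following is a minimal sketch of that idea using only the JDK (Java 11 or later for readNBytes). BoundedCopy and readAtMost are hypothetical names chosen for illustration; this is not Tika's BoundedInputStream and not the code from the commit.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of Tika: reads at most maxBytes from a stream.
final class BoundedCopy {

    private BoundedCopy() {
    }

    // Returns up to maxBytes bytes; anything past the limit is left unread.
    static byte[] readAtMost(InputStream is, int maxBytes) throws IOException {
        if (maxBytes < 0) {
            throw new IllegalArgumentException("maxBytes must be >= 0");
        }
        // InputStream.readNBytes(int) reads until maxBytes bytes have been
        // read or the stream ends, whichever comes first.
        return is.readNBytes(maxBytes);
    }
}
```

A caller that needs to know whether the input was truncated could read maxBytes + 1 bytes and compare lengths; that variation is left out to keep the sketch short.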
(tika) branch TIKA-4207 updated (7ca6d1759 -> 9ffc4df4a)
This is an automated email from the ASF dual-hosted git repository.

tallison pushed a change to branch TIKA-4207
in repository https://gitbox.apache.org/repos/asf/tika.git

    from 7ca6d1759 TIKA-4207 -- small improvements to AsyncResource and WMFParser
     add 36a0dca43 TIKA-4205 -- fix dependencies in tika-eval-app and add a few more columns to the ExtractProfiler (#1629)
     add 2bc0f9bdc TIKA-4202 -- add ocr page count to PDFs -- actually increment counter and move the location of the counter to before OCR is invoked (#1630)
     add 0be76cf28 Bump logback.version from 1.5.0 to 1.5.1
     add 4a5a21ea1 Merge pull request #1632 from apache/dependabot/maven/logback.version-1.5.1
     add 8ab8673ce Bump aws.version from 1.12.668 to 1.12.669
     add 386a5934a Merge pull request #1631 from apache/dependabot/maven/aws.version-1.12.669
     add 215b75b67 TIKA-4166: update puppycrawl
     add b3e4252b2 Bump aws.version from 1.12.669 to 1.12.670
     add 1f9e773e8 Merge pull request #1634 from apache/dependabot/maven/aws.version-1.12.670
     add 6b726fbe5 Bump jakarta.activation:jakarta.activation-api from 2.1.2 to 2.1.3
     add 6a0a59d42 Merge pull request #1635 from apache/dependabot/maven/jakarta.activation-jakarta.activation-api-2.1.3
     add ffc7df20f TIKA-4166: update aws, azure, mockito
     add b5023198b Bump logback.version from 1.5.1 to 1.5.2
     add 86d1e897e Merge pull request #1637 from apache/dependabot/maven/logback.version-1.5.2
     add 1a5f23ff4 Bump aws.version from 1.12.671 to 1.12.672
     add e3bb8cfea Merge pull request #1638 from apache/dependabot/maven/aws.version-1.12.672
     add c8097b6ad Bump logback.version from 1.5.2 to 1.5.3
     add dc612a7b5 Merge pull request #1639 from apache/dependabot/maven/logback.version-1.5.3
     add 32ef34ff4 TIKA-4199: add comment, print to stderr
     add 64c083d12 Bump aws.version from 1.12.672 to 1.12.673
     add 2f6e4cd30 Merge pull request #1640 from apache/dependabot/maven/aws.version-1.12.673
     add 36664ef41 Bump com.google.cloud:google-cloud-storage from 2.34.0 to 2.35.0
     add 26c33d46c Merge pull request #1641 from apache/dependabot/maven/com.google.cloud-google-cloud-storage-2.35.0
     add 6cf215017 Bump org.testcontainers:testcontainers-bom from 1.19.6 to 1.19.7
     add 8b3230dff Merge pull request #1642 from apache/dependabot/maven/org.testcontainers-testcontainers-bom-1.19.7
     add 5221d8874 Bump aws.version from 1.12.673 to 1.12.674
     add 43a4e58cc Merge pull request #1643 from apache/dependabot/maven/aws.version-1.12.674
     add b7c5d48ce Bump aws.version from 1.12.674 to 1.12.675
     add 79b194a69 Merge pull request #1644 from apache/dependabot/maven/aws.version-1.12.675
     add a89e9779f Bump jakarta.xml.bind:jakarta.xml.bind-api from 4.0.1 to 4.0.2
     add 4af4be5be Merge pull request #1645 from apache/dependabot/maven/jakarta.xml.bind-jakarta.xml.bind-api-4.0.2
     add 8b398201a TIKA-4199: revert "complete delegate class", field "in" is a dummy; remove workaround for commons-compress 1.26
     add 5b259d60a TIKA-4199: adjust test results now that commons compress bug has been fixed
     add 4d6acfc10 TIKA-4199: update commons-compress
     add 1dd99bf45 TIKA-4166: update aws
     add 5f4e380ff TIKA-4166: update jaxb
     add d477bfd3b TIKA-4166: revert jaxb update
     add 0f077da2a TIKA-4166: update jaxb and prevent convergence problem
     add f0b76e503 Bump com.googlecode.plist:dd-plist from 1.27 to 1.28
     add da3f8c970 Merge pull request #1649 from apache/dependabot/maven/com.googlecode.plist-dd-plist-1.28
     add 67790a364 Bump org.apache.maven.plugins:maven-assembly-plugin from 3.6.0 to 3.7.0
     add 418258161 Merge pull request #1646 from apache/dependabot/maven/org.apache.maven.plugins-maven-assembly-plugin-3.7.0
     add bc2167a30 Bump log4j2.version from 2.23.0 to 2.23.1
     add 17caf585d Merge pull request #1648 from apache/dependabot/maven/log4j2.version-2.23.1
     add b980d9d86 Bump com.fasterxml.jackson:jackson-bom from 2.16.1 to 2.16.2
     add bdb6a4656 Merge pull request #1647 from apache/dependabot/maven/com.fasterxml.jackson-jackson-bom-2.16.2
     add 84f0a5b7f Bump aws.version from 1.12.676 to 1.12.677
     add 3a7bbc50d Merge pull request #1651 from apache/dependabot/maven/aws.version-1.12.677
     add 3ffadd5a3 Bump aws.version from 1.12.677 to 1.12.678
     add 49064dbe2 Merge pull request #1652 from apache/dependabot/maven/aws.version-1.12.678
     add e65d52cb5 Bump org.xerial:sqlite-jdbc from 3.45.1.0 to 3.45.2.0
     add 846f3a080 Merge pull request #1655 from apache/dependabot/maven/org.xerial-sqlite-jdbc-3.45.2.0
     add be7640d53 Bump com.fasterxml.jackson:jackson-bom from 2.16.2 to 2.17.0
     add 7cd6ee86b Merge pull request #1653 from apache/dependabot/maven/com.fasterxml.jackson-jackson-bom-2.17.0
     add 23d26d770 Bump reactor.netty.version from 1.1.15 to 1.1.17
     add 18d9fd769 Merge pull request #1654 from
(tika) 02/02: TIKA-4207 -- allow users to configure include/exclude for attachment types and/or mime types
This is an automated email from the ASF dual-hosted git repository.

tallison pushed a commit to branch TIKA-4207
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 9ffc4df4a3d059d54e1e1851b8d024b24d2043f9
Author: tallison
AuthorDate: Thu Mar 21 13:48:16 2024 -0400

    TIKA-4207 -- allow users to configure include/exclude for attachment types and/or mime types
---
 .../tika/extractor/BasicEmbeddedBytesSelector.java | 77 ++
 ...ctorFactory.java => EmbeddedBytesSelector.java} | 24 +++
 .../ParsingEmbeddedDocumentExtractor.java | 28 +++-
 .../ParsingEmbeddedDocumentExtractorFactory.java | 56 ++--
 .../apache/tika/metadata/TikaCoreProperties.java | 4 ++
 .../tika/parser/AutoDetectParserConfigTest.java | 72
 .../config/TIKA-4207-embedded-bytes-config.xml | 38 +++
 7 files changed, 277 insertions(+), 22 deletions(-)

diff --git a/tika-core/src/main/java/org/apache/tika/extractor/BasicEmbeddedBytesSelector.java b/tika-core/src/main/java/org/apache/tika/extractor/BasicEmbeddedBytesSelector.java
new file mode 100644
index 0..1d5a239db
--- /dev/null
+++ b/tika-core/src/main/java/org/apache/tika/extractor/BasicEmbeddedBytesSelector.java
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.tika.extractor;
+
+import java.util.Set;
+
+import org.apache.tika.metadata.Metadata;
+import org.apache.tika.metadata.TikaCoreProperties;
+import org.apache.tika.mime.MediaType;
+import org.apache.tika.utils.StringUtils;
+
+public class BasicEmbeddedBytesSelector implements EmbeddedBytesSelector {
+
+
+
+    private final Set<String> includeMimes;
+    private final Set<String> excludeMimes;
+    private final Set<String> includeEmbeddedResourceTypes;
+
+    private final Set<String> excludeEmbeddedResourceTypes;
+
+    public BasicEmbeddedBytesSelector(Set<String> includeMimes, Set<String> excludeMimes,
+                                      Set<String> includeEmbeddedResourceTypes,
+                                      Set<String> excludeEmbeddedResourceTypes) {
+        this.includeMimes = includeMimes;
+        this.excludeMimes = excludeMimes;
+        this.includeEmbeddedResourceTypes = includeEmbeddedResourceTypes;
+        this.excludeEmbeddedResourceTypes = excludeEmbeddedResourceTypes;
+    }
+
+    public boolean select(Metadata metadata) {
+        String mime = metadata.get(Metadata.CONTENT_TYPE);
+        if (mime == null) {
+            mime = "";
+        } else {
+            //if mime matters at all, make sure to get the mime without parameters
+            if (includeMimes.size() > 0 || excludeMimes.size() > 0) {
+                MediaType mt = MediaType.parse(mime);
+                if (mt != null) {
+                    mime = mt.getType() + "/" + mt.getSubtype();
+                }
+            }
+        }
+        if (excludeMimes.contains(mime)) {
+            return false;
+        }
+        if (includeMimes.size() > 0 && ! includeMimes.contains(mime)) {
+            return false;
+        }
+        String embeddedResourceType = metadata.get(TikaCoreProperties.EMBEDDED_RESOURCE_TYPE);
+        //if a parser doesn't specify the type, treat it as ATTACHMENT
+        embeddedResourceType = StringUtils.isBlank(embeddedResourceType) ? "ATTACHMENT" :
+                embeddedResourceType;
+
+        if (excludeEmbeddedResourceTypes.contains(embeddedResourceType)) {
+            return false;
+        }
+        if (includeEmbeddedResourceTypes.size() > 0 && includeEmbeddedResourceTypes.contains(embeddedResourceType)) {
+            return true;
+        }
+        return false;
+    }
+}
diff --git a/tika-core/src/main/java/org/apache/tika/extractor/ParsingEmbeddedDocumentExtractorFactory.java b/tika-core/src/main/java/org/apache/tika/extractor/EmbeddedBytesSelector.java
similarity index 55%
copy from tika-core/src/main/java/org/apache/tika/extractor/ParsingEmbeddedDocumentExtractorFactory.java
copy to tika-core/src/main/java/org/apache/tika/extractor/EmbeddedBytesSelector.java
index 9136228c4..2ec7df667 100644
--- a/tika-core/src/main/java/org/apache/tika/extractor/ParsingEmbeddedDocumentExtractorFactory.java
+++
(tika) 01/02: Merge remote-tracking branch 'origin/main' into TIKA-4207
This is an automated email from the ASF dual-hosted git repository.

tallison pushed a commit to branch TIKA-4207
in repository https://gitbox.apache.org/repos/asf/tika.git

commit dae75c632055d980fdad047fe07dd745359fca3f
Merge: 7ca6d1759 08727d522
Author: tallison
AuthorDate: Thu Mar 21 12:21:52 2024 -0400

    Merge remote-tracking branch 'origin/main' into TIKA-4207

 .../src/main/java/org/apache/tika/cli/TikaCLI.java | 2 +-
 tika-core/src/main/java/org/apache/tika/Tika.java | 4 ++
 .../main/java/org/apache/tika/metadata/PDF.java | 4 ++
 .../org/apache/tika/mime/tika-mimetypes.xml | 53 +++
 tika-eval/tika-eval-app/pom.xml | 2 -
 .../org/apache/tika/eval/app/AbstractProfiler.java | 17 +-
 .../org/apache/tika/eval/app/ExtractProfiler.java | 4 ++
 .../java/org/apache/tika/eval/app/db/Cols.java | 3 ++
 tika-parent/pom.xml | 60 --
 .../ooxml/XSLFPowerPointExtractorDecorator.java | 3 +-
 .../apache/tika/parser/ocr/TesseractOCRParser.java | 20 ++--
 .../apache/tika/parser/pdf/AbstractPDF2XHTML.java | 6 +++
 .../org/apache/tika/parser/pdf/OCRPageCounter.java | 4 ++
 .../org/apache/tika/parser/pdf/PDFParserTest.java | 8 +++
 .../org/apache/tika/parser/pkg/PackageParser.java | 50 +-
 .../parser/microsoft/ooxml/TruncatedOOXMLTest.java | 4 +-
 .../tika/parser/ocr/TesseractOCRParserTest.java | 9
 .../apache/tika/parser/pkg/Seven7ParserTest.java | 3 +-
 .../pipes/reporters/jdbc/JDBCPipesReporter.java | 52 ++-
 .../apache/tika/server/core/TikaServerProcess.java | 2 +-
 .../tika/server/core/resource/TikaResource.java | 2 +-
 .../apache/tika/server/core/TikaVersionTest.java | 2 +-
 .../apache/tika/server/core/TikaWelcomeTest.java | 4 +-
 23 files changed, 193 insertions(+), 125 deletions(-)
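Note on the 02/02 commit of this push (BasicEmbeddedBytesSelector): the include/exclude idea can be sketched without any Tika types. The class below is hypothetical and follows one common convention, in which excludes win and an empty include set means "no restriction". That is an assumption of this sketch and not a copy of the commit: the select() method as posted returns false when includeEmbeddedResourceTypes is empty.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for the selector idea; no Tika types are used.
final class IncludeExcludeFilter {
    private final Set<String> include;
    private final Set<String> exclude;

    IncludeExcludeFilter(Set<String> include, Set<String> exclude) {
        this.include = Set.copyOf(include);
        this.exclude = Set.copyOf(exclude);
    }

    // Excludes win over includes; an empty include set accepts everything
    // that is not excluded (an assumption of this sketch).
    boolean accept(String value) {
        String v = (value == null) ? "" : value;
        if (exclude.contains(v)) {
            return false;
        }
        return include.isEmpty() || include.contains(v);
    }

    // Strips parameters such as "; charset=UTF-8" from a content type.
    static String baseType(String contentType) {
        if (contentType == null) {
            return "";
        }
        int semi = contentType.indexOf(';');
        String base = (semi < 0) ? contentType : contentType.substring(0, semi);
        return base.trim().toLowerCase(Locale.ROOT);
    }
}
```

Two such filters, one for mime types and one for embedded resource types, could be combined with a logical AND to approximate the two-stage check in the diff.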
(tika) branch branch_2x updated: TIKA-4217 -- require new line or white space as part of bitmap magic (#1674)
This is an automated email from the ASF dual-hosted git repository.

tallison pushed a commit to branch branch_2x
in repository https://gitbox.apache.org/repos/asf/tika.git

The following commit(s) were added to refs/heads/branch_2x by this push:
     new 61e55b7f1 TIKA-4217 -- require new line or white space as part of bitmap magic (#1674)
61e55b7f1 is described below

commit 61e55b7f1d48687a5d4329cc5c9cf360b19bfb58
Author: Tim Allison
AuthorDate: Thu Mar 21 11:33:31 2024 -0400

    TIKA-4217 -- require new line or white space as part of bitmap magic (#1674)

    (cherry picked from commit 08727d5224f7c663a19c572154939a2c140a6773)
---
 .../org/apache/tika/mime/tika-mimetypes.xml | 53 ++
 1 file changed, 44 insertions(+), 9 deletions(-)

diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
index 7a6d660e5..ca2dcaa6f 100644
--- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
+++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
@@ -6596,8 +6596,16 @@ PBM <_comment>Portable Bit Map - - + + + + + + + + + +
@@ -6607,9 +6615,16 @@ PGM <_comment>Portable Graymap Graphic - - - + + + + + + + + + +
@@ -6619,13 +6634,33 @@ PXM <_comment>UNIX Portable Bitmap Graphic - - - - + + + + + + + + + + + + +PAM +<_comment>UNIX Portable Bitmap Graphic Arbitrary Map + + + + + + + + + + DNG
(tika) branch TIKA-4217 deleted (was b0d539516)
This is an automated email from the ASF dual-hosted git repository.

tallison pushed a change to branch TIKA-4217
in repository https://gitbox.apache.org/repos/asf/tika.git

     was b0d539516 TIKA-4217 -- require new line or white space as part of bitmap magic

The revisions that were on this branch are still contained in
other references; therefore, this change does not discard any commits
from the repository.
(tika) branch main updated: TIKA-4217 -- require new line or white space as part of bitmap magic (#1674)
This is an automated email from the ASF dual-hosted git repository.

tallison pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git

The following commit(s) were added to refs/heads/main by this push:
     new 08727d522 TIKA-4217 -- require new line or white space as part of bitmap magic (#1674)
08727d522 is described below

commit 08727d5224f7c663a19c572154939a2c140a6773
Author: Tim Allison
AuthorDate: Thu Mar 21 11:33:31 2024 -0400

    TIKA-4217 -- require new line or white space as part of bitmap magic (#1674)
---
 .../org/apache/tika/mime/tika-mimetypes.xml | 53 ++
 1 file changed, 44 insertions(+), 9 deletions(-)

diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
index 675ba1180..7176332ef 100644
--- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
+++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
@@ -6702,8 +6702,16 @@ PBM <_comment>Portable Bit Map - - + + + + + + + + + +
@@ -6713,9 +6721,16 @@ PGM <_comment>Portable Graymap Graphic - - - + + + + + + + + + +
@@ -6725,13 +6740,33 @@ PXM <_comment>UNIX Portable Bitmap Graphic - - - - + + + + + + + + + + + + +PAM +<_comment>UNIX Portable Bitmap Graphic Arbitrary Map + + + + + + + + + + DNG
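Note on TIKA-4217: the XML of the magic rules did not survive in this archive, so the following is only an illustration of the idea in the commit title, namely that a Netpbm-style signature ("P" plus a digit) should be followed by a newline or other whitespace before it counts as a match. NetpbmMagic and looksLikeNetpbm are hypothetical names, and the accepted digit range (1 to 7, covering PBM, PGM, PPM and PAM) is an assumption of this sketch, not the contents of tika-mimetypes.xml.

```java
// Hypothetical check; not the rules from tika-mimetypes.xml.
final class NetpbmMagic {

    private NetpbmMagic() {
    }

    // True if the buffer starts with 'P', a digit 1-7, and one whitespace byte.
    static boolean looksLikeNetpbm(byte[] head) {
        if (head == null || head.length < 3) {
            return false;
        }
        if (head[0] != 'P') {
            return false;
        }
        if (head[1] < '1' || head[1] > '7') {
            return false;
        }
        byte b = head[2];
        return b == ' ' || b == '\t' || b == '\n' || b == '\r';
    }
}
```

Without the whitespace requirement, ordinary text that happens to begin with "P4" or "P6" would match a two-byte signature; with it, such text is rejected unless the third byte is whitespace.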
(tika) branch branch_2x updated: Bump log4j2 in prep for 2.x release -- fix bundle
This is an automated email from the ASF dual-hosted git repository.

tallison pushed a commit to branch branch_2x
in repository https://gitbox.apache.org/repos/asf/tika.git

The following commit(s) were added to refs/heads/branch_2x by this push:
     new 02ddacdc3 Bump log4j2 in prep for 2.x release -- fix bundle
02ddacdc3 is described below

commit 02ddacdc346c4b1e3a0f4f5769e272a8d47cd2ed
Author: tallison
AuthorDate: Thu Mar 21 11:26:57 2024 -0400

    Bump log4j2 in prep for 2.x release -- fix bundle
---
 tika-bundles/tika-bundle-standard/pom.xml | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tika-bundles/tika-bundle-standard/pom.xml b/tika-bundles/tika-bundle-standard/pom.xml
index cbaf6ab07..2028980e6 100644
--- a/tika-bundles/tika-bundle-standard/pom.xml
+++ b/tika-bundles/tika-bundle-standard/pom.xml
@@ -146,6 +146,8 @@
 <_runsystempackages>com.sun.xml.bind.marshaller, com.sun.xml.internal.bind.marshaller
 <_include>src/main/resources/META-INF/MANIFEST.MF
+
+<_bundleannotations>
 org.apache.tika.parser.internal.Activator
(tika) 01/02: TIKA-4211 -- first attempt (#1670)
This is an automated email from the ASF dual-hosted git repository.

tallison pushed a commit to branch branch_2x
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 5b79cfc5c675f3f169dea2521d69cc0565d8b3d0
Author: Tim Allison
AuthorDate: Thu Mar 21 08:42:37 2024 -0400

    TIKA-4211 -- first attempt (#1670)

    (cherry picked from commit 7dc3d28a5574f6e40981dca7666ccb97b9ebe467)
---
 .../tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java
index df566a284..38e9c8aac 100644
--- a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java
+++ b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java
@@ -344,7 +344,8 @@ public class XSLFPowerPointExtractorDecorator extends AbstractOOXMLExtractor {
         for (String relation : new String[]{XSLFRelation.VML_DRAWING.getRelation(),
                 XSLFRelation.SLIDE_LAYOUT.getRelation(), XSLFRelation.NOTES_MASTER.getRelation(),
-                XSLFRelation.NOTES.getRelation()}) {
+                XSLFRelation.NOTES.getRelation(), XSLFRelation.CHART.getRelation(),
+                XSLFRelation.DIAGRAM_DRAWING.getRelation()}) {
             try {
                 for (PackageRelationship packageRelationship : slidePart
                         .getRelationshipsByType(relation)) {
(tika) 02/02: Bump log4j2 in prep for 2.x release
This is an automated email from the ASF dual-hosted git repository.

tallison pushed a commit to branch branch_2x
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 4dc687cdcb3b1c84fb7371497aeebc6ff92bf6e1
Author: tallison
AuthorDate: Thu Mar 21 11:19:33 2024 -0400

    Bump log4j2 in prep for 2.x release
---
 tika-parent/pom.xml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tika-parent/pom.xml b/tika-parent/pom.xml
index 0b0319968..bafe3e4fa 100644
--- a/tika-parent/pom.xml
+++ b/tika-parent/pom.xml
@@ -365,7 +365,7 @@
 5.10.2
 7.5.5
 0.9.3
-2.20.0
+2.23.1
 1.18.20
 8.11.3
(tika) branch branch_2x updated (fad4b8cc2 -> 4dc687cdc)
This is an automated email from the ASF dual-hosted git repository.

tallison pushed a change to branch branch_2x
in repository https://gitbox.apache.org/repos/asf/tika.git

    from fad4b8cc2 TIKA-4216 (#1673)
     new 5b79cfc5c TIKA-4211 -- first attempt (#1670)
     new 4dc687cdc Bump log4j2 in prep for 2.x release

The 2 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails. The revisions
listed as "add" were already present in the repository and have only
been added to this reference.

Summary of changes:
 tika-parent/pom.xml | 2 +-
 .../tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)
(tika) branch branch_2x updated: TIKA-4216 (#1673)
This is an automated email from the ASF dual-hosted git repository.

tallison pushed a commit to branch branch_2x
in repository https://gitbox.apache.org/repos/asf/tika.git

The following commit(s) were added to refs/heads/branch_2x by this push:
     new fad4b8cc2 TIKA-4216 (#1673)
fad4b8cc2 is described below

commit fad4b8cc2e83a9bda94dbcf809d57706bbdaddb8
Author: Tim Allison
AuthorDate: Thu Mar 21 10:08:05 2024 -0400

    TIKA-4216 (#1673)

    * TIKA-4216 -- Avoid checking for imagemagick if image processing is disabled

    (cherry picked from commit 237e73f18f46af8322a910178fa8ed99e3710d8f)
---
 .../apache/tika/parser/ocr/TesseractOCRParser.java | 20 +---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-module/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-module/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
index a79e05b1d..a28ae8951 100644
--- a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-module/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
+++ b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-module/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
@@ -126,6 +126,8 @@ public class TesseractOCRParser extends AbstractExternalProcessParser implements
     })));
     private static volatile boolean HAS_WARNED = false;
 
+    private static volatile boolean HAS_CHECKED_FOR_IMAGE_MAGICK = false;
+
     //if a user specifies a custom tess path or tessdata path
     //load the available languages at initialization time
     private final Set<String> langs = new HashSet<>();
@@ -190,7 +192,10 @@ public class TesseractOCRParser extends AbstractExternalProcessParser implements
         return hasTesseract;
     }
 
-    boolean hasImageMagick() throws TikaConfigException {
+    synchronized boolean hasImageMagick() throws TikaConfigException {
+        if (HAS_CHECKED_FOR_IMAGE_MAGICK) {
+            return hasImageMagick;
+        }
         // Fetch where the config says to find ImageMagick Program
         String fullImageMagickPath = imageMagickPath + getImageMagickProg();
 
@@ -208,7 +213,7 @@ public class TesseractOCRParser extends AbstractExternalProcessParser implements
             LOG.debug("ImageMagick does not appear to be installed " + "(commandline: " + fullImageMagickPath + ")");
         }
-
+        HAS_CHECKED_FOR_IMAGE_MAGICK = true;
         return hasImageMagick;
     }
 
@@ -245,6 +250,11 @@ public class TesseractOCRParser extends AbstractExternalProcessParser implements
             return;
         }
 
+        //if you haven't checked yet, and a per file config requests imagemagick
+        //and if the default is not to use image processing
+        if (! HAS_CHECKED_FOR_IMAGE_MAGICK && config.isEnableImagePreprocessing()) {
+            hasImageMagick = hasImageMagick();
+        }
         try (TemporaryResources tmp = new TemporaryResources()) {
             TikaInputStream tikaStream = TikaInputStream.get(stream, tmp, metadata);
 
@@ -528,7 +538,11 @@ public class TesseractOCRParser extends AbstractExternalProcessParser implements
     @Override
     public void initialize(Map<String, Param> params) throws TikaConfigException {
         hasTesseract = hasTesseract();
-        hasImageMagick = hasImageMagick();
+        if (isEnableImagePreprocessing()) {
+            hasImageMagick = hasImageMagick();
+        } else {
+            hasImageMagick = false;
+        }
         if (preloadLangs) {
             preloadLangs();
             if (!StringUtils.isBlank(defaultConfig.getLanguage())) {
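Note on TIKA-4216: the diff above defers the ImageMagick probe until image preprocessing is actually requested, and guards it with a volatile "has checked" flag inside a synchronized method. A minimal sketch of that run-once pattern follows. LazyProbe and its members are hypothetical names; unlike the diff, the flag here is an instance field rather than a static one, and the probe is passed in as a BooleanSupplier instead of launching an external process.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch of a run-once probe; names do not come from Tika.
final class LazyProbe {
    private final BooleanSupplier probe;
    // volatile so that an unsynchronized hasChecked() sees the latest value
    private volatile boolean checked = false;
    private boolean result = false;

    LazyProbe(BooleanSupplier probe) {
        this.probe = probe;
    }

    // Cheap test that callers can use before deciding to trigger the probe.
    boolean hasChecked() {
        return checked;
    }

    // synchronized so that two threads cannot both run the probe.
    synchronized boolean isAvailable() {
        if (checked) {
            return result;
        }
        result = probe.getAsBoolean();
        checked = true;
        return result;
    }
}
```

The design choice mirrored here is that the expensive check (in the commit, looking for an external program) runs at most once and only on first use, so configurations that never enable the feature never pay for it.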
(tika) 01/01: TIKA-4217 -- require new line or white space as part of bitmap magic
This is an automated email from the ASF dual-hosted git repository.

tallison pushed a commit to branch TIKA-4217
in repository https://gitbox.apache.org/repos/asf/tika.git

commit b0d5395162c837dc308171d5bbd589f0248b49dd
Author: tallison
AuthorDate: Thu Mar 21 10:41:23 2024 -0400

    TIKA-4217 -- require new line or white space as part of bitmap magic
---
 .../org/apache/tika/mime/tika-mimetypes.xml | 53 ++
 1 file changed, 44 insertions(+), 9 deletions(-)

diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
index 675ba1180..7176332ef 100644
--- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
+++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
@@ -6702,8 +6702,16 @@ PBM <_comment>Portable Bit Map - - + + + + + + + + + +
@@ -6713,9 +6721,16 @@ PGM <_comment>Portable Graymap Graphic - - - + + + + + + + + + +
@@ -6725,13 +6740,33 @@ PXM <_comment>UNIX Portable Bitmap Graphic - - - - + + + + + + + + + + + + +PAM +<_comment>UNIX Portable Bitmap Graphic Arbitrary Map + + + + + + + + + + DNG
(tika) branch TIKA-4217 created (now b0d539516)
This is an automated email from the ASF dual-hosted git repository.

tallison pushed a change to branch TIKA-4217
in repository https://gitbox.apache.org/repos/asf/tika.git

      at b0d539516 TIKA-4217 -- require new line or white space as part of bitmap magic

This branch includes the following new commits:

     new b0d539516 TIKA-4217 -- require new line or white space as part of bitmap magic

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails. The revisions
listed as "add" were already present in the repository and have only
been added to this reference.
(tika) branch branch_2x updated: TIKA-4215 -- avoid loading all the tika resources just to get the version (#1672)
This is an automated email from the ASF dual-hosted git repository. tallison pushed a commit to branch branch_2x in repository https://gitbox.apache.org/repos/asf/tika.git The following commit(s) were added to refs/heads/branch_2x by this push: new bf0006163 TIKA-4215 -- avoid loading all the tika resources just to get the version (#1672) bf0006163 is described below commit bf0006163b0d053abb8a79c6146aae30fcfcc46d Author: Tim Allison AuthorDate: Thu Mar 21 10:06:57 2024 -0400 TIKA-4215 -- avoid loading all the tika resources just to get the version (#1672) (cherry picked from commit 85d713a9a671d1e8c31bb4a78c830616c0b3eab5) --- tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java | 2 +- tika-core/src/main/java/org/apache/tika/Tika.java | 4 .../src/main/java/org/apache/tika/server/core/TikaServerProcess.java | 2 +- .../main/java/org/apache/tika/server/core/resource/TikaResource.java | 2 +- .../src/test/java/org/apache/tika/server/core/TikaVersionTest.java| 2 +- .../src/test/java/org/apache/tika/server/core/TikaWelcomeTest.java| 4 ++-- 6 files changed, 10 insertions(+), 6 deletions(-) diff --git a/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java b/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java index eac0e4f9b..3be3da0f9 100644 --- a/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java +++ b/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java @@ -657,7 +657,7 @@ public class TikaCLI { } private void version() { -System.out.println(new Tika().toString()); +System.out.println(Tika.getString()); } private boolean testForHelp(String[] args) { diff --git a/tika-core/src/main/java/org/apache/tika/Tika.java b/tika-core/src/main/java/org/apache/tika/Tika.java index 601703e43..22811f9c0 100644 --- a/tika-core/src/main/java/org/apache/tika/Tika.java +++ b/tika-core/src/main/java/org/apache/tika/Tika.java @@ -672,6 +672,10 @@ public class Tika { //--< Object > public String toString() { +return getString(); +} + +public static String getString() { 
String version = null; try (InputStream stream = Tika.class diff --git a/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/TikaServerProcess.java b/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/TikaServerProcess.java index a6ba72e81..94b285025 100644 --- a/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/TikaServerProcess.java +++ b/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/TikaServerProcess.java @@ -123,7 +123,7 @@ public class TikaServerProcess { } public static void main(String[] args) throws Exception { -LOG.info("Starting {} server", new Tika()); +LOG.info("Starting {} server", Tika.getString()); try { Options options = getOptions(); CommandLineParser cliParser = new DefaultParser(); diff --git a/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/TikaResource.java b/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/TikaResource.java index aadf86f30..868af43dc 100644 --- a/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/TikaResource.java +++ b/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/TikaResource.java @@ -91,7 +91,7 @@ import org.apache.tika.utils.ExceptionUtils; public class TikaResource { public static final String GREETING = -"This is Tika Server (" + new Tika().toString() + "). Please PUT\n"; +"This is Tika Server (" + Tika.getString() + "). 
Please PUT\n"; private static final String META_PREFIX = "meta_"; private static final Logger LOG = LoggerFactory.getLogger(TikaResource.class); private static Pattern ALLOWABLE_HEADER_CHARS = Pattern.compile("(?i)^[-/_+\\.A-Z0-9 ]+$"); diff --git a/tika-server/tika-server-core/src/test/java/org/apache/tika/server/core/TikaVersionTest.java b/tika-server/tika-server-core/src/test/java/org/apache/tika/server/core/TikaVersionTest.java index e4e623fd3..f10948243 100644 --- a/tika-server/tika-server-core/src/test/java/org/apache/tika/server/core/TikaVersionTest.java +++ b/tika-server/tika-server-core/src/test/java/org/apache/tika/server/core/TikaVersionTest.java @@ -49,7 +49,7 @@ public class TikaVersionTest extends CXFTestBase { WebClient.create(endPoint + VERSION_PATH).type("text/plain").accept("text/plain") .get(); -assertEquals(new Tika().toString(), +assertEquals(Tika.getString(), getStringFromInputStream((InputStream) response.getEntity())); } } diff --git
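Note on TIKA-4215: the commit above replaces `new Tika().toString()` with a static `Tika.getString()`, so callers that only want the version string no longer construct a full facade. The following is a minimal sketch of that pattern, not the Tika source. `HeavyFacade`, its `INSTANCES` counter and the fixed version string are hypothetical names introduced for illustration only.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a facade whose constructor is expensive.
class HeavyFacade {
    // Counts constructions so the demo can show that none happen.
    static final AtomicInteger INSTANCES = new AtomicInteger();

    HeavyFacade() {
        // Stands in for loading parsers, detectors and other resources.
        INSTANCES.incrementAndGet();
    }

    // Static accessor: needs no instance, so nothing heavy is loaded.
    static String getString() {
        return "Example Facade 1.0"; // assumed, fixed version string
    }

    // The instance method delegates, so existing callers see the same text.
    @Override
    public String toString() {
        return getString();
    }
}

class VersionDemo {
    public static void main(String[] args) {
        System.out.println(HeavyFacade.getString());
        System.out.println("instances created: " + HeavyFacade.INSTANCES.get());
    }
}
```

Keeping `toString()` as a thin delegate means the two entry points cannot drift apart, which is presumably why the tests in the commit compare against the static method.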
(tika) branch main updated: TIKA-4216 (#1673)
This is an automated email from the ASF dual-hosted git repository. tallison pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/tika.git The following commit(s) were added to refs/heads/main by this push: new 237e73f18 TIKA-4216 (#1673) 237e73f18 is described below commit 237e73f18f46af8322a910178fa8ed99e3710d8f Author: Tim Allison AuthorDate: Thu Mar 21 10:08:05 2024 -0400 TIKA-4216 (#1673) * TIKA-4216 -- Avoid checking for imagemagick if image processing is disabled --- .../apache/tika/parser/ocr/TesseractOCRParser.java | 20 +--- 1 file changed, 17 insertions(+), 3 deletions(-) diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-module/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-module/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java index a79e05b1d..a28ae8951 100644 --- a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-module/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java +++ b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-module/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java @@ -126,6 +126,8 @@ public class TesseractOCRParser extends AbstractExternalProcessParser implements }))); private static volatile boolean HAS_WARNED = false; +private static volatile boolean HAS_CHECKED_FOR_IMAGE_MAGICK = false; + //if a user specifies a custom tess path or tessdata path //load the available languages at initialization time private final Set langs = new HashSet<>(); @@ -190,7 +192,10 @@ public class TesseractOCRParser extends AbstractExternalProcessParser implements return hasTesseract; } -boolean hasImageMagick() throws TikaConfigException { +synchronized boolean hasImageMagick() throws TikaConfigException { +if (HAS_CHECKED_FOR_IMAGE_MAGICK) { +return hasImageMagick; +} // Fetch 
where the config says to find ImageMagick Program String fullImageMagickPath = imageMagickPath + getImageMagickProg(); @@ -208,7 +213,7 @@ public class TesseractOCRParser extends AbstractExternalProcessParser implements LOG.debug("ImageMagick does not appear to be installed " + "(commandline: " + fullImageMagickPath + ")"); } - +HAS_CHECKED_FOR_IMAGE_MAGICK = true; return hasImageMagick; } @@ -245,6 +250,11 @@ public class TesseractOCRParser extends AbstractExternalProcessParser implements return; } +//if you haven't checked yet, and a per file config requests imagemagick +//and if the default is not to use image processing +if (! HAS_CHECKED_FOR_IMAGE_MAGICK && config.isEnableImagePreprocessing()) { +hasImageMagick = hasImageMagick(); +} try (TemporaryResources tmp = new TemporaryResources()) { TikaInputStream tikaStream = TikaInputStream.get(stream, tmp, metadata); @@ -528,7 +538,11 @@ public class TesseractOCRParser extends AbstractExternalProcessParser implements @Override public void initialize(Map params) throws TikaConfigException { hasTesseract = hasTesseract(); -hasImageMagick = hasImageMagick(); +if (isEnableImagePreprocessing()) { +hasImageMagick = hasImageMagick(); +} else { +hasImageMagick = false; +} if (preloadLangs) { preloadLangs(); if (!StringUtils.isBlank(defaultConfig.getLanguage())) {
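Note on TIKA-4216: the diff above skips the ImageMagick probe at initialization when image preprocessing is disabled, and runs it lazily, at most once, if a per-file config later asks for preprocessing. The sketch below illustrates that control flow under simplifying assumptions: `LazyToolCheck`, `probe()`, `handle()` and `probeCount` are hypothetical, the probe always succeeds, and the "checked" flag is an instance field here, whereas the commit uses a static volatile flag.

```java
// Hypothetical sketch of a lazy, run-once availability check.
class LazyToolCheck {
    private volatile boolean hasChecked = false; // instance-level in this sketch
    private boolean hasTool = false;
    private final boolean enabledByDefault;
    int probeCount = 0; // exposed only so the demo can count probes

    LazyToolCheck(boolean enabledByDefault) {
        this.enabledByDefault = enabledByDefault;
    }

    // Stand-in for launching an external process to test for the tool.
    private boolean probe() {
        probeCount++;
        return true;
    }

    // Runs the probe on the first call and returns the cached answer afterwards.
    synchronized boolean hasTool() {
        if (hasChecked) {
            return hasTool;
        }
        hasTool = probe();
        hasChecked = true;
        return hasTool;
    }

    // At startup, probe only if the default config needs the tool.
    void initialize() {
        if (enabledByDefault) {
            hasTool = hasTool();
        } else {
            hasTool = false;
        }
    }

    // Per request: probe lazily if this request needs the tool and nobody has checked yet.
    void handle(boolean enabledForThisRequest) {
        if (!hasChecked && enabledForThisRequest) {
            hasTool = hasTool();
        }
    }

    public static void main(String[] args) {
        LazyToolCheck check = new LazyToolCheck(false);
        check.initialize();
        System.out.println("probes after initialize: " + check.probeCount);
        check.handle(true);
        check.handle(true);
        System.out.println("probes after two requests: " + check.probeCount);
    }
}
```

With preprocessing disabled by default, initialization performs no probe, and two requests that enable it trigger exactly one.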
(tika) branch TIKA-4216 deleted (was 5a4eba49e)
This is an automated email from the ASF dual-hosted git repository. tallison pushed a change to branch TIKA-4216 in repository https://gitbox.apache.org/repos/asf/tika.git was 5a4eba49e TIKA-4216 -- Avoid checking for imagemagick if image processing is disabled -- allow for user specified calls to use imagemagick The revisions that were on this branch are still contained in other references; therefore, this change does not discard any commits from the repository.
(tika) branch TIKA-4215 deleted (was f819cbb43)
This is an automated email from the ASF dual-hosted git repository. tallison pushed a change to branch TIKA-4215 in repository https://gitbox.apache.org/repos/asf/tika.git was f819cbb43 TIKA-4215 -- avoid loading all the tika resources just to get the version The revisions that were on this branch are still contained in other references; therefore, this change does not discard any commits from the repository.
(tika) branch main updated: TIKA-4215 -- avoid loading all the tika resources just to get the version (#1672)
This is an automated email from the ASF dual-hosted git repository. tallison pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/tika.git The following commit(s) were added to refs/heads/main by this push: new 85d713a9a TIKA-4215 -- avoid loading all the tika resources just to get the version (#1672) 85d713a9a is described below commit 85d713a9a671d1e8c31bb4a78c830616c0b3eab5 Author: Tim Allison AuthorDate: Thu Mar 21 10:06:57 2024 -0400 TIKA-4215 -- avoid loading all the tika resources just to get the version (#1672) --- tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java | 2 +- tika-core/src/main/java/org/apache/tika/Tika.java | 4 .../src/main/java/org/apache/tika/server/core/TikaServerProcess.java | 2 +- .../main/java/org/apache/tika/server/core/resource/TikaResource.java | 2 +- .../src/test/java/org/apache/tika/server/core/TikaVersionTest.java| 2 +- .../src/test/java/org/apache/tika/server/core/TikaWelcomeTest.java| 4 ++-- 6 files changed, 10 insertions(+), 6 deletions(-) diff --git a/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java b/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java index 6ae0f8ca7..bd78d4338 100644 --- a/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java +++ b/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java @@ -657,7 +657,7 @@ public class TikaCLI { } private void version() { -System.out.println(new Tika().toString()); +System.out.println(Tika.getString()); } private boolean testForHelp(String[] args) { diff --git a/tika-core/src/main/java/org/apache/tika/Tika.java b/tika-core/src/main/java/org/apache/tika/Tika.java index 601703e43..22811f9c0 100644 --- a/tika-core/src/main/java/org/apache/tika/Tika.java +++ b/tika-core/src/main/java/org/apache/tika/Tika.java @@ -672,6 +672,10 @@ public class Tika { //--< Object > public String toString() { +return getString(); +} + +public static String getString() { String version = null; try (InputStream stream = Tika.class diff --git 
a/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/TikaServerProcess.java b/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/TikaServerProcess.java index f5c3cca3a..10fb951e0 100644 --- a/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/TikaServerProcess.java +++ b/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/TikaServerProcess.java @@ -120,7 +120,7 @@ public class TikaServerProcess { } public static void main(String[] args) throws Exception { -LOG.info("Starting {} server", new Tika()); +LOG.info("Starting {} server", Tika.getString()); try { Options options = getOptions(); CommandLineParser cliParser = new DefaultParser(); diff --git a/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/TikaResource.java b/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/TikaResource.java index 857692750..5f0e76ec8 100644 --- a/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/TikaResource.java +++ b/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/TikaResource.java @@ -91,7 +91,7 @@ import org.apache.tika.utils.ExceptionUtils; public class TikaResource { public static final String GREETING = -"This is Tika Server (" + new Tika().toString() + "). Please PUT\n"; +"This is Tika Server (" + Tika.getString() + "). 
Please PUT\n"; private static final String META_PREFIX = "meta_"; private static final Logger LOG = LoggerFactory.getLogger(TikaResource.class); private static Pattern ALLOWABLE_HEADER_CHARS = Pattern.compile("(?i)^[-/_+\\.A-Z0-9 ]+$"); diff --git a/tika-server/tika-server-core/src/test/java/org/apache/tika/server/core/TikaVersionTest.java b/tika-server/tika-server-core/src/test/java/org/apache/tika/server/core/TikaVersionTest.java index b1a81f230..ed7471f50 100644 --- a/tika-server/tika-server-core/src/test/java/org/apache/tika/server/core/TikaVersionTest.java +++ b/tika-server/tika-server-core/src/test/java/org/apache/tika/server/core/TikaVersionTest.java @@ -49,7 +49,7 @@ public class TikaVersionTest extends CXFTestBase { WebClient.create(endPoint + VERSION_PATH).type("text/plain").accept("text/plain") .get(); -assertEquals(new Tika().toString(), +assertEquals(Tika.getString(), getStringFromInputStream((InputStream) response.getEntity())); } } diff --git a/tika-server/tika-server-core/src/test/java/org/apache/tika/server/core/TikaWelcomeTest.java
(tika) branch TIKA-4216 updated (ba2d729af -> 5a4eba49e)
This is an automated email from the ASF dual-hosted git repository.

tallison pushed a change to branch TIKA-4216
in repository https://gitbox.apache.org/repos/asf/tika.git

    from ba2d729af TIKA-4216 -- Avoid checking for imagemagick if image processing is disabled
     add 5a4eba49e TIKA-4216 -- Avoid checking for imagemagick if image processing is disabled -- allow for user specified calls to use imagemagick

No new revisions were added by this update.

Summary of changes:
 .../org/apache/tika/parser/ocr/TesseractOCRParser.java | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)
(tika) branch TIKA-4216 created (now ba2d729af)
This is an automated email from the ASF dual-hosted git repository. tallison pushed a change to branch TIKA-4216 in repository https://gitbox.apache.org/repos/asf/tika.git at ba2d729af TIKA-4216 -- Avoid checking for imagemagick if image processing is disabled This branch includes the following new commits: new ba2d729af TIKA-4216 -- Avoid checking for imagemagick if image processing is disabled The 1 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference.
(tika) 01/01: TIKA-4216 -- Avoid checking for imagemagick if image processing is disabled
This is an automated email from the ASF dual-hosted git repository.

tallison pushed a commit to branch TIKA-4216
in repository https://gitbox.apache.org/repos/asf/tika.git

commit ba2d729af6b3194b9ba81d4041016f6d8e870e99
Author: tallison
AuthorDate: Thu Mar 21 09:30:24 2024 -0400

    TIKA-4216 -- Avoid checking for imagemagick if image processing is disabled
---
 .../main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java| 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-module/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-module/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
index a79e05b1d..aa26f4688 100644
--- a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-module/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
+++ b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-module/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
@@ -528,7 +528,11 @@ public class TesseractOCRParser extends AbstractExternalProcessParser implements
     @Override
     public void initialize(Map<String, Param> params) throws TikaConfigException {
         hasTesseract = hasTesseract();
-        hasImageMagick = hasImageMagick();
+        if (isEnableImagePreprocessing()) {
+            hasImageMagick = hasImageMagick();
+        } else {
+            hasImageMagick = false;
+        }
         if (preloadLangs) {
             preloadLangs();
             if (!StringUtils.isBlank(defaultConfig.getLanguage())) {
(tika) branch TIKA-4215 created (now f819cbb43)
This is an automated email from the ASF dual-hosted git repository. tallison pushed a change to branch TIKA-4215 in repository https://gitbox.apache.org/repos/asf/tika.git at f819cbb43 TIKA-4215 -- avoid loading all the tika resources just to get the version This branch includes the following new commits: new f819cbb43 TIKA-4215 -- avoid loading all the tika resources just to get the version The 1 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference.
(tika) 01/01: TIKA-4215 -- avoid loading all the tika resources just to get the version
This is an automated email from the ASF dual-hosted git repository. tallison pushed a commit to branch TIKA-4215 in repository https://gitbox.apache.org/repos/asf/tika.git commit f819cbb431646baebc68e07a1771e768ca54a04a Author: tallison AuthorDate: Thu Mar 21 09:25:41 2024 -0400 TIKA-4215 -- avoid loading all the tika resources just to get the version --- tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java | 2 +- tika-core/src/main/java/org/apache/tika/Tika.java | 4 .../src/main/java/org/apache/tika/server/core/TikaServerProcess.java | 2 +- .../main/java/org/apache/tika/server/core/resource/TikaResource.java | 2 +- .../src/test/java/org/apache/tika/server/core/TikaVersionTest.java| 2 +- .../src/test/java/org/apache/tika/server/core/TikaWelcomeTest.java| 4 ++-- 6 files changed, 10 insertions(+), 6 deletions(-) diff --git a/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java b/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java index 6ae0f8ca7..bd78d4338 100644 --- a/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java +++ b/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java @@ -657,7 +657,7 @@ public class TikaCLI { } private void version() { -System.out.println(new Tika().toString()); +System.out.println(Tika.getString()); } private boolean testForHelp(String[] args) { diff --git a/tika-core/src/main/java/org/apache/tika/Tika.java b/tika-core/src/main/java/org/apache/tika/Tika.java index 601703e43..22811f9c0 100644 --- a/tika-core/src/main/java/org/apache/tika/Tika.java +++ b/tika-core/src/main/java/org/apache/tika/Tika.java @@ -672,6 +672,10 @@ public class Tika { //--< Object > public String toString() { +return getString(); +} + +public static String getString() { String version = null; try (InputStream stream = Tika.class diff --git a/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/TikaServerProcess.java b/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/TikaServerProcess.java index 
f5c3cca3a..10fb951e0 100644 --- a/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/TikaServerProcess.java +++ b/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/TikaServerProcess.java @@ -120,7 +120,7 @@ public class TikaServerProcess { } public static void main(String[] args) throws Exception { -LOG.info("Starting {} server", new Tika()); +LOG.info("Starting {} server", Tika.getString()); try { Options options = getOptions(); CommandLineParser cliParser = new DefaultParser(); diff --git a/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/TikaResource.java b/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/TikaResource.java index 857692750..5f0e76ec8 100644 --- a/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/TikaResource.java +++ b/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/TikaResource.java @@ -91,7 +91,7 @@ import org.apache.tika.utils.ExceptionUtils; public class TikaResource { public static final String GREETING = -"This is Tika Server (" + new Tika().toString() + "). Please PUT\n"; +"This is Tika Server (" + Tika.getString() + "). 
Please PUT\n"; private static final String META_PREFIX = "meta_"; private static final Logger LOG = LoggerFactory.getLogger(TikaResource.class); private static Pattern ALLOWABLE_HEADER_CHARS = Pattern.compile("(?i)^[-/_+\\.A-Z0-9 ]+$"); diff --git a/tika-server/tika-server-core/src/test/java/org/apache/tika/server/core/TikaVersionTest.java b/tika-server/tika-server-core/src/test/java/org/apache/tika/server/core/TikaVersionTest.java index b1a81f230..ed7471f50 100644 --- a/tika-server/tika-server-core/src/test/java/org/apache/tika/server/core/TikaVersionTest.java +++ b/tika-server/tika-server-core/src/test/java/org/apache/tika/server/core/TikaVersionTest.java @@ -49,7 +49,7 @@ public class TikaVersionTest extends CXFTestBase { WebClient.create(endPoint + VERSION_PATH).type("text/plain").accept("text/plain") .get(); -assertEquals(new Tika().toString(), +assertEquals(Tika.getString(), getStringFromInputStream((InputStream) response.getEntity())); } } diff --git a/tika-server/tika-server-core/src/test/java/org/apache/tika/server/core/TikaWelcomeTest.java b/tika-server/tika-server-core/src/test/java/org/apache/tika/server/core/TikaWelcomeTest.java index 3c97d329c..428ec71f0 100644 --- a/tika-server/tika-server-core/src/test/java/org/apache/tika/server/core/TikaWelcomeTest.java +++
(tika) branch TIKA-4211 deleted (was 2239ea9a8)
This is an automated email from the ASF dual-hosted git repository. tallison pushed a change to branch TIKA-4211 in repository https://gitbox.apache.org/repos/asf/tika.git was 2239ea9a8 TIKA-4211 -- first attempt The revisions that were on this branch are still contained in other references; therefore, this change does not discard any commits from the repository.
(tika) branch main updated: TIKA-4211 -- first attempt (#1670)
This is an automated email from the ASF dual-hosted git repository.

tallison pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git

The following commit(s) were added to refs/heads/main by this push:
     new 7dc3d28a5 TIKA-4211 -- first attempt (#1670)
7dc3d28a5 is described below

commit 7dc3d28a5574f6e40981dca7666ccb97b9ebe467
Author: Tim Allison
AuthorDate: Thu Mar 21 08:42:37 2024 -0400

    TIKA-4211 -- first attempt (#1670)
---
 .../tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java
index df566a284..38e9c8aac 100644
--- a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java
+++ b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java
@@ -344,7 +344,8 @@ public class XSLFPowerPointExtractorDecorator extends AbstractOOXMLExtractor {
         for (String relation : new String[]{XSLFRelation.VML_DRAWING.getRelation(),
                 XSLFRelation.SLIDE_LAYOUT.getRelation(),
                 XSLFRelation.NOTES_MASTER.getRelation(),
-                XSLFRelation.NOTES.getRelation()}) {
+                XSLFRelation.NOTES.getRelation(), XSLFRelation.CHART.getRelation(),
+                XSLFRelation.DIAGRAM_DRAWING.getRelation()}) {
             try {
                 for (PackageRelationship packageRelationship : slidePart
                         .getRelationshipsByType(relation)) {
(tika) branch TIKA-4213 deleted (was bfaecac53)
This is an automated email from the ASF dual-hosted git repository. tallison pushed a change to branch TIKA-4213 in repository https://gitbox.apache.org/repos/asf/tika.git was bfaecac53 TIKA-4213 -- use more standard sql -- timestamp with time zone The revisions that were on this branch are still contained in other references; therefore, this change does not discard any commits from the repository.
(tika) branch main updated: TIKA-4213 -- improve jdbc pipes reporter (#1669)
This is an automated email from the ASF dual-hosted git repository. tallison pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/tika.git The following commit(s) were added to refs/heads/main by this push: new e63730e12 TIKA-4213 -- improve jdbc pipes reporter (#1669) e63730e12 is described below commit e63730e126e74b4ac36e5f2b8c6790963eb41c14 Author: Tim Allison AuthorDate: Thu Mar 21 08:42:25 2024 -0400 TIKA-4213 -- improve jdbc pipes reporter (#1669) * TIKA-4213 -- improve jdbc pipes reporter --- .../pipes/reporters/jdbc/JDBCPipesReporter.java| 52 -- 1 file changed, 29 insertions(+), 23 deletions(-) diff --git a/tika-pipes/tika-pipes-reporters/tika-pipes-reporter-jdbc/src/main/java/org/apache/tika/pipes/reporters/jdbc/JDBCPipesReporter.java b/tika-pipes/tika-pipes-reporters/tika-pipes-reporter-jdbc/src/main/java/org/apache/tika/pipes/reporters/jdbc/JDBCPipesReporter.java index 0082eb9de..ee52bf80f 100644 --- a/tika-pipes/tika-pipes-reporters/tika-pipes-reporter-jdbc/src/main/java/org/apache/tika/pipes/reporters/jdbc/JDBCPipesReporter.java +++ b/tika-pipes/tika-pipes-reporters/tika-pipes-reporter-jdbc/src/main/java/org/apache/tika/pipes/reporters/jdbc/JDBCPipesReporter.java @@ -22,6 +22,8 @@ import java.sql.DriverManager; import java.sql.PreparedStatement; import java.sql.SQLException; import java.sql.Statement; +import java.sql.Timestamp; +import java.time.Instant; import java.util.ArrayList; import java.util.List; import java.util.Map; @@ -68,7 +70,7 @@ public class JDBCPipesReporter extends PipesReporterBase implements Initializabl private String connectionString; private Optional postConnectionString = Optional.empty(); -private final ArrayBlockingQueue queue = +private final ArrayBlockingQueue queue = new ArrayBlockingQueue(ARRAY_BLOCKING_QUEUE_SIZE); CompletableFuture reportWorkerFuture; @@ -146,7 +148,7 @@ public class JDBCPipesReporter extends PipesReporterBase implements Initializabl return; } try { -queue.offer(new 
KeyStatusPair(t.getEmitKey().getEmitKey(), result.getStatus()), +queue.offer(new IdStatusPair(t.getId(), result.getStatus()), MAX_WAIT_MILLIS, TimeUnit.MILLISECONDS); } catch (InterruptedException e) { //swallow @@ -167,7 +169,7 @@ public class JDBCPipesReporter extends PipesReporterBase implements Initializabl @Override public void close() throws IOException { try { -queue.offer(KeyStatusPair.END_SEMAPHORE, 60, TimeUnit.SECONDS); +queue.offer(IdStatusPair.END_SEMAPHORE, 60, TimeUnit.SECONDS); } catch (InterruptedException e) { return; } @@ -186,20 +188,20 @@ public class JDBCPipesReporter extends PipesReporterBase implements Initializabl } } -private static class KeyStatusPair { +private static class IdStatusPair { -static KeyStatusPair END_SEMAPHORE = new KeyStatusPair(null, null); -private final String emitKey; +static IdStatusPair END_SEMAPHORE = new IdStatusPair(null, null); +private final String id; private final PipesResult.STATUS status; -public KeyStatusPair(String emitKey, PipesResult.STATUS status) { -this.emitKey = emitKey; +public IdStatusPair(String id, PipesResult.STATUS status) { +this.id = id; this.status = status; } @Override public String toString() { -return "KeyStatusPair{" + "emitKey='" + emitKey + '\'' + ", status=" + status + '}'; +return "KeyStatusPair{" + "id='" + id + '\'' + ", status=" + status + '}'; } } @@ -208,18 +210,18 @@ public class JDBCPipesReporter extends PipesReporterBase implements Initializabl private static final int MAX_TRIES = 3; private final String connectionString; private final Optional postConnectionString; -private final ArrayBlockingQueue queue; +private final ArrayBlockingQueue queue; private final int cacheSize; private final long reportWithinMs; -List cache = new ArrayList<>(); +List cache = new ArrayList<>(); private Connection connection; private PreparedStatement insert; public ReportWorker(String connectionString, Optional postConnectionString, -ArrayBlockingQueue queue, int cacheSize, +ArrayBlockingQueue 
queue, int cacheSize, long reportWithinMs) { this.connectionString = connectionString; this.postConnectionString = postConnectionString; @@ -242,18 +244,19 @@ public class JDBCPipesReporter extends PipesReporterBase implements Initializabl public void run() {
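Note on TIKA-4213: the reporter above hands id/status pairs to a worker through an `ArrayBlockingQueue` and signals shutdown by enqueueing a shared `END_SEMAPHORE` instance. The sketch below shows that sentinel technique in isolation, under assumptions: `IdStatus`, `drain()` and the string statuses are hypothetical simplifications, and the worker here only collects entries rather than writing to a database.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.TimeUnit;

class SentinelQueueDemo {
    // Hypothetical simplified pair; the status is a plain String in this sketch.
    static final class IdStatus {
        // Shared marker instance that tells the consumer to stop.
        static final IdStatus END = new IdStatus(null, null);
        final String id;
        final String status;

        IdStatus(String id, String status) {
            this.id = id;
            this.status = status;
        }
    }

    // Consumes entries until the END marker arrives, then returns what it saw.
    static List<String> drain(ArrayBlockingQueue<IdStatus> queue) {
        List<String> seen = new ArrayList<>();
        try {
            while (true) {
                IdStatus next = queue.poll(1, TimeUnit.SECONDS);
                if (next == null) {
                    continue; // nothing yet, keep waiting
                }
                if (next == IdStatus.END) {
                    return seen; // identity comparison against the marker
                }
                seen.add(next.id + "=" + next.status);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return seen;
        }
    }

    public static void main(String[] args) {
        ArrayBlockingQueue<IdStatus> queue = new ArrayBlockingQueue<>(10);
        queue.add(new IdStatus("doc-1", "OK"));
        queue.add(new IdStatus("doc-2", "TIMEOUT"));
        queue.add(IdStatus.END);
        System.out.println(drain(queue));
    }
}
```

Comparing by identity (`==`) rather than by field values keeps a genuine entry with null fields from being mistaken for the stop marker.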
(tika) branch branch_2x updated: TIKA-4162: update maven-plugin-annotations
This is an automated email from the ASF dual-hosted git repository. tilman pushed a commit to branch branch_2x in repository https://gitbox.apache.org/repos/asf/tika.git The following commit(s) were added to refs/heads/branch_2x by this push: new 49fedc6b8 TIKA-4162: update maven-plugin-annotations 49fedc6b8 is described below commit 49fedc6b8cb5a7e55c40857afbf905af120c9dfd Author: Tilman Hausherr AuthorDate: Thu Mar 21 10:10:00 2024 +0100 TIKA-4162: update maven-plugin-annotations --- tika-parent/pom.xml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tika-parent/pom.xml b/tika-parent/pom.xml index db47346ab..0b0319968 100644 --- a/tika-parent/pom.xml +++ b/tika-parent/pom.xml @@ -963,7 +963,7 @@ org.apache.maven.plugin-tools maven-plugin-annotations -3.10.2 +3.11.0
(tika) branch branch_2x updated: TIKA-4162: update aws
This is an automated email from the ASF dual-hosted git repository. tilman pushed a commit to branch branch_2x in repository https://gitbox.apache.org/repos/asf/tika.git The following commit(s) were added to refs/heads/branch_2x by this push: new 92cf36f8d TIKA-4162: update aws 92cf36f8d is described below commit 92cf36f8d9cf6991fbeb092f10d2acb1dde58473 Author: Tilman Hausherr AuthorDate: Thu Mar 21 09:36:59 2024 +0100 TIKA-4162: update aws --- tika-parent/pom.xml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tika-parent/pom.xml b/tika-parent/pom.xml index 681a91144..db47346ab 100644 --- a/tika-parent/pom.xml +++ b/tika-parent/pom.xml @@ -307,7 +307,7 @@ 2.36.0 -1.12.683 +1.12.684
(tika) branch dependabot/maven/aws.version-1.12.684 deleted (was 9ea184af5)
This is an automated email from the ASF dual-hosted git repository. github-bot pushed a change to branch dependabot/maven/aws.version-1.12.684 in repository https://gitbox.apache.org/repos/asf/tika.git was 9ea184af5 Bump aws.version from 1.12.683 to 1.12.684 The revisions that were on this branch are still contained in other references; therefore, this change does not discard any commits from the repository.
(tika) 01/01: Merge pull request #1671 from apache/dependabot/maven/aws.version-1.12.684
This is an automated email from the ASF dual-hosted git repository.

tilman pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 96fd5fd6cff4366edd0f1136106fa1ee4916ec5f
Merge: eac6f090b 9ea184af5
Author: Tilman Hausherr
AuthorDate: Thu Mar 21 07:19:17 2024 +0100

    Merge pull request #1671 from apache/dependabot/maven/aws.version-1.12.684

    Bump aws.version from 1.12.683 to 1.12.684

 tika-parent/pom.xml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
(tika) branch main updated (eac6f090b -> 96fd5fd6c)
This is an automated email from the ASF dual-hosted git repository.

tilman pushed a change to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git

    from eac6f090b Merge pull request #1668 from apache/dependabot/maven/aws.version-1.12.683
     add 9ea184af5 Bump aws.version from 1.12.683 to 1.12.684
     new 96fd5fd6c Merge pull request #1671 from apache/dependabot/maven/aws.version-1.12.684

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails. The revisions
listed as "add" were already present in the repository and have only
been added to this reference.

Summary of changes:
 tika-parent/pom.xml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)