This is an automated email from the ASF dual-hosted git repository.
ndipiazza pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git
The following commit(s) were added to refs/heads/main by this push:
new b5aaa897e TIKA-4600: Add E2E tests for tika-grpc (#2500)
b5aaa897e is described below
commit b5aaa897eb8c0073bd7cd51ec2d93401a7bff141
Author: Nicholas DiPiazza <[email protected]>
AuthorDate: Mon Dec 29 04:46:35 2025 -0600
TIKA-4600: Add E2E tests for tika-grpc (#2500)
* TIKA-4600: Add E2E tests for tika-grpc
- Created tika-e2e-tests/ as standalone module (not in parent POM)
- Integrated tika-grpc-e2e-test as tika-e2e-tests/tika-grpc
- Added parent POM with shared dependency management
- Included sample configurations for various scenarios
- Tests use Testcontainers and GovDocs1 corpus
- Tests validate filesystem fetcher and Ignite config store
- Module can be built and tested independently
* Remove JIRA references and future modules from READMEs
---
tika-e2e-tests/README.md | 59 +++++
tika-e2e-tests/pom.xml | 143 ++++++++++
tika-e2e-tests/tika-grpc/README.md | 144 +++++++++++
tika-e2e-tests/tika-grpc/pom.xml | 130 ++++++++++
.../tika/parser/ocr/TesseractOCRConfig.properties | 25 ++
.../customocr/tika-config-inline.json | 26 ++
.../customocr/tika-config-inline.xml | 49 ++++
.../customocr/tika-config-rendered.json | 28 ++
.../customocr/tika-config-rendered.xml | 55 ++++
.../tika/parser/journal/GrobidExtractor.properties | 16 ++
.../sample-configs/grobid/tika-config.json | 23 ++
.../sample-configs/grobid/tika-config.xml | 41 +++
.../tika-grpc/sample-configs/ignite/README.md | 117 +++++++++
.../sample-configs/ignite/tika-config-ignite.json | 24 ++
.../sample-configs/ner/run_tika_server.sh | 62 +++++
.../tika-grpc/sample-configs/ner/tika-config.json | 26 ++
.../tika-grpc/sample-configs/ner/tika-config.xml | 45 ++++
.../tika-grpc/sample-configs/test-simple.json | 20 ++
.../vision/inception-rest-caption.json | 18 ++
.../vision/inception-rest-caption.xml | 32 +++
.../vision/inception-rest-video.json | 18 ++
.../sample-configs/vision/inception-rest-video.xml | 32 +++
.../sample-configs/vision/inception-rest.json | 18 ++
.../sample-configs/vision/inception-rest.xml | 32 +++
.../org/apache/tika/pipes/ExternalTestBase.java | 183 +++++++++++++
.../pipes/filesystem/FileSystemFetcherTest.java | 141 ++++++++++
.../tika/pipes/ignite/IgniteConfigStoreTest.java | 288 +++++++++++++++++++++
.../java/org/apache/tika/pipes/ignite/README.md | 172 ++++++++++++
.../src/test/resources/docker-compose-ignite.yml | 25 ++
.../src/test/resources/docker-compose.yml | 16 ++
.../tika-grpc/src/test/resources/log4j2.xml | 19 ++
.../src/test/resources/tika-config-ignite.json | 52 ++++
.../tika-grpc/src/test/resources/tika-config.json | 25 ++
33 files changed, 2104 insertions(+)
diff --git a/tika-e2e-tests/README.md b/tika-e2e-tests/README.md
new file mode 100644
index 000000000..8c419571a
--- /dev/null
+++ b/tika-e2e-tests/README.md
@@ -0,0 +1,59 @@
+# Apache Tika End-to-End Tests
+
+End-to-end integration tests for Apache Tika components.
+
+## Overview
+
+This module contains standalone end-to-end (E2E) tests for various Apache Tika
distribution formats and deployment modes. Unlike unit and integration tests in
the main Tika build, these E2E tests validate complete deployment scenarios
using Docker containers and real-world test data.
+
+**Note:** This module is intentionally **NOT** included in the main Tika
parent POM. It is designed to be built and run independently to avoid slowing
down the primary build process.
+
+## Test Modules
+
+- **tika-grpc** - E2E tests for tika-grpc server
+
+## Prerequisites
+
+- Java 17 or later
+- Maven 3.6 or later
+- Docker and Docker Compose
+- Internet connection (for downloading test documents)
+
+## Building All E2E Tests
+
+From this directory:
+
+```bash
+mvn clean install
+```
+
+## Running All E2E Tests
+
+```bash
+mvn test
+```
+
+## Running Specific Test Module
+
+```bash
+cd tika-grpc
+mvn test
+```
+
+## Why Standalone?
+
+The E2E tests are kept separate from the main build because they:
+
+- Have different build requirements (Docker, Testcontainers)
+- Take significantly longer to run than unit tests
+- Require external resources (test corpora, Docker images)
+- Can be run independently in CI/CD pipelines
+- Allow developers to run them selectively
+
+## Integration with CI/CD
+
+These tests can be integrated into the release pipeline as a separate step.
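+
+One possible shape for such a step, shown here only as a sketch (the working
+directory and the document limit are illustrative assumptions, not part of this
+commit):
+
+```bash
+# Run only the gRPC E2E tests against a small sample of the corpus
+cd tika-e2e-tests/tika-grpc
+mvn -B test -Dcorpa.numdocs=25
+```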
+
+## License
+
+Licensed under the Apache License, Version 2.0. See the main Tika LICENSE.txt
file for details.
diff --git a/tika-e2e-tests/pom.xml b/tika-e2e-tests/pom.xml
new file mode 100644
index 000000000..67d565c87
--- /dev/null
+++ b/tika-e2e-tests/pom.xml
@@ -0,0 +1,143 @@
+<?xml version="1.0" encoding="UTF-8"?>
+
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements. See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership. The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied. See the License for the
+ specific language governing permissions and limitations
+ under the License.
+-->
+
+<project xmlns="http://maven.apache.org/POM/4.0.0"
+ xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+ xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
https://maven.apache.org/xsd/maven-4.0.0.xsd">
+ <modelVersion>4.0.0</modelVersion>
+
+ <groupId>org.apache.tika</groupId>
+ <artifactId>tika-e2e-tests</artifactId>
+ <version>4.0.0-SNAPSHOT</version>
+ <packaging>pom</packaging>
+ <name>Apache Tika End-to-End Tests</name>
+ <description>End-to-end integration tests for Apache Tika
components</description>
+
+ <properties>
+ <maven.compiler.source>17</maven.compiler.source>
+ <maven.compiler.target>17</maven.compiler.target>
+ <maven.compiler.release>17</maven.compiler.release>
+ <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
+
+ <!-- Tika version -->
+ <tika.version>4.0.0-SNAPSHOT</tika.version>
+
+ <!-- Test dependencies -->
+ <junit.version>5.11.4</junit.version>
+ <testcontainers.version>1.20.4</testcontainers.version>
+
+ <!-- Logging -->
+ <slf4j.version>2.0.16</slf4j.version>
+ <log4j.version>2.24.3</log4j.version>
+
+ <!-- Other -->
+ <lombok.version>1.18.32</lombok.version>
+ <jackson.version>2.18.2</jackson.version>
+ </properties>
+
+ <modules>
+ <module>tika-grpc</module>
+ </modules>
+
+ <dependencyManagement>
+ <dependencies>
+ <!-- JUnit 5 -->
+ <dependency>
+ <groupId>org.junit.jupiter</groupId>
+ <artifactId>junit-jupiter-engine</artifactId>
+ <version>${junit.version}</version>
+ <scope>test</scope>
+ </dependency>
+ <dependency>
+ <groupId>org.junit.jupiter</groupId>
+ <artifactId>junit-jupiter-api</artifactId>
+ <version>${junit.version}</version>
+ <scope>test</scope>
+ </dependency>
+
+ <!-- Testcontainers -->
+ <dependency>
+ <groupId>org.testcontainers</groupId>
+ <artifactId>testcontainers</artifactId>
+ <version>${testcontainers.version}</version>
+ <scope>test</scope>
+ </dependency>
+ <dependency>
+ <groupId>org.testcontainers</groupId>
+ <artifactId>junit-jupiter</artifactId>
+ <version>${testcontainers.version}</version>
+ <scope>test</scope>
+ </dependency>
+
+ <!-- Logging -->
+ <dependency>
+ <groupId>org.apache.logging.log4j</groupId>
+ <artifactId>log4j-core</artifactId>
+ <version>${log4j.version}</version>
+ </dependency>
+ <dependency>
+ <groupId>org.apache.logging.log4j</groupId>
+ <artifactId>log4j-slf4j2-impl</artifactId>
+ <version>${log4j.version}</version>
+ </dependency>
+ <dependency>
+ <groupId>org.slf4j</groupId>
+ <artifactId>slf4j-api</artifactId>
+ <version>${slf4j.version}</version>
+ </dependency>
+
+ <!-- Jackson for JSON -->
+ <dependency>
+ <groupId>com.fasterxml.jackson.core</groupId>
+ <artifactId>jackson-databind</artifactId>
+ <version>${jackson.version}</version>
+ </dependency>
+
+ <!-- Lombok -->
+ <dependency>
+ <groupId>org.projectlombok</groupId>
+ <artifactId>lombok</artifactId>
+ <version>${lombok.version}</version>
+ <optional>true</optional>
+ </dependency>
+ </dependencies>
+ </dependencyManagement>
+
+ <build>
+ <pluginManagement>
+ <plugins>
+ <plugin>
+ <groupId>org.apache.maven.plugins</groupId>
+ <artifactId>maven-compiler-plugin</artifactId>
+ <version>3.13.0</version>
+ <configuration>
+ <release>17</release>
+ </configuration>
+ </plugin>
+ <plugin>
+ <groupId>org.apache.maven.plugins</groupId>
+ <artifactId>maven-surefire-plugin</artifactId>
+ <version>3.5.2</version>
+ </plugin>
+ </plugins>
+ </pluginManagement>
+ </build>
+</project>
diff --git a/tika-e2e-tests/tika-grpc/README.md
b/tika-e2e-tests/tika-grpc/README.md
new file mode 100644
index 000000000..12d3fca1b
--- /dev/null
+++ b/tika-e2e-tests/tika-grpc/README.md
@@ -0,0 +1,144 @@
+# Tika gRPC End-to-End Tests
+
+End-to-end integration tests for Apache Tika gRPC Server using Testcontainers.
+
+## Overview
+
+This test module validates the functionality of Apache Tika gRPC Server by:
+- Starting a tika-grpc Docker container using Docker Compose
+- Loading test documents from the GovDocs1 corpus
+- Testing various fetchers (filesystem, Ignite config store, etc.)
+- Verifying parsing results and metadata extraction
+
+## Prerequisites
+
+- Java 17 or later
+- Maven 3.6 or later
+- Docker and Docker Compose
+- Internet connection (for downloading test documents)
+- Docker image `apache/tika-grpc:local` (see below)
+
+## Building
+
+```bash
+mvn clean install
+```
+
+## Running Tests
+
+### Run all tests
+
+```bash
+mvn test
+```
+
+### Run specific test
+
+```bash
+mvn test -Dtest=FileSystemFetcherTest
+mvn test -Dtest=IgniteConfigStoreTest
+```
+
+### Configure test document range
+
+By default, only the first batch of GovDocs1 documents (001.zip) is
downloaded. To test with more documents:
+
+```bash
+mvn test -Dgovdocs1.fromIndex=1 -Dgovdocs1.toIndex=5
+```
+
+This will download and test with batches 001.zip through 005.zip.
+
+### Limit number of documents to process
+
+To limit the test to only process a specific number of documents (useful for
quick testing):
+
+```bash
+mvn test -Dcorpa.numdocs=10
+```
+
+This will process only the first 10 documents instead of all documents in the
corpus. Omit this parameter or set it to -1 to process all documents.
+
+**Examples:**
+
+```bash
+# Test with just 5 documents
+mvn test -Dcorpa.numdocs=5
+
+# Test with 100 documents from multiple batches
+mvn test -Dgovdocs1.fromIndex=1 -Dgovdocs1.toIndex=2 -Dcorpa.numdocs=100
+
+# Test all documents (default behavior)
+mvn test
+```
+
+## Test Structure
+
+- `ExternalTestBase.java` - Base class for all tests (see the sketch after this list)
+ - Manages Docker Compose containers
+ - Downloads and extracts GovDocs1 test corpus
+ - Provides utility methods for gRPC communication
+
+- `filesystem/FileSystemFetcherTest.java` - Tests for filesystem fetcher
+ - Tests fetching and parsing files from local filesystem
+ - Verifies all documents are processed
+
+- `ignite/IgniteConfigStoreTest.java` - Tests for Ignite config store
+ - Tests configuration storage and retrieval via Ignite
+ - Validates config persistence
+
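+As a sketch of how these pieces fit together, a test typically extends
+`ExternalTestBase`, opens a channel to the container, and registers a fetcher
+over gRPC (the class name and fetcher id below are illustrative;
+`FileSystemFetcherTest` shows the full streaming flow):
+
+```java
+import io.grpc.ManagedChannel;
+import org.apache.tika.SaveFetcherReply;
+import org.apache.tika.SaveFetcherRequest;
+import org.apache.tika.TikaGrpc;
+import org.apache.tika.pipes.ExternalTestBase;
+import org.apache.tika.pipes.fetcher.fs.FileSystemFetcherConfig;
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.api.Test;
+
+class ExampleFetcherTest extends ExternalTestBase {
+
+    @Test
+    void registersAFetcher() throws Exception {
+        // Channel to the tika-grpc service started by Docker Compose
+        ManagedChannel channel = getManagedChannel();
+        TikaGrpc.TikaBlockingStub stub = TikaGrpc.newBlockingStub(channel);
+
+        // Point a filesystem fetcher at the corpus mounted into the container
+        FileSystemFetcherConfig config = new FileSystemFetcherConfig();
+        config.setBasePath(GOV_DOCS_FOLDER);
+
+        SaveFetcherReply reply = stub.saveFetcher(SaveFetcherRequest.newBuilder()
+                .setFetcherId("example-fetcher")
+                .setFetcherClass("org.apache.tika.pipes.fetcher.fs.FileSystemFetcher")
+                .setFetcherConfigJson(OBJECT_MAPPER.writeValueAsString(config))
+                .build());
+        Assertions.assertEquals("example-fetcher", reply.getFetcherId());
+
+        channel.shutdownNow();
+    }
+}
+```
+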
+## GovDocs1 Test Corpus
+
+The tests use the [GovDocs1](https://digitalcorpora.org/corpora/govdocs)
corpus, a collection of real-world documents from US government websites.
Documents are automatically downloaded and cached in `target/govdocs1/`.
+
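+To pre-seed the cache so the tests skip the download step, the first batch can
+be fetched manually. A sketch that assumes `wget` and `unzip` are on the PATH
+(the URL and layout follow `ExternalTestBase`):
+
+```bash
+mkdir -p target/govdocs1
+wget -P target/govdocs1 https://corp.digitalcorpora.org/corpora/files/govdocs1/zipfiles/001.zip
+unzip -o target/govdocs1/001.zip -d target/govdocs1
+```
+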
+## Docker Image
+
+The tests expect a Docker image named `apache/tika-grpc:local`. Build one
using:
+
+```bash
+cd /path/to/tika-docker/tika-grpc
+./build-from-branch.sh -l /path/to/tika -t local
+```
+
+Or build from the main Tika repository and tag it:
+
+```bash
+cd /path/to/tika
+mvn clean install -DskipTests
+cd tika-grpc
+# Follow tika-grpc Docker build instructions
+```
+
+## Sample Configurations
+
+The `sample-configs/` directory contains example Tika configuration files for
various scenarios (a run sketch follows this list):
+- `customocr/` - Custom OCR configurations
+- `grobid/` - GROBID PDF parsing configuration
+- `ignite/` - Ignite config store examples
+- `ner/` - Named Entity Recognition configuration
+- `vision/` - Computer vision and image analysis configs
+
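+Any of these can be mounted into a standalone container in the same way as the
+Ignite example in `sample-configs/ignite/README.md`. A sketch (the image tag is
+an assumption; use whatever tag you built):
+
+```bash
+docker run -p 50052:50052 \
+  -v $(pwd)/sample-configs/customocr/tika-config-inline.json:/config/tika-config.json \
+  apache/tika-grpc:local \
+  -c /config/tika-config.json
+```
+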
+## Logs
+
+Test logs are output to console. Docker container logs are also captured and
displayed.
+
+## Troubleshooting
+
+**Container fails to start:**
+- Ensure Docker is running
+- Check that port 50052 is available
+- Verify the `apache/tika-grpc:local` image exists: `docker images | grep
tika-grpc`
+
+**Tests timeout:**
+- Increase timeout in test class
+- Check Docker container logs for errors
+- Ensure sufficient memory is available to Docker
+
+**Download failures:**
+- Check internet connection
+- GovDocs1 files are downloaded from digitalcorpora.org
+- Downloaded files are cached in `target/govdocs1/`
+
+## License
+
+Licensed under the Apache License, Version 2.0. See the main Tika LICENSE.txt
file for details.
diff --git a/tika-e2e-tests/tika-grpc/pom.xml b/tika-e2e-tests/tika-grpc/pom.xml
new file mode 100644
index 000000000..7148c37b8
--- /dev/null
+++ b/tika-e2e-tests/tika-grpc/pom.xml
@@ -0,0 +1,130 @@
+<?xml version="1.0" encoding="UTF-8"?>
+
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements. See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership. The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied. See the License for the
+ specific language governing permissions and limitations
+ under the License.
+-->
+
+<project xmlns="http://maven.apache.org/POM/4.0.0"
+ xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+ xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
https://maven.apache.org/xsd/maven-4.0.0.xsd">
+ <modelVersion>4.0.0</modelVersion>
+
+ <parent>
+ <groupId>org.apache.tika</groupId>
+ <artifactId>tika-e2e-tests</artifactId>
+ <version>4.0.0-SNAPSHOT</version>
+ </parent>
+
+ <artifactId>tika-grpc-e2e-test</artifactId>
+ <name>Apache Tika gRPC End-to-End Tests</name>
+ <description>End-to-end tests for Apache Tika gRPC Server using test
containers</description>
+
+ <dependencies>
+ <!-- Tika gRPC -->
+ <dependency>
+ <groupId>org.apache.tika</groupId>
+ <artifactId>tika-grpc</artifactId>
+ <version>${tika.version}</version>
+ </dependency>
+
+ <!-- Tika Fetchers -->
+ <dependency>
+ <groupId>org.apache.tika</groupId>
+ <artifactId>tika-pipes-file-system</artifactId>
+ <version>${tika.version}</version>
+ </dependency>
+ <dependency>
+ <groupId>org.apache.tika</groupId>
+ <artifactId>tika-pipes-core</artifactId>
+ <version>${tika.version}</version>
+ </dependency>
+
+ <!-- Jackson for JSON -->
+ <dependency>
+ <groupId>com.fasterxml.jackson.core</groupId>
+ <artifactId>jackson-databind</artifactId>
+ </dependency>
+
+ <!-- Lombok -->
+ <dependency>
+ <groupId>org.projectlombok</groupId>
+ <artifactId>lombok</artifactId>
+ <optional>true</optional>
+ </dependency>
+
+ <!-- JUnit 5 -->
+ <dependency>
+ <groupId>org.junit.jupiter</groupId>
+ <artifactId>junit-jupiter-engine</artifactId>
+ <scope>test</scope>
+ </dependency>
+ <dependency>
+ <groupId>org.junit.jupiter</groupId>
+ <artifactId>junit-jupiter-api</artifactId>
+ <scope>test</scope>
+ </dependency>
+
+ <!-- Testcontainers -->
+ <dependency>
+ <groupId>org.testcontainers</groupId>
+ <artifactId>testcontainers</artifactId>
+ <scope>test</scope>
+ </dependency>
+ <dependency>
+ <groupId>org.testcontainers</groupId>
+ <artifactId>junit-jupiter</artifactId>
+ <scope>test</scope>
+ </dependency>
+
+ <!-- Logging -->
+ <dependency>
+ <groupId>org.apache.logging.log4j</groupId>
+ <artifactId>log4j-core</artifactId>
+ </dependency>
+ <dependency>
+ <groupId>org.apache.logging.log4j</groupId>
+ <artifactId>log4j-slf4j2-impl</artifactId>
+ </dependency>
+ <dependency>
+ <groupId>org.slf4j</groupId>
+ <artifactId>slf4j-api</artifactId>
+ </dependency>
+ </dependencies>
+
+ <build>
+ <plugins>
+ <plugin>
+ <groupId>org.apache.maven.plugins</groupId>
+ <artifactId>maven-compiler-plugin</artifactId>
+ </plugin>
+ <plugin>
+ <groupId>org.apache.maven.plugins</groupId>
+ <artifactId>maven-surefire-plugin</artifactId>
+ <configuration>
+ <includes>
+ <include>**/*Test.java</include>
+ </includes>
+ <systemPropertyVariables>
+ <govdocs1.fromIndex>1</govdocs1.fromIndex>
+ <govdocs1.toIndex>1</govdocs1.toIndex>
+ </systemPropertyVariables>
+ </configuration>
+ </plugin>
+ </plugins>
+ </build>
+</project>
diff --git
a/tika-e2e-tests/tika-grpc/sample-configs/customocr/org/apache/tika/parser/ocr/TesseractOCRConfig.properties
b/tika-e2e-tests/tika-grpc/sample-configs/customocr/org/apache/tika/parser/ocr/TesseractOCRConfig.properties
new file mode 100644
index 000000000..b4b787ffc
--- /dev/null
+++
b/tika-e2e-tests/tika-grpc/sample-configs/customocr/org/apache/tika/parser/ocr/TesseractOCRConfig.properties
@@ -0,0 +1,25 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# You can customise or add the settings you want here
+language=eng+spa+fra+deu+ita
+timeout=240
+minFileSizeToOcr=1
+enableImageProcessing=0
+density=200
+depth=8
+filter=box
+resize=300
+applyRotation=true
\ No newline at end of file
diff --git
a/tika-e2e-tests/tika-grpc/sample-configs/customocr/tika-config-inline.json
b/tika-e2e-tests/tika-grpc/sample-configs/customocr/tika-config-inline.json
new file mode 100644
index 000000000..cadb8db5a
--- /dev/null
+++ b/tika-e2e-tests/tika-grpc/sample-configs/customocr/tika-config-inline.json
@@ -0,0 +1,26 @@
+{
+ "async": {
+ "staleFetcherTimeoutSeconds": 600,
+ "staleFetcherDelaySeconds": 60
+ },
+ "pipes": {
+ "numClients": 2,
+ "forkedJvmArgs": [
+ "-Xmx1g",
+ "-XX:ParallelGCThreads=2"
+ ],
+ "timeoutMillis": 60000,
+ "maxForEmitBatchBytes": -1
+ },
+ "parsers": [
+ {
+ "class": "org.apache.tika.parser.ocr.TesseractOCRParser"
+ },
+ {
+ "class": "org.apache.tika.parser.pdf.PDFParser",
+ "params": {
+ "extractInlineImages": true
+ }
+ }
+ ]
+}
diff --git
a/tika-e2e-tests/tika-grpc/sample-configs/customocr/tika-config-inline.xml
b/tika-e2e-tests/tika-grpc/sample-configs/customocr/tika-config-inline.xml
new file mode 100644
index 000000000..7568863b0
--- /dev/null
+++ b/tika-e2e-tests/tika-grpc/sample-configs/customocr/tika-config-inline.xml
@@ -0,0 +1,49 @@
+<?xml version="1.0" encoding="UTF-8" standalone="no"?>
+<!--
+ ~ Licensed to the Apache Software Foundation (ASF) under one or more
+ ~ contributor license agreements. See the NOTICE file distributed with
+ ~ this work for additional information regarding copyright ownership.
+ ~ The ASF licenses this file to You under the Apache License, Version 2.0
+ ~ (the "License"); you may not use this file except in compliance with
+ ~ the License. You may obtain a copy of the License at
+ ~
+ ~ http://www.apache.org/licenses/LICENSE-2.0
+ ~
+ ~ Unless required by applicable law or agreed to in writing, software
+ ~ distributed under the License is distributed on an "AS IS" BASIS,
+ ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ ~ See the License for the specific language governing permissions and
+ ~ limitations under the License.
+ -->
+<properties>
+ <async>
+ <staleFetcherTimeoutSeconds>600</staleFetcherTimeoutSeconds>
+ <staleFetcherDelaySeconds>60</staleFetcherDelaySeconds>
+ </async>
+ <pipes>
+ <params>
+ <numClients>2</numClients>
+ <forkedJvmArgs>
+ <arg>-Xmx1g</arg>
+ <arg>-XX:ParallelGCThreads=2</arg>
+ </forkedJvmArgs>
+ <timeoutMillis>60000</timeoutMillis>
+ <maxForEmitBatchBytes>-1</maxForEmitBatchBytes> <!-- disable emit
-->
+ </params>
+ </pipes>
+ <fetchers>
+ </fetchers>
+
+ <parsers>
+ <!-- Load TesseractOCRParser (could use DefaultParser if you want others
too) -->
+ <parser class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
+
+ <!-- Extract and OCR Inline Images in PDF -->
+ <parser class="org.apache.tika.parser.pdf.PDFParser">
+ <params>
+ <param name="extractInlineImages" type="bool">true</param>
+ </params>
+ </parser>
+
+ </parsers>
+</properties>
diff --git
a/tika-e2e-tests/tika-grpc/sample-configs/customocr/tika-config-rendered.json
b/tika-e2e-tests/tika-grpc/sample-configs/customocr/tika-config-rendered.json
new file mode 100644
index 000000000..a3d854589
--- /dev/null
+++
b/tika-e2e-tests/tika-grpc/sample-configs/customocr/tika-config-rendered.json
@@ -0,0 +1,28 @@
+{
+ "async": {
+ "staleFetcherTimeoutSeconds": 600,
+ "staleFetcherDelaySeconds": 60
+ },
+ "pipes": {
+ "numClients": 2,
+ "forkedJvmArgs": [
+ "-Xmx1g",
+ "-XX:ParallelGCThreads=2"
+ ],
+ "timeoutMillis": 60000,
+ "maxForEmitBatchBytes": -1
+ },
+ "parsers": [
+ {
+ "class": "org.apache.tika.parser.ocr.TesseractOCRParser"
+ },
+ {
+ "class": "org.apache.tika.parser.pdf.PDFParser",
+ "params": {
+ "ocrStrategy": "ocr_only",
+ "ocrImageType": "rgb",
+ "ocrDPI": 100
+ }
+ }
+ ]
+}
diff --git
a/tika-e2e-tests/tika-grpc/sample-configs/customocr/tika-config-rendered.xml
b/tika-e2e-tests/tika-grpc/sample-configs/customocr/tika-config-rendered.xml
new file mode 100644
index 000000000..af308eb71
--- /dev/null
+++ b/tika-e2e-tests/tika-grpc/sample-configs/customocr/tika-config-rendered.xml
@@ -0,0 +1,55 @@
+<?xml version="1.0" encoding="UTF-8" standalone="no"?>
+<!--
+ ~ Licensed to the Apache Software Foundation (ASF) under one or more
+ ~ contributor license agreements. See the NOTICE file distributed with
+ ~ this work for additional information regarding copyright ownership.
+ ~ The ASF licenses this file to You under the Apache License, Version 2.0
+ ~ (the "License"); you may not use this file except in compliance with
+ ~ the License. You may obtain a copy of the License at
+ ~
+ ~ http://www.apache.org/licenses/LICENSE-2.0
+ ~
+ ~ Unless required by applicable law or agreed to in writing, software
+ ~ distributed under the License is distributed on an "AS IS" BASIS,
+ ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ ~ See the License for the specific language governing permissions and
+ ~ limitations under the License.
+ -->
+<properties>
+ <async>
+ <staleFetcherTimeoutSeconds>600</staleFetcherTimeoutSeconds>
+ <staleFetcherDelaySeconds>60</staleFetcherDelaySeconds>
+ </async>
+ <pipes>
+ <params>
+ <numClients>2</numClients>
+ <forkedJvmArgs>
+ <arg>-Xmx1g</arg>
+ <arg>-XX:ParallelGCThreads=2</arg>
+ </forkedJvmArgs>
+ <timeoutMillis>60000</timeoutMillis>
+ <maxForEmitBatchBytes>-1</maxForEmitBatchBytes> <!-- disable emit
-->
+ </params>
+ </pipes>
+ <fetchers>
+ </fetchers>
+ <parsers>
+ <!-- Load TesseractOCRParser (could use DefaultParser if you want
others too) -->
+ <parser class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
+
+ <!-- OCR on Rendered Pages -->
+ <parser class="org.apache.tika.parser.pdf.PDFParser">
+ <params>
+ <!-- no_ocr - extract text only
+ ocr_only - don't extract text and just attempt OCR
+ ocr_and_text - extract text and attempt OCR (from Tika
1.24)
+ auto - extract text but if < 10 characters try OCR
+ -->
+ <param name="ocrStrategy" type="string">ocr_only</param>
+ <param name="ocrImageType" type="string">rgb</param>
+ <param name="ocrDPI" type="int">100</param>
+ </params>
+ </parser>
+
+ </parsers>
+</properties>
diff --git
a/tika-e2e-tests/tika-grpc/sample-configs/grobid/org/apache/tika/parser/journal/GrobidExtractor.properties
b/tika-e2e-tests/tika-grpc/sample-configs/grobid/org/apache/tika/parser/journal/GrobidExtractor.properties
new file mode 100644
index 000000000..44689a2bb
--- /dev/null
+++
b/tika-e2e-tests/tika-grpc/sample-configs/grobid/org/apache/tika/parser/journal/GrobidExtractor.properties
@@ -0,0 +1,16 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+grobid.server.url=http://grobid:8070
\ No newline at end of file
diff --git a/tika-e2e-tests/tika-grpc/sample-configs/grobid/tika-config.json
b/tika-e2e-tests/tika-grpc/sample-configs/grobid/tika-config.json
new file mode 100644
index 000000000..740a17d2a
--- /dev/null
+++ b/tika-e2e-tests/tika-grpc/sample-configs/grobid/tika-config.json
@@ -0,0 +1,23 @@
+{
+ "async": {
+ "staleFetcherTimeoutSeconds": 600,
+ "staleFetcherDelaySeconds": 60
+ },
+ "pipes": {
+ "numClients": 2,
+ "forkedJvmArgs": [
+ "-Xmx1g",
+ "-XX:ParallelGCThreads=2"
+ ],
+ "timeoutMillis": 60000,
+ "maxForEmitBatchBytes": -1
+ },
+ "parsers": [
+ {
+ "class": "org.apache.tika.parser.journal.JournalParser",
+ "supportedMimeTypes": [
+ "application/pdf"
+ ]
+ }
+ ]
+}
diff --git a/tika-e2e-tests/tika-grpc/sample-configs/grobid/tika-config.xml
b/tika-e2e-tests/tika-grpc/sample-configs/grobid/tika-config.xml
new file mode 100644
index 000000000..1974ce476
--- /dev/null
+++ b/tika-e2e-tests/tika-grpc/sample-configs/grobid/tika-config.xml
@@ -0,0 +1,41 @@
+<?xml version="1.0" encoding="UTF-8" standalone="no"?>
+<!--
+ ~ Licensed to the Apache Software Foundation (ASF) under one or more
+ ~ contributor license agreements. See the NOTICE file distributed with
+ ~ this work for additional information regarding copyright ownership.
+ ~ The ASF licenses this file to You under the Apache License, Version 2.0
+ ~ (the "License"); you may not use this file except in compliance with
+ ~ the License. You may obtain a copy of the License at
+ ~
+ ~ http://www.apache.org/licenses/LICENSE-2.0
+ ~
+ ~ Unless required by applicable law or agreed to in writing, software
+ ~ distributed under the License is distributed on an "AS IS" BASIS,
+ ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ ~ See the License for the specific language governing permissions and
+ ~ limitations under the License.
+ -->
+<properties>
+ <async>
+ <staleFetcherTimeoutSeconds>600</staleFetcherTimeoutSeconds>
+ <staleFetcherDelaySeconds>60</staleFetcherDelaySeconds>
+ </async>
+ <pipes>
+ <params>
+ <numClients>2</numClients>
+ <forkedJvmArgs>
+ <arg>-Xmx1g</arg>
+ <arg>-XX:ParallelGCThreads=2</arg>
+ </forkedJvmArgs>
+ <timeoutMillis>60000</timeoutMillis>
+ <maxForEmitBatchBytes>-1</maxForEmitBatchBytes> <!-- disable emit
-->
+ </params>
+ </pipes>
+ <fetchers>
+ </fetchers>
+ <parsers>
+ <parser class="org.apache.tika.parser.journal.JournalParser">
+ <mime>application/pdf</mime>
+ </parser>
+ </parsers>
+</properties>
diff --git a/tika-e2e-tests/tika-grpc/sample-configs/ignite/README.md
b/tika-e2e-tests/tika-grpc/sample-configs/ignite/README.md
new file mode 100644
index 000000000..95305375d
--- /dev/null
+++ b/tika-e2e-tests/tika-grpc/sample-configs/ignite/README.md
@@ -0,0 +1,117 @@
+# Apache Ignite ConfigStore Configuration
+
+This directory contains sample configurations for running tika-grpc with
Apache Ignite distributed configuration storage.
+
+## Building the Image
+
+To build a Docker image from the TIKA-4583 branch with Ignite support:
+
+```bash
+./build-from-branch.sh -b TIKA-4583-ignite-config-store -i -t ignite-test
+```
+
+## Running Standalone
+
+Run a single instance with Ignite (useful for testing):
+
+```bash
+docker run -p 50052:50052 \
+ -v
$(pwd)/sample-configs/ignite/tika-config-ignite.json:/config/tika-config.json \
+ apache/tika-grpc:ignite-test \
+ -c /config/tika-config.json
+```
+
+## Running in Docker Compose (Clustered)
+
+Create a `docker-compose.yml`:
+
+```yaml
+version: '3.8'
+
+services:
+ tika-grpc-1:
+ image: apache/tika-grpc:ignite-test
+ ports:
+ - "50052:50052"
+ volumes:
+ -
./sample-configs/ignite/tika-config-ignite.json:/config/tika-config.json
+ command: ["-c", "/config/tika-config.json"]
+ networks:
+ - tika-cluster
+
+ tika-grpc-2:
+ image: apache/tika-grpc:ignite-test
+ ports:
+ - "50053:50052"
+ volumes:
+ -
./sample-configs/ignite/tika-config-ignite.json:/config/tika-config.json
+ command: ["-c", "/config/tika-config.json"]
+ networks:
+ - tika-cluster
+
+ tika-grpc-3:
+ image: apache/tika-grpc:ignite-test
+ ports:
+ - "50054:50052"
+ volumes:
+ -
./sample-configs/ignite/tika-config-ignite.json:/config/tika-config.json
+ command: ["-c", "/config/tika-config.json"]
+ networks:
+ - tika-cluster
+
+networks:
+ tika-cluster:
+ driver: bridge
+```
+
+Start the cluster:
+
+```bash
+docker-compose up
+```
+
+## Verifying Cluster Formation
+
+Check the logs to verify Ignite cluster formation:
+
+```bash
+docker-compose logs | grep "Topology snapshot"
+```
+
+You should see output like:
+```
+Topology snapshot [ver=3, servers=3, clients=0, ...]
+```
+
+## Testing Configuration Sharing
+
+1. Create a fetcher on one server:
+```bash
+# Add fetcher to server 1 (port 50052)
+grpcurl -d '{"fetcher_config":
"{\"id\":\"shared-fetcher\",\"name\":\"file-system\",\"params\":{\"basePath\":\"/data\"}}"}'
\
+ -plaintext localhost:50052 tika.Tika/SaveFetcher
+```
+
+2. Retrieve it from another server:
+```bash
+# Get fetcher from server 2 (port 50053)
+grpcurl -d '{"fetcher_id": "shared-fetcher"}' \
+ -plaintext localhost:50053 tika.Tika/GetFetcher
+```
+
+The fetcher should be available on all servers in the cluster!
+
+## Configuration Options
+
+Edit `tika-config-ignite.json` to customize (a sketch follows this table):
+
+| Parameter | Description | Default |
+|-----------|-------------|---------|
+| `cacheName` | Name of the Ignite cache | `tika-config-store` |
+| `cacheMode` | Cache mode (REPLICATED or PARTITIONED) | `REPLICATED` |
+| `igniteInstanceName` | Ignite instance name | `TikaIgniteCluster` |
+| `autoClose` | Auto-close Ignite on shutdown | `true` |
+
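+These values are passed through `configStoreParams` as an embedded JSON string.
+A sketch of a PARTITIONED variant (only the values differ from the shipped
+`tika-config-ignite.json`):
+
+```json
+{
+  "pipes": {
+    "configStoreType": "ignite",
+    "configStoreParams": "{ \"cacheName\": \"tika-config-store\", \"cacheMode\": \"PARTITIONED\", \"igniteInstanceName\": \"TikaIgniteCluster\", \"autoClose\": true }"
+  }
+}
+```
+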
+## Kubernetes Deployment
+
+See the main [Ignite ConfigStore
README](https://github.com/apache/tika/tree/TIKA-4583-ignite-config-store/tika-pipes/tika-ignite-config-store#kubernetes-deployment)
for comprehensive Kubernetes deployment instructions.
diff --git
a/tika-e2e-tests/tika-grpc/sample-configs/ignite/tika-config-ignite.json
b/tika-e2e-tests/tika-grpc/sample-configs/ignite/tika-config-ignite.json
new file mode 100644
index 000000000..69da03028
--- /dev/null
+++ b/tika-e2e-tests/tika-grpc/sample-configs/ignite/tika-config-ignite.json
@@ -0,0 +1,24 @@
+{
+ "pipes": {
+ "configStoreType": "ignite",
+ "configStoreParams": "{\n \"cacheName\": \"tika-config-store\",\n
\"cacheMode\": \"REPLICATED\",\n \"igniteInstanceName\":
\"TikaIgniteCluster\",\n \"autoClose\": true\n }"
+ },
+ "fetchers": [
+ {
+ "id": "fs",
+ "name": "file-system",
+ "params": {
+ "basePath": "/data/input"
+ }
+ }
+ ],
+ "emitters": [
+ {
+ "id": "fs",
+ "name": "file-system",
+ "params": {
+ "basePath": "/data/output"
+ }
+ }
+ ]
+}
diff --git a/tika-e2e-tests/tika-grpc/sample-configs/ner/run_tika_server.sh
b/tika-e2e-tests/tika-grpc/sample-configs/ner/run_tika_server.sh
new file mode 100755
index 000000000..4f81d7a00
--- /dev/null
+++ b/tika-e2e-tests/tika-grpc/sample-configs/ner/run_tika_server.sh
@@ -0,0 +1,62 @@
+#!/bin/bash
+
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+#############################################################################
+# See https://cwiki.apache.org/confluence/display/TIKA/TikaAndNER for details
+# on how to configure additional NER libraries
+#############################################################################
+
+# ------------------------------------
+# Download OpenNLP Models to classpath
+# ------------------------------------
+
+OPENNLP_LOCATION="/ner/org/apache/tika-grpc/parser/ner/opennlp"
+URL="http://opennlp.sourceforge.net/models-1.5"
+
+mkdir -p $OPENNLP_LOCATION
+if [ "$(ls -A $OPENNLP_LOCATION/*.bin)" ]; then
+ echo "OpenNLP models directory has files, so skipping fetch";
+else
+ echo "No OpenNLP models found, so fetching them"
+ wget "$URL/en-ner-person.bin" -O $OPENNLP_LOCATION/ner-person.bin
+ wget "$URL/en-ner-location.bin" -O $OPENNLP_LOCATION/ner-location.bin
+ wget "$URL/en-ner-organization.bin" -O
$OPENNLP_LOCATION/ner-organization.bin;
+ wget "$URL/en-ner-date.bin" -O $OPENNLP_LOCATION/ner-date.bin
+ wget "$URL/en-ner-time.bin" -O $OPENNLP_LOCATION/ner-time.bin
+ wget "$URL/en-ner-percentage.bin" -O
$OPENNLP_LOCATION/ner-percentage.bin
+ wget "$URL/en-ner-money.bin" -O $OPENNLP_LOCATION/ner-money.bin
+fi
+
+# --------------------------------------------
+# Create RegExp Example for Email on classpath
+# --------------------------------------------
+REGEXP_LOCATION="/ner/org/apache/tika-grpc/parser/ner/regex"
+mkdir -p $REGEXP_LOCATION
+echo
"EMAIL=(?:[a-z0-9!#$%&'*+/=?^_\`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_\`{|}~-]+)*|\"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*\")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])"
> $REGEXP_LOCATION/ner-regex.txt
+
+
+# -------------------
+# Now run Tika Server
+# -------------------
+
+# Can be a single implementation or a comma-separated list of implementations
for the "ner.impl.class" property
+RECOGNISERS=org.apache.tika.parser.ner.opennlp.OpenNLPNERecogniser,org.apache.tika.parser.ner.regex.RegexNERecogniser
+# Set classpath to the Tika Server JAR and the /ner folder so it has the
configuration and models from above
+CLASSPATH="/ner:/tika-server-standard-${TIKA_VERSION}.jar:/tika-extras/*"
+# Run the server with the custom configuration ner.impl.class property and
custom /ner/tika-config.xml
+exec java -Dner.impl.class=$RECOGNISERS -cp $CLASSPATH
org.apache.tika.pipes.grpc.TikaGrpcServer -c /ner/tika-config.xml
\ No newline at end of file
diff --git a/tika-e2e-tests/tika-grpc/sample-configs/ner/tika-config.json
b/tika-e2e-tests/tika-grpc/sample-configs/ner/tika-config.json
new file mode 100644
index 000000000..d984e2b3b
--- /dev/null
+++ b/tika-e2e-tests/tika-grpc/sample-configs/ner/tika-config.json
@@ -0,0 +1,26 @@
+{
+ "async": {
+ "staleFetcherTimeoutSeconds": 600,
+ "staleFetcherDelaySeconds": 60
+ },
+ "pipes": {
+ "numClients": 2,
+ "forkedJvmArgs": [
+ "-Xmx1g",
+ "-XX:ParallelGCThreads=2"
+ ],
+ "timeoutMillis": 60000,
+ "maxForEmitBatchBytes": -1
+ },
+ "parsers": [
+ {
+ "class": "org.apache.tika.parser.ner.NamedEntityParser",
+ "supportedMimeTypes": [
+ "application/pdf",
+ "text/plain",
+ "text/html",
+ "application/xhtml+xml"
+ ]
+ }
+ ]
+}
diff --git a/tika-e2e-tests/tika-grpc/sample-configs/ner/tika-config.xml
b/tika-e2e-tests/tika-grpc/sample-configs/ner/tika-config.xml
new file mode 100644
index 000000000..d9d6a2f9c
--- /dev/null
+++ b/tika-e2e-tests/tika-grpc/sample-configs/ner/tika-config.xml
@@ -0,0 +1,45 @@
+<?xml version="1.0" encoding="UTF-8" standalone="no"?>
+<!--
+ ~ Licensed to the Apache Software Foundation (ASF) under one or more
+ ~ contributor license agreements. See the NOTICE file distributed with
+ ~ this work for additional information regarding copyright ownership.
+ ~ The ASF licenses this file to You under the Apache License, Version 2.0
+ ~ (the "License"); you may not use this file except in compliance with
+ ~ the License. You may obtain a copy of the License at
+ ~
+ ~ http://www.apache.org/licenses/LICENSE-2.0
+ ~
+ ~ Unless required by applicable law or agreed to in writing, software
+ ~ distributed under the License is distributed on an "AS IS" BASIS,
+ ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ ~ See the License for the specific language governing permissions and
+ ~ limitations under the License.
+ -->
+<properties>
+ <async>
+ <staleFetcherTimeoutSeconds>600</staleFetcherTimeoutSeconds>
+ <staleFetcherDelaySeconds>60</staleFetcherDelaySeconds>
+ </async>
+ <pipes>
+ <params>
+ <numClients>2</numClients>
+ <forkedJvmArgs>
+ <arg>-Xmx1g</arg>
+ <arg>-XX:ParallelGCThreads=2</arg>
+ </forkedJvmArgs>
+ <timeoutMillis>60000</timeoutMillis>
+ <maxForEmitBatchBytes>-1</maxForEmitBatchBytes> <!--
disable emit -->
+ </params>
+ </pipes>
+ <fetchers>
+ </fetchers>
+ <parsers>
+ <parser class="org.apache.tika.parser.ner.NamedEntityParser">
+ <mime>application/pdf</mime>
+ <mime>text/plain</mime>
+ <mime>text/html</mime>
+ <mime>application/xhtml+xml</mime>
+ </parser>
+ </parsers>
+</properties>
+
diff --git a/tika-e2e-tests/tika-grpc/sample-configs/test-simple.json
b/tika-e2e-tests/tika-grpc/sample-configs/test-simple.json
new file mode 100644
index 000000000..000bb0181
--- /dev/null
+++ b/tika-e2e-tests/tika-grpc/sample-configs/test-simple.json
@@ -0,0 +1,20 @@
+{
+ "fetchers": [
+ {
+ "fs": {
+ "defaultFetcher": {
+ "basePath": "/data/input"
+ }
+ }
+ }
+ ],
+ "emitters": [
+ {
+ "fs": {
+ "defaultEmitter": {
+ "basePath": "/data/output"
+ }
+ }
+ }
+ ]
+}
diff --git
a/tika-e2e-tests/tika-grpc/sample-configs/vision/inception-rest-caption.json
b/tika-e2e-tests/tika-grpc/sample-configs/vision/inception-rest-caption.json
new file mode 100644
index 000000000..be34d0560
--- /dev/null
+++ b/tika-e2e-tests/tika-grpc/sample-configs/vision/inception-rest-caption.json
@@ -0,0 +1,18 @@
+{
+ "parsers": [
+ {
+ "class": "org.apache.tika.parser.recognition.ObjectRecognitionParser",
+ "supportedMimeTypes": [
+ "image/jpeg",
+ "image/png",
+ "image/gif"
+ ],
+ "params": {
+ "apiBaseUri": "http://inception-caption:8764/inception/v3",
+ "captions": 5,
+ "maxCaptionLength": 15,
+ "class": "org.apache.tika.parser.captioning.tf.TensorflowRESTCaptioner"
+ }
+ }
+ ]
+}
diff --git
a/tika-e2e-tests/tika-grpc/sample-configs/vision/inception-rest-caption.xml
b/tika-e2e-tests/tika-grpc/sample-configs/vision/inception-rest-caption.xml
new file mode 100644
index 000000000..c70c207b2
--- /dev/null
+++ b/tika-e2e-tests/tika-grpc/sample-configs/vision/inception-rest-caption.xml
@@ -0,0 +1,32 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+ ~ Licensed to the Apache Software Foundation (ASF) under one or more
+ ~ contributor license agreements. See the NOTICE file distributed with
+ ~ this work for additional information regarding copyright ownership.
+ ~ The ASF licenses this file to You under the Apache License, Version 2.0
+ ~ (the "License"); you may not use this file except in compliance with
+ ~ the License. You may obtain a copy of the License at
+ ~
+ ~ http://www.apache.org/licenses/LICENSE-2.0
+ ~
+ ~ Unless required by applicable law or agreed to in writing, software
+ ~ distributed under the License is distributed on an "AS IS" BASIS,
+ ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ ~ See the License for the specific language governing permissions and
+ ~ limitations under the License.
+ -->
+<properties>
+ <parsers>
+ <parser
class="org.apache.tika.parser.recognition.ObjectRecognitionParser">
+ <mime>image/jpeg</mime>
+ <mime>image/png</mime>
+ <mime>image/gif</mime>
+ <params>
+ <param name="apiBaseUri"
type="uri">http://inception-caption:8764/inception/v3</param>
+ <param name="captions" type="int">5</param>
+ <param name="maxCaptionLength" type="int">15</param>
+ <param name="class"
type="string">org.apache.tika.parser.captioning.tf.TensorflowRESTCaptioner</param>
+ </params>
+ </parser>
+ </parsers>
+</properties>
\ No newline at end of file
diff --git
a/tika-e2e-tests/tika-grpc/sample-configs/vision/inception-rest-video.json
b/tika-e2e-tests/tika-grpc/sample-configs/vision/inception-rest-video.json
new file mode 100644
index 000000000..73f5de655
--- /dev/null
+++ b/tika-e2e-tests/tika-grpc/sample-configs/vision/inception-rest-video.json
@@ -0,0 +1,18 @@
+{
+ "parsers": [
+ {
+ "class": "org.apache.tika.parser.recognition.ObjectRecognitionParser",
+ "supportedMimeTypes": [
+ "video/mp4",
+ "video/quicktime"
+ ],
+ "params": {
+ "apiBaseUri": "http://inception-video:8764/inception/v4",
+ "topN": 4,
+ "minConfidence": 0.015,
+ "mode": "fixed",
+ "class":
"org.apache.tika.parser.recognition.tf.TensorflowRESTVideoRecogniser"
+ }
+ }
+ ]
+}
diff --git
a/tika-e2e-tests/tika-grpc/sample-configs/vision/inception-rest-video.xml
b/tika-e2e-tests/tika-grpc/sample-configs/vision/inception-rest-video.xml
new file mode 100644
index 000000000..f6a4e6a93
--- /dev/null
+++ b/tika-e2e-tests/tika-grpc/sample-configs/vision/inception-rest-video.xml
@@ -0,0 +1,32 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+ ~ Licensed to the Apache Software Foundation (ASF) under one or more
+ ~ contributor license agreements. See the NOTICE file distributed with
+ ~ this work for additional information regarding copyright ownership.
+ ~ The ASF licenses this file to You under the Apache License, Version 2.0
+ ~ (the "License"); you may not use this file except in compliance with
+ ~ the License. You may obtain a copy of the License at
+ ~
+ ~ http://www.apache.org/licenses/LICENSE-2.0
+ ~
+ ~ Unless required by applicable law or agreed to in writing, software
+ ~ distributed under the License is distributed on an "AS IS" BASIS,
+ ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ ~ See the License for the specific language governing permissions and
+ ~ limitations under the License.
+ -->
+<properties>
+ <parsers>
+ <parser
class="org.apache.tika.parser.recognition.ObjectRecognitionParser">
+ <mime>video/mp4</mime>
+ <mime>video/quicktime</mime>
+ <params>
+ <param name="apiBaseUri"
type="uri">http://inception-video:8764/inception/v4</param>
+ <param name="topN" type="int">4</param>
+ <param name="minConfidence" type="double">0.015</param>
+ <param name="mode" type="string">fixed</param>
+ <param name="class"
type="string">org.apache.tika.parser.recognition.tf.TensorflowRESTVideoRecogniser</param>
+ </params>
+ </parser>
+ </parsers>
+</properties>
\ No newline at end of file
diff --git a/tika-e2e-tests/tika-grpc/sample-configs/vision/inception-rest.json
b/tika-e2e-tests/tika-grpc/sample-configs/vision/inception-rest.json
new file mode 100644
index 000000000..eb8ba044c
--- /dev/null
+++ b/tika-e2e-tests/tika-grpc/sample-configs/vision/inception-rest.json
@@ -0,0 +1,18 @@
+{
+ "parsers": [
+ {
+ "class": "org.apache.tika.parser.recognition.ObjectRecognitionParser",
+ "supportedMimeTypes": [
+ "image/jpeg",
+ "image/png",
+ "image/gif"
+ ],
+ "params": {
+ "apiBaseUri": "http://inception-rest:8764/inception/v4",
+ "topN": 2,
+ "minConfidence": 0.015,
+ "class":
"org.apache.tika.parser.recognition.tf.TensorflowRESTRecogniser"
+ }
+ }
+ ]
+}
diff --git a/tika-e2e-tests/tika-grpc/sample-configs/vision/inception-rest.xml
b/tika-e2e-tests/tika-grpc/sample-configs/vision/inception-rest.xml
new file mode 100644
index 000000000..caa646859
--- /dev/null
+++ b/tika-e2e-tests/tika-grpc/sample-configs/vision/inception-rest.xml
@@ -0,0 +1,32 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+ ~ Licensed to the Apache Software Foundation (ASF) under one or more
+ ~ contributor license agreements. See the NOTICE file distributed with
+ ~ this work for additional information regarding copyright ownership.
+ ~ The ASF licenses this file to You under the Apache License, Version 2.0
+ ~ (the "License"); you may not use this file except in compliance with
+ ~ the License. You may obtain a copy of the License at
+ ~
+ ~ http://www.apache.org/licenses/LICENSE-2.0
+ ~
+ ~ Unless required by applicable law or agreed to in writing, software
+ ~ distributed under the License is distributed on an "AS IS" BASIS,
+ ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ ~ See the License for the specific language governing permissions and
+ ~ limitations under the License.
+ -->
+<properties>
+ <parsers>
+ <parser
class="org.apache.tika.parser.recognition.ObjectRecognitionParser">
+ <mime>image/jpeg</mime>
+ <mime>image/png</mime>
+ <mime>image/gif</mime>
+ <params>
+ <param name="apiBaseUri"
type="uri">http://inception-rest:8764/inception/v4</param>
+ <param name="topN" type="int">2</param>
+ <param name="minConfidence" type="double">0.015</param>
+ <param name="class"
type="string">org.apache.tika.parser.recognition.tf.TensorflowRESTRecogniser</param>
+ </params>
+ </parser>
+ </parsers>
+</properties>
diff --git
a/tika-e2e-tests/tika-grpc/src/test/java/org/apache/tika/pipes/ExternalTestBase.java
b/tika-e2e-tests/tika-grpc/src/test/java/org/apache/tika/pipes/ExternalTestBase.java
new file mode 100644
index 000000000..511d671c6
--- /dev/null
+++
b/tika-e2e-tests/tika-grpc/src/test/java/org/apache/tika/pipes/ExternalTestBase.java
@@ -0,0 +1,183 @@
+package org.apache.tika.pipes;
+
+import com.fasterxml.jackson.databind.ObjectMapper;
+import io.grpc.ManagedChannel;
+import io.grpc.ManagedChannelBuilder;
+import lombok.extern.slf4j.Slf4j;
+import org.apache.tika.FetchAndParseReply;
+import org.junit.jupiter.api.AfterAll;
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.api.BeforeAll;
+import org.junit.jupiter.api.Tag;
+import org.junit.jupiter.api.TestInstance;
+import org.testcontainers.containers.DockerComposeContainer;
+import org.testcontainers.containers.output.Slf4jLogConsumer;
+import org.testcontainers.containers.wait.strategy.Wait;
+import org.testcontainers.junit.jupiter.Testcontainers;
+
+import java.io.File;
+import java.io.FileInputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.OutputStream;
+import java.net.URL;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.nio.file.StandardCopyOption;
+import java.time.Duration;
+import java.time.temporal.ChronoUnit;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+import java.util.concurrent.Executors;
+import java.util.concurrent.TimeUnit;
+import java.util.regex.Pattern;
+import java.util.stream.Stream;
+import java.util.zip.ZipEntry;
+import java.util.zip.ZipInputStream;
+
+/**
+ * Base class for Tika gRPC end-to-end tests.
+ * Uses Docker Compose to start tika-grpc server and runs tests against it.
+ */
+@TestInstance(TestInstance.Lifecycle.PER_CLASS)
+@Testcontainers
+@Slf4j
+@Tag("E2ETest")
+public abstract class ExternalTestBase {
+ public static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
+ public static final int MAX_STARTUP_TIMEOUT = 120;
+ public static final String GOV_DOCS_FOLDER = "/tika/govdocs1";
+ public static final File TEST_FOLDER = new File("target", "govdocs1");
+ public static final int GOV_DOCS_FROM_IDX =
Integer.parseInt(System.getProperty("govdocs1.fromIndex", "1"));
+ public static final int GOV_DOCS_TO_IDX =
Integer.parseInt(System.getProperty("govdocs1.toIndex", "1"));
+ public static final String DIGITAL_CORPORA_ZIP_FILES_URL =
"https://corp.digitalcorpora.org/corpora/files/govdocs1/zipfiles";
+
+ public static DockerComposeContainer<?> composeContainer;
+
+ @BeforeAll
+ static void setup() throws Exception {
+ loadGovdocs1();
+
+ composeContainer = new DockerComposeContainer<>(
+ new File("src/test/resources/docker-compose.yml"))
+ .withEnv("HOST_GOVDOCS1_DIR", TEST_FOLDER.getAbsolutePath())
+ .withStartupTimeout(Duration.of(MAX_STARTUP_TIMEOUT,
ChronoUnit.SECONDS))
+ .withExposedService("tika-grpc", 50052,
+ Wait.forLogMessage(".*Server started.*\\n", 1))
+ .withLogConsumer("tika-grpc", new Slf4jLogConsumer(log));
+
+ composeContainer.start();
+
+ log.info("Docker Compose containers started successfully");
+ }
+
+ private static void loadGovdocs1() throws IOException,
InterruptedException {
+ int retries = 3;
+ int attempt = 0;
+ while (true) {
+ try {
+ downloadAndUnzipGovdocs1(GOV_DOCS_FROM_IDX, GOV_DOCS_TO_IDX);
+ break;
+ } catch (IOException e) {
+ attempt++;
+ if (attempt >= retries) {
+ throw e;
+ }
+ log.warn("Download attempt {} failed, retrying in 10
seconds...", attempt, e);
+ TimeUnit.SECONDS.sleep(10);
+ }
+ }
+ }
+
+ @AfterAll
+ void close() {
+ if (composeContainer != null) {
+ composeContainer.close();
+ }
+ }
+
+ public static void downloadAndUnzipGovdocs1(int fromIndex, int toIndex)
throws IOException {
+ Path targetDir = TEST_FOLDER.toPath();
+ Files.createDirectories(targetDir);
+
+ for (int i = fromIndex; i <= toIndex; i++) {
+ String zipName = String.format("%03d.zip", i);
+ String url = DIGITAL_CORPORA_ZIP_FILES_URL + "/" + zipName;
+ Path zipPath = targetDir.resolve(zipName);
+
+ if (Files.exists(zipPath)) {
+ log.info("{} already exists, skipping download", zipName);
+ continue;
+ }
+
+ log.info("Downloading {} from {}...", zipName, url);
+ try (InputStream in = new URL(url).openStream()) {
+ Files.copy(in, zipPath, StandardCopyOption.REPLACE_EXISTING);
+ }
+
+ log.info("Unzipping {}...", zipName);
+ try (ZipInputStream zis = new ZipInputStream(new
FileInputStream(zipPath.toFile()))) {
+ ZipEntry entry;
+ while ((entry = zis.getNextEntry()) != null) {
+ Path outPath = targetDir.resolve(entry.getName());
+ if (entry.isDirectory()) {
+ Files.createDirectories(outPath);
+ } else {
+ Files.createDirectories(outPath.getParent());
+ try (OutputStream out =
Files.newOutputStream(outPath)) {
+ zis.transferTo(out);
+ }
+ }
+ zis.closeEntry();
+ }
+ }
+ }
+
+ log.info("Finished downloading and extracting govdocs1 files");
+ }
+
+ public static void assertAllFilesFetched(Path baseDir,
List<FetchAndParseReply> successes,
+ List<FetchAndParseReply> errors) {
+ Set<String> allFetchKeys = new HashSet<>();
+ for (FetchAndParseReply reply : successes) {
+ allFetchKeys.add(reply.getFetchKey());
+ }
+ for (FetchAndParseReply reply : errors) {
+ allFetchKeys.add(reply.getFetchKey());
+ }
+
+ Set<String> keysFromGovdocs1 = new HashSet<>();
+ try (Stream<Path> paths = Files.walk(baseDir)) {
+ paths.filter(Files::isRegularFile)
+ .forEach(file -> {
+ String relPath = baseDir.relativize(file).toString();
+ if
(Pattern.compile("\\d{3}\\.zip").matcher(relPath).find()) {
+ return;
+ }
+ keysFromGovdocs1.add(relPath);
+ });
+ } catch (IOException e) {
+ throw new RuntimeException(e);
+ }
+
+ Assertions.assertNotEquals(0, successes.size(), "Should have some
successful fetches");
+ // Note: errors.size() can be 0 if all files parse successfully
+ log.info("Processed {} files: {} successes, {} errors",
allFetchKeys.size(), successes.size(), errors.size());
+ Assertions.assertEquals(keysFromGovdocs1, allFetchKeys, () -> {
+ Set<String> missing = new HashSet<>(keysFromGovdocs1);
+ missing.removeAll(allFetchKeys);
+ return "Missing fetch keys: " + missing;
+ });
+ }
+
+ public static ManagedChannel getManagedChannel() {
+ return ManagedChannelBuilder
+ .forAddress(composeContainer.getServiceHost("tika-grpc",
50052),
+ composeContainer.getServicePort("tika-grpc", 50052))
+ .usePlaintext()
+ .executor(Executors.newCachedThreadPool())
+ .maxInboundMessageSize(160 * 1024 * 1024)
+ .build();
+ }
+}
diff --git
a/tika-e2e-tests/tika-grpc/src/test/java/org/apache/tika/pipes/filesystem/FileSystemFetcherTest.java
b/tika-e2e-tests/tika-grpc/src/test/java/org/apache/tika/pipes/filesystem/FileSystemFetcherTest.java
new file mode 100644
index 000000000..d5e6c15ec
--- /dev/null
+++
b/tika-e2e-tests/tika-grpc/src/test/java/org/apache/tika/pipes/filesystem/FileSystemFetcherTest.java
@@ -0,0 +1,141 @@
+package org.apache.tika.pipes.filesystem;
+
+import com.fasterxml.jackson.core.JsonProcessingException;
+import io.grpc.ManagedChannel;
+import io.grpc.stub.StreamObserver;
+import lombok.extern.slf4j.Slf4j;
+import org.apache.tika.FetchAndParseReply;
+import org.apache.tika.FetchAndParseRequest;
+import org.apache.tika.GetFetcherReply;
+import org.apache.tika.GetFetcherRequest;
+import org.apache.tika.SaveFetcherReply;
+import org.apache.tika.SaveFetcherRequest;
+import org.apache.tika.TikaGrpc;
+import org.apache.tika.pipes.ExternalTestBase;
+import org.apache.tika.pipes.fetcher.fs.FileSystemFetcherConfig;
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.api.Test;
+
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+import java.util.UUID;
+import java.util.concurrent.CountDownLatch;
+import java.util.concurrent.TimeUnit;
+import java.util.stream.Stream;
+
+@Slf4j
+class FileSystemFetcherTest extends ExternalTestBase {
+
+ @Test
+ void testFileSystemFetcher() throws Exception {
+ String fetcherId = "defaultFetcher";
+ ManagedChannel channel = getManagedChannel();
+ TikaGrpc.TikaBlockingStub blockingStub =
TikaGrpc.newBlockingStub(channel);
+ TikaGrpc.TikaStub tikaStub = TikaGrpc.newStub(channel);
+
+ // Create and save the fetcher dynamically
+ FileSystemFetcherConfig config = new FileSystemFetcherConfig();
+ config.setBasePath("/tika/govdocs1");
+
+ String configJson = OBJECT_MAPPER.writeValueAsString(config);
+ log.info("Creating fetcher with config: {}", configJson);
+
+ SaveFetcherReply saveReply =
blockingStub.saveFetcher(SaveFetcherRequest
+ .newBuilder()
+ .setFetcherId(fetcherId)
+
.setFetcherClass("org.apache.tika.pipes.fetcher.fs.FileSystemFetcher")
+ .setFetcherConfigJson(configJson)
+ .build());
+
+ log.info("Fetcher created: {}", saveReply.getFetcherId());
+
+ List<FetchAndParseReply> successes = Collections.synchronizedList(new
ArrayList<>());
+ List<FetchAndParseReply> errors = Collections.synchronizedList(new
ArrayList<>());
+
+ CountDownLatch countDownLatch = new CountDownLatch(1);
+ StreamObserver<FetchAndParseRequest>
+ requestStreamObserver =
tikaStub.fetchAndParseBiDirectionalStreaming(new StreamObserver<>() {
+ @Override
+ public void onNext(FetchAndParseReply fetchAndParseReply) {
+ log.debug("Reply from fetch-and-parse - key={}, status={}",
+ fetchAndParseReply.getFetchKey(),
fetchAndParseReply.getStatus());
+ if
("FETCH_AND_PARSE_EXCEPTION".equals(fetchAndParseReply.getStatus())) {
+ errors.add(fetchAndParseReply);
+ } else {
+ successes.add(fetchAndParseReply);
+ }
+ }
+
+ @Override
+ public void onError(Throwable throwable) {
+ log.error("Received an error", throwable);
+ Assertions.fail(throwable);
+ countDownLatch.countDown();
+ }
+
+ @Override
+ public void onCompleted() {
+ log.info("Finished streaming fetch and parse replies");
+ countDownLatch.countDown();
+ }
+ });
+
+ // Submit all files for parsing
+ int maxDocs = Integer.parseInt(System.getProperty("corpa.numdocs",
"-1"));
+ log.info("Document limit: {}", maxDocs == -1 ? "unlimited" : maxDocs);
+
+ try (Stream<Path> paths = Files.walk(TEST_FOLDER.toPath())) {
+ Stream<Path> fileStream = paths.filter(Files::isRegularFile);
+
+ // Limit number of documents if specified
+ if (maxDocs > 0) {
+ fileStream = fileStream.limit(maxDocs);
+ }
+
+ fileStream.forEach(file -> {
+ try {
+                    String relPath = TEST_FOLDER.toPath().relativize(file).toString();
+ requestStreamObserver.onNext(FetchAndParseRequest
+ .newBuilder()
+ .setFetcherId(fetcherId)
+ .setFetchKey(relPath)
+ .build());
+ } catch (Exception e) {
+ throw new RuntimeException(e);
+ }
+ });
+ }
+ log.info("Done submitting files to fetcher {}", fetcherId);
+
+ requestStreamObserver.onCompleted();
+
+ // Wait for all parsing to complete
+ try {
+ if (!countDownLatch.await(3, TimeUnit.MINUTES)) {
+ log.error("Timed out waiting for parse to complete");
+ Assertions.fail("Timed out waiting for parsing to complete");
+ }
+ } catch (InterruptedException e) {
+ Thread.currentThread().interrupt();
+            Assertions.fail("Interrupted while waiting for parsing to complete");
+ }
+
+ // Verify all files were processed (unless we limited the number)
+ if (maxDocs == -1) {
+ assertAllFilesFetched(TEST_FOLDER.toPath(), successes, errors);
+ } else {
+ int totalProcessed = successes.size() + errors.size();
+            log.info("Processed {} documents (limit was {})", totalProcessed, maxDocs);
+ Assertions.assertTrue(totalProcessed <= maxDocs,
+ "Should not process more than " + maxDocs + " documents");
+ Assertions.assertTrue(totalProcessed > 0,
+ "Should have processed at least one document");
+ }
+
+ log.info("Test completed successfully - {} successes, {} errors",
+ successes.size(), errors.size());
+ }
+}
diff --git a/tika-e2e-tests/tika-grpc/src/test/java/org/apache/tika/pipes/ignite/IgniteConfigStoreTest.java b/tika-e2e-tests/tika-grpc/src/test/java/org/apache/tika/pipes/ignite/IgniteConfigStoreTest.java
new file mode 100644
index 000000000..f3b9293cb
--- /dev/null
+++ b/tika-e2e-tests/tika-grpc/src/test/java/org/apache/tika/pipes/ignite/IgniteConfigStoreTest.java
@@ -0,0 +1,288 @@
+package org.apache.tika.pipes.ignite;
+
+import com.fasterxml.jackson.databind.ObjectMapper;
+import io.grpc.ManagedChannel;
+import io.grpc.ManagedChannelBuilder;
+import io.grpc.stub.StreamObserver;
+import lombok.extern.slf4j.Slf4j;
+import org.apache.tika.FetchAndParseReply;
+import org.apache.tika.FetchAndParseRequest;
+import org.apache.tika.SaveFetcherReply;
+import org.apache.tika.SaveFetcherRequest;
+import org.apache.tika.TikaGrpc;
+import org.apache.tika.pipes.fetcher.fs.FileSystemFetcherConfig;
+import org.junit.jupiter.api.AfterAll;
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.api.BeforeAll;
+import org.junit.jupiter.api.Tag;
+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.TestInstance;
+import org.testcontainers.containers.DockerComposeContainer;
+import org.testcontainers.containers.output.Slf4jLogConsumer;
+import org.testcontainers.containers.wait.strategy.Wait;
+import org.testcontainers.junit.jupiter.Testcontainers;
+
+import java.io.File;
+import java.io.FileInputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.OutputStream;
+import java.net.URL;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.nio.file.StandardCopyOption;
+import java.time.Duration;
+import java.time.temporal.ChronoUnit;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+import java.util.concurrent.CountDownLatch;
+import java.util.concurrent.Executors;
+import java.util.concurrent.TimeUnit;
+import java.util.stream.Stream;
+import java.util.zip.ZipEntry;
+import java.util.zip.ZipInputStream;
+
+/**
+ * End-to-end test for Ignite ConfigStore.
+ * Tests that fetchers saved via gRPC are persisted in Ignite
+ * and available in the forked PipesServer process.
+ */
+@TestInstance(TestInstance.Lifecycle.PER_CLASS)
+@Testcontainers
+@Slf4j
+@Tag("E2ETest")
+class IgniteConfigStoreTest {
+
+ private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
+ private static final int MAX_STARTUP_TIMEOUT = 120;
+ private static final File TEST_FOLDER = new File("target", "govdocs1");
+    private static final int GOV_DOCS_FROM_IDX = Integer.parseInt(System.getProperty("govdocs1.fromIndex", "1"));
+    private static final int GOV_DOCS_TO_IDX = Integer.parseInt(System.getProperty("govdocs1.toIndex", "1"));
+    private static final String DIGITAL_CORPORA_ZIP_FILES_URL = "https://corp.digitalcorpora.org/corpora/files/govdocs1/zipfiles";
+
+ private static DockerComposeContainer<?> igniteComposeContainer;
+
+ @BeforeAll
+ static void setupIgnite() throws Exception {
+ // Load govdocs1 if not already loaded
+ if (!TEST_FOLDER.exists() || TEST_FOLDER.listFiles().length == 0) {
+ downloadAndUnzipGovdocs1(GOV_DOCS_FROM_IDX, GOV_DOCS_TO_IDX);
+ }
+
+ igniteComposeContainer = new DockerComposeContainer<>(
+ new File("src/test/resources/docker-compose-ignite.yml"))
+ .withEnv("HOST_GOVDOCS1_DIR", TEST_FOLDER.getAbsolutePath())
+                .withStartupTimeout(Duration.of(MAX_STARTUP_TIMEOUT, ChronoUnit.SECONDS))
+ .withExposedService("tika-grpc", 50052,
+ Wait.forLogMessage(".*Server started.*\\n", 1))
+ .withLogConsumer("tika-grpc", new Slf4jLogConsumer(log));
+
+ igniteComposeContainer.start();
+
+ log.info("Ignite Docker Compose containers started successfully");
+ }
+
+ @AfterAll
+ static void teardownIgnite() {
+ if (igniteComposeContainer != null) {
+ igniteComposeContainer.close();
+ }
+ }
+
+ @Test
+ void testIgniteConfigStore() throws Exception {
+ String fetcherId = "dynamicIgniteFetcher";
+ ManagedChannel channel = getManagedChannelForIgnite();
+        TikaGrpc.TikaBlockingStub blockingStub = TikaGrpc.newBlockingStub(channel);
+ TikaGrpc.TikaStub tikaStub = TikaGrpc.newStub(channel);
+
+ // Create and save the fetcher dynamically
+ FileSystemFetcherConfig config = new FileSystemFetcherConfig();
+ config.setBasePath("/tika/govdocs1");
+
+ String configJson = OBJECT_MAPPER.writeValueAsString(config);
+ log.info("Creating fetcher with Ignite ConfigStore: {}", configJson);
+
+        SaveFetcherReply saveReply = blockingStub.saveFetcher(SaveFetcherRequest
+ .newBuilder()
+ .setFetcherId(fetcherId)
+                .setFetcherClass("org.apache.tika.pipes.fetcher.fs.FileSystemFetcher")
+ .setFetcherConfigJson(configJson)
+ .build());
+
+ log.info("Fetcher saved to Ignite: {}", saveReply.getFetcherId());
+
+        List<FetchAndParseReply> successes = Collections.synchronizedList(new ArrayList<>());
+        List<FetchAndParseReply> errors = Collections.synchronizedList(new ArrayList<>());
+
+ CountDownLatch countDownLatch = new CountDownLatch(1);
+ StreamObserver<FetchAndParseRequest>
+                requestStreamObserver = tikaStub.fetchAndParseBiDirectionalStreaming(new StreamObserver<>() {
+ @Override
+ public void onNext(FetchAndParseReply fetchAndParseReply) {
+ log.debug("Reply from fetch-and-parse - key={}, status={}",
+                        fetchAndParseReply.getFetchKey(), fetchAndParseReply.getStatus());
+                if ("FETCH_AND_PARSE_EXCEPTION".equals(fetchAndParseReply.getStatus())) {
+ errors.add(fetchAndParseReply);
+ } else {
+ successes.add(fetchAndParseReply);
+ }
+ }
+
+ @Override
+ public void onError(Throwable throwable) {
+ log.error("Received an error", throwable);
+ Assertions.fail(throwable);
+ countDownLatch.countDown();
+ }
+
+ @Override
+ public void onCompleted() {
+ log.info("Finished streaming fetch and parse replies");
+ countDownLatch.countDown();
+ }
+ });
+
+ // Submit files for parsing - limit to configured number
+        int maxDocs = Integer.parseInt(System.getProperty("corpa.numdocs", "-1"));
+ log.info("Document limit: {}", maxDocs == -1 ? "unlimited" : maxDocs);
+
+ try (Stream<Path> paths = Files.walk(TEST_FOLDER.toPath())) {
+ Stream<Path> fileStream = paths.filter(Files::isRegularFile);
+
+ if (maxDocs > 0) {
+ fileStream = fileStream.limit(maxDocs);
+ }
+
+ fileStream.forEach(file -> {
+ try {
+                    String relPath = TEST_FOLDER.toPath().relativize(file).toString();
+ requestStreamObserver.onNext(FetchAndParseRequest
+ .newBuilder()
+ .setFetcherId(fetcherId)
+ .setFetchKey(relPath)
+ .build());
+ } catch (Exception e) {
+ throw new RuntimeException(e);
+ }
+ });
+ }
+        log.info("Done submitting files to Ignite-backed fetcher {}", fetcherId);
+
+ requestStreamObserver.onCompleted();
+
+ // Wait for all parsing to complete
+ try {
+ if (!countDownLatch.await(3, TimeUnit.MINUTES)) {
+ log.error("Timed out waiting for parse to complete");
+ Assertions.fail("Timed out waiting for parsing to complete");
+ }
+ } catch (InterruptedException e) {
+ Thread.currentThread().interrupt();
+            Assertions.fail("Interrupted while waiting for parsing to complete");
+ }
+
+ // Verify documents were processed
+ if (maxDocs == -1) {
+ assertAllFilesFetched(TEST_FOLDER.toPath(), successes, errors);
+ } else {
+ int totalProcessed = successes.size() + errors.size();
+            log.info("Processed {} documents with Ignite ConfigStore (limit was {})",
+ totalProcessed, maxDocs);
+ Assertions.assertTrue(totalProcessed <= maxDocs,
+ "Should not process more than " + maxDocs + " documents");
+ Assertions.assertTrue(totalProcessed > 0,
+ "Should have processed at least one document");
+ }
+
+        log.info("Ignite ConfigStore test completed successfully - {} successes, {} errors",
+ successes.size(), errors.size());
+ }
+
+ // Helper method for downloading test data
+    private static void downloadAndUnzipGovdocs1(int fromIndex, int toIndex) throws IOException {
+ Path targetDir = TEST_FOLDER.toPath();
+ Files.createDirectories(targetDir);
+
+ for (int i = fromIndex; i <= toIndex; i++) {
+ String zipName = String.format("%03d.zip", i);
+ String url = DIGITAL_CORPORA_ZIP_FILES_URL + "/" + zipName;
+ Path zipPath = targetDir.resolve(zipName);
+
+ if (Files.exists(zipPath)) {
+ log.info("{} already exists, skipping download", zipName);
+ continue;
+ }
+
+ log.info("Downloading {} from {}...", zipName, url);
+ try (InputStream in = new URL(url).openStream()) {
+ Files.copy(in, zipPath, StandardCopyOption.REPLACE_EXISTING);
+ }
+
+ log.info("Unzipping {}...", zipName);
+            try (ZipInputStream zis = new ZipInputStream(new FileInputStream(zipPath.toFile()))) {
+ ZipEntry entry;
+ while ((entry = zis.getNextEntry()) != null) {
+ Path outPath = targetDir.resolve(entry.getName());
+ if (entry.isDirectory()) {
+ Files.createDirectories(outPath);
+ } else {
+ Files.createDirectories(outPath.getParent());
+                        try (OutputStream out = Files.newOutputStream(outPath)) {
+ zis.transferTo(out);
+ }
+ }
+ zis.closeEntry();
+ }
+ }
+ }
+
+ log.info("Finished downloading and extracting govdocs1 files");
+ }
+
+ // Helper method to validate all files were fetched
+    private static void assertAllFilesFetched(Path baseDir, List<FetchAndParseReply> successes,
+ List<FetchAndParseReply> errors) {
+ java.util.Set<String> allFetchKeys = new java.util.HashSet<>();
+ for (FetchAndParseReply reply : successes) {
+ allFetchKeys.add(reply.getFetchKey());
+ }
+ for (FetchAndParseReply reply : errors) {
+ allFetchKeys.add(reply.getFetchKey());
+ }
+
+ java.util.Set<String> keysFromGovdocs1 = new java.util.HashSet<>();
+ try (Stream<Path> paths = Files.walk(baseDir)) {
+ paths.filter(Files::isRegularFile)
+ .forEach(file -> {
+ String relPath = baseDir.relativize(file).toString();
+                        if (java.util.regex.Pattern.compile("\\d{3}\\.zip").matcher(relPath).find()) {
+ return;
+ }
+ keysFromGovdocs1.add(relPath);
+ });
+ } catch (IOException e) {
+ throw new RuntimeException(e);
+ }
+
+        Assertions.assertNotEquals(0, successes.size(), "Should have some successful fetches");
+        log.info("Processed {} files: {} successes, {} errors", allFetchKeys.size(), successes.size(), errors.size());
+ Assertions.assertEquals(keysFromGovdocs1, allFetchKeys, () -> {
+            java.util.Set<String> missing = new java.util.HashSet<>(keysFromGovdocs1);
+ missing.removeAll(allFetchKeys);
+ return "Missing fetch keys: " + missing;
+ });
+ }
+
+ // Helper method to create gRPC channel
+ private static ManagedChannel getManagedChannelForIgnite() {
+ return ManagedChannelBuilder
+                .forAddress(igniteComposeContainer.getServiceHost("tika-grpc", 50052),
+                        igniteComposeContainer.getServicePort("tika-grpc", 50052))
+ .usePlaintext()
+ .executor(Executors.newCachedThreadPool())
+ .maxInboundMessageSize(160 * 1024 * 1024)
+ .build();
+ }
+}
diff --git a/tika-e2e-tests/tika-grpc/src/test/java/org/apache/tika/pipes/ignite/README.md b/tika-e2e-tests/tika-grpc/src/test/java/org/apache/tika/pipes/ignite/README.md
new file mode 100644
index 000000000..813650b83
--- /dev/null
+++ b/tika-e2e-tests/tika-grpc/src/test/java/org/apache/tika/pipes/ignite/README.md
@@ -0,0 +1,172 @@
+# Ignite ConfigStore E2E Test
+
+## Overview
+
+This test verifies that the **embedded Ignite ConfigStore** works correctly
for sharing fetcher configurations between the gRPC server and forked
PipesServer processes.
+
+## Architecture
+
+The Ignite server runs **embedded within the tika-grpc process** - no separate
Ignite deployment needed!
+
+```
+┌─────────────────────────────────┐
+│ tika-grpc Process │
+│ ┌──────────────────────────┐ │
+│ │ IgniteStoreServer │ │ ← Embedded server (daemon thread)
+│ │ (server mode) │ │
+│ └────────▲─────────────────┘ │
+│ │ │
+│ ┌────────┴─────────────────┐ │
+│ │ TikaGrpcServer │ │ ← Connects as client
+│ │ IgniteConfigStore │ │
+│ └──────────────────────────┘ │
+└─────────────────────────────────┘
+ ▲
+ │ (client connection)
+ │
+ ┌────────┴─────────────────┐
+ │ PipesServer (forked) │ ← Connects as client
+ │ IgniteConfigStore │
+ └──────────────────────────┘
+```
+
+## Test Scenario
+
+1. Start tika-grpc (automatically starts embedded Ignite server)
+2. Dynamically create a fetcher via gRPC `saveFetcher` (see the sketch after this list)
+3. Fetcher is stored in Ignite cache
+4. Process documents using forked PipesServer
+5. PipesServer connects to Ignite as client and retrieves fetcher
+6. Verify documents are processed successfully
+
+## Prerequisites
+
+- Docker and Docker Compose
+- Maven 3.6+
+- Java 17+
+- Apache Tika Docker image with Ignite support: `apache/tika-grpc:local`
+
+## Building Tika with Embedded Ignite Support
+
+Build from the `file-based-config-store` branch:
+
+```bash
+cd /path/to/tika
+git checkout file-based-config-store
+mvn clean install -DskipTests
+
+# Build Docker image
+cd /path/to/tika-grpc-docker
+./build-from-branch.sh -l /path/to/tika -t local
+```
+
+## Running the Test
+
+### Run just the Ignite test with limited documents:
+
+```bash
+mvn test -Dtest=IgniteConfigStoreTest -Dcorpa.numdocs=5
+```
+
+### Run with all documents:
+
+```bash
+mvn test -Dtest=IgniteConfigStoreTest
+```
+
+### Run all e2e tests (file + ignite):
+
+```bash
+mvn test -Dcorpa.numdocs=10
+```
+
+## Configuration
+
+The test uses `src/test/resources/tika-config-ignite.json`:
+
+```json
+{
+ "pipes": {
+ "configStoreType": "ignite",
+ "configStoreParams": "{
+ \"cacheName\": \"tika-e2e-test\",
+ \"cacheMode\": \"REPLICATED\",
+ \"igniteInstanceName\": \"TikaE2ETest\"
+ }"
+ }
+}
+```
+
+**What happens on startup:**
+1. TikaGrpcServer reads config
+2. Sees `configStoreType: "ignite"`
+3. Automatically starts `IgniteStoreServer` in background daemon thread
+4. Creates an IgniteConfigStore client that connects to the embedded server (see the sketch after this list)
+5. Ready to accept gRPC requests!
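+
+Under the hood this is a `REPLICATED` Ignite cache shared by the embedded server node and its
+clients. The following is a conceptual sketch using only Apache Ignite's public API - it is **not**
+the actual `IgniteConfigStore` implementation, but it shows how a server node and a client (playing
+the role of the forked PipesServer) see the same fetcher entry; the cache and instance names mirror
+the config above:
+
+```java
+import org.apache.ignite.Ignite;
+import org.apache.ignite.IgniteCache;
+import org.apache.ignite.Ignition;
+import org.apache.ignite.cache.CacheMode;
+import org.apache.ignite.configuration.CacheConfiguration;
+import org.apache.ignite.configuration.IgniteConfiguration;
+
+public class ConfigStoreSketch {
+    public static void main(String[] args) {
+        // Embedded server node - the role tika-grpc plays on its daemon thread.
+        Ignite server = Ignition.start(new IgniteConfiguration()
+                .setIgniteInstanceName("TikaE2ETest-server"));
+        IgniteCache<String, String> serverCache = server.getOrCreateCache(
+                new CacheConfiguration<String, String>("tika-e2e-test")
+                        .setCacheMode(CacheMode.REPLICATED));
+        serverCache.put("dynamicIgniteFetcher", "{\"basePath\":\"/tika/govdocs1\"}");
+
+        // Client node - the role the forked PipesServer plays. In reality this is a separate
+        // process; it runs in the same JVM here only to keep the sketch self-contained.
+        Ignite client = Ignition.start(new IgniteConfiguration()
+                .setIgniteInstanceName("TikaE2ETest-client")
+                .setClientMode(true));
+        String fetcherJson = client.<String, String>getOrCreateCache("tika-e2e-test")
+                .get("dynamicIgniteFetcher");
+        System.out.println("Fetcher config visible to client: " + fetcherJson);
+    }
+}
+```
+
+Because the cache is `REPLICATED`, any node that joins the topology sees every saved fetcher.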
+
+## Expected Behavior
+
+✅ Embedded Ignite server starts automatically
+✅ Fetcher created via gRPC is stored in Ignite
+✅ Forked PipesServer connects as a client and retrieves the fetcher
+✅ Documents are processed successfully
+✅ No `FetcherNotFoundException`
+
+❌ A failure here typically points to an Ignite server/client communication issue
+
+## Advantages of Embedded Architecture
+
+| Aspect | Embedded Ignite | External Ignite Cluster |
+|--------|----------------|------------------------|
+| **Deployment** | Single Docker container | Multi-container setup |
+| **Configuration** | Automatic startup | Manual cluster management |
+| **Dependencies** | None (embedded) | Requires separate Ignite deployment |
+| **Use Cases** | Single-instance, dev/test | Production multi-instance clusters |
+| **Complexity** | Low | Medium-High |
+
+## Troubleshooting
+
+**Container fails to start:**
+```bash
+docker logs <container-id>
+```
+
+**Test timeout:**
+- Increase `MAX_STARTUP_TIMEOUT` in `IgniteConfigStoreTest.java`
+- Check Docker resources (memory, CPU)
+
+**Ignite connection issues:**
+```bash
+# Check Ignite server started
+docker logs <container> | grep "Ignite server started"
+
+# Check topology
+docker logs <container> | grep "Topology snapshot"
+```
+
+**Server didn't start:**
+- Check logs for `"Starting embedded Ignite server"`
+- Verify tika-pipes-ignite plugin is in classpath
+- Check JAVA_OPTS has sufficient memory
+
+## Difference from FileSystemFetcherTest
+
+| Aspect | FileSystemFetcherTest | IgniteConfigStoreTest |
+|--------|----------------------|----------------------|
+| ConfigStore | File-based (`/tmp/tika-config-store.json`) | Embedded Ignite (in-memory) |
+| Config File | `tika-config.json` | `tika-config-ignite.json` |
+| Architecture | File I/O | Embedded server + clients |
+| Use Case | Single-instance with persistence | In-process distributed cache |
+| External Deps | None | None (embedded!) |
+
+Both tests verify dynamic fetcher management works across JVM boundaries!
+
+## Production Deployment
+
+The embedded architecture also works in production:
+1. Use a single tika-grpc instance for the simplest setup, or
+2. Run multiple tika-grpc instances - each starts its own embedded Ignite server node (see the discovery sketch below)
+3. The nodes discover each other and form a cluster
+4. The replicated cache keeps fetcher configurations in sync across all nodes
+5. No external Ignite deployment is needed
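+
+Ignite's default multicast discovery handles the node-discovery step on a flat network. In
+containerized or cloud environments where multicast is unavailable, a static IP finder is a common
+alternative. The sketch below shows that discovery configuration using Ignite's public API; the
+hostnames are illustrative placeholders, and whether these settings can be passed through to the
+embedded node depends on how tika-grpc is configured, so treat this as a starting point rather
+than a supported option:
+
+```java
+import java.util.Arrays;
+
+import org.apache.ignite.Ignition;
+import org.apache.ignite.configuration.IgniteConfiguration;
+import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
+import org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder;
+
+public class ClusterDiscoverySketch {
+    public static void main(String[] args) {
+        // List the other tika-grpc hosts explicitly instead of relying on multicast.
+        TcpDiscoveryVmIpFinder ipFinder = new TcpDiscoveryVmIpFinder()
+                .setAddresses(Arrays.asList(
+                        "tika-grpc-1:47500..47509",
+                        "tika-grpc-2:47500..47509"));
+
+        Ignition.start(new IgniteConfiguration()
+                .setIgniteInstanceName("TikaE2ETest")
+                .setDiscoverySpi(new TcpDiscoverySpi().setIpFinder(ipFinder)));
+    }
+}
+```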
+
diff --git a/tika-e2e-tests/tika-grpc/src/test/resources/docker-compose-ignite.yml b/tika-e2e-tests/tika-grpc/src/test/resources/docker-compose-ignite.yml
new file mode 100644
index 000000000..a5cdea54a
--- /dev/null
+++ b/tika-e2e-tests/tika-grpc/src/test/resources/docker-compose-ignite.yml
@@ -0,0 +1,25 @@
+version: '3.8'
+
+services:
+ # Tika gRPC server with embedded Ignite server
+ tika-grpc:
+ image: apache/tika-grpc:local
+ ports:
+ - "50052:50052"
+ volumes:
+ - ${HOST_GOVDOCS1_DIR}:/tika/govdocs1:ro
+ - ./tika-config-ignite.json:/config/tika-config.json:ro
+ command: ["-c", "/config/tika-config.json"]
+ environment:
+ - JAVA_OPTS=-Xmx2g -XX:+UseG1GC
+ networks:
+ - tika-cluster
+ healthcheck:
+ test: ["CMD", "grpc_health_probe", "-addr=:50052"]
+ interval: 10s
+ timeout: 5s
+ retries: 3
+
+networks:
+ tika-cluster:
+ driver: bridge
diff --git a/tika-e2e-tests/tika-grpc/src/test/resources/docker-compose.yml b/tika-e2e-tests/tika-grpc/src/test/resources/docker-compose.yml
new file mode 100644
index 000000000..03b6fe0ac
--- /dev/null
+++ b/tika-e2e-tests/tika-grpc/src/test/resources/docker-compose.yml
@@ -0,0 +1,16 @@
+version: '3.8'
+
+services:
+ tika-grpc:
+ image: apache/tika-grpc:local
+ ports:
+ - "50052:50052"
+ volumes:
+ - ${HOST_GOVDOCS1_DIR}:/tika/govdocs1:ro
+ - ./tika-config.json:/config/tika-config.json:ro
+ command: ["-c", "/config/tika-config.json"]
+ healthcheck:
+ test: ["CMD", "nc", "-z", "localhost", "50052"]
+ interval: 10s
+ timeout: 5s
+ retries: 5
diff --git a/tika-e2e-tests/tika-grpc/src/test/resources/log4j2.xml b/tika-e2e-tests/tika-grpc/src/test/resources/log4j2.xml
new file mode 100644
index 000000000..31da8f50b
--- /dev/null
+++ b/tika-e2e-tests/tika-grpc/src/test/resources/log4j2.xml
@@ -0,0 +1,19 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<Configuration status="WARN">
+ <Appenders>
+ <Console name="Console" target="SYSTEM_OUT">
+            <PatternLayout pattern="%d{HH:mm:ss.SSS} [%t] %-5level %logger{36} - %msg%n"/>
+ </Console>
+ </Appenders>
+ <Loggers>
+ <Root level="info">
+ <AppenderRef ref="Console"/>
+ </Root>
+ <Logger name="org.apache.tika" level="debug" additivity="false">
+ <AppenderRef ref="Console"/>
+ </Logger>
+ <Logger name="org.testcontainers" level="info" additivity="false">
+ <AppenderRef ref="Console"/>
+ </Logger>
+ </Loggers>
+</Configuration>
diff --git a/tika-e2e-tests/tika-grpc/src/test/resources/tika-config-ignite.json b/tika-e2e-tests/tika-grpc/src/test/resources/tika-config-ignite.json
new file mode 100644
index 000000000..2cca83cea
--- /dev/null
+++ b/tika-e2e-tests/tika-grpc/src/test/resources/tika-config-ignite.json
@@ -0,0 +1,52 @@
+{
+ "plugin-roots": ["/var/cache/tika/plugins"],
+ "pipes": {
+ "numClients": 1,
+ "configStoreType": "ignite",
+ "configStoreParams": "{\"cacheName\": \"tika-e2e-test\", \"cacheMode\":
\"REPLICATED\", \"igniteInstanceName\": \"TikaE2ETest\", \"autoClose\": true}",
+ "forkedJvmArgs": [
+ "--add-opens=java.base/jdk.internal.access=ALL-UNNAMED",
+ "--add-opens=java.base/jdk.internal.misc=ALL-UNNAMED",
+ "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED",
+ "--add-opens=java.base/sun.util.calendar=ALL-UNNAMED",
+ "--add-opens=java.management/com.sun.jmx.mbeanserver=ALL-UNNAMED",
+ "--add-opens=jdk.internal.jvmstat/sun.jvmstat.monitor=ALL-UNNAMED",
+      "--add-opens=java.base/sun.reflect.generics.reflectiveObjects=ALL-UNNAMED",
+ "--add-opens=jdk.management/com.sun.management.internal=ALL-UNNAMED",
+ "--add-opens=java.base/java.io=ALL-UNNAMED",
+ "--add-opens=java.base/java.nio=ALL-UNNAMED",
+ "--add-opens=java.base/java.net=ALL-UNNAMED",
+ "--add-opens=java.base/java.util=ALL-UNNAMED",
+ "--add-opens=java.base/java.util.concurrent=ALL-UNNAMED",
+ "--add-opens=java.base/java.util.concurrent.locks=ALL-UNNAMED",
+ "--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED",
+ "--add-opens=java.base/java.lang=ALL-UNNAMED",
+ "--add-opens=java.base/java.lang.invoke=ALL-UNNAMED",
+ "--add-opens=java.base/java.math=ALL-UNNAMED",
+ "--add-opens=java.sql/java.sql=ALL-UNNAMED",
+ "--add-opens=java.base/java.lang.reflect=ALL-UNNAMED",
+ "--add-opens=java.base/java.time=ALL-UNNAMED",
+ "--add-opens=java.base/java.text=ALL-UNNAMED",
+ "--add-opens=java.management/sun.management=ALL-UNNAMED",
+ "--add-opens=java.desktop/java.awt.font=ALL-UNNAMED"
+ ]
+ },
+ "fetchers": [
+ {
+ "fs": {
+ "staticFetcher": {
+ "basePath": "/tika/govdocs1"
+ }
+ }
+ }
+ ],
+ "emitters": [
+ {
+ "fs": {
+ "defaultEmitter": {
+ "basePath": "/tmp/output"
+ }
+ }
+ }
+ ]
+}
diff --git a/tika-e2e-tests/tika-grpc/src/test/resources/tika-config.json b/tika-e2e-tests/tika-grpc/src/test/resources/tika-config.json
new file mode 100644
index 000000000..05173b6a2
--- /dev/null
+++ b/tika-e2e-tests/tika-grpc/src/test/resources/tika-config.json
@@ -0,0 +1,25 @@
+{
+ "plugin-roots": ["/var/cache/tika/plugins"],
+ "pipes": {
+ "configStoreType": "file",
+ "configStoreParams": "{\"path\": \"/tmp/tika-config-store.json\"}"
+ },
+ "fetchers": [
+ {
+ "fs": {
+ "defaultFetcher": {
+ "basePath": "/tika/govdocs1"
+ }
+ }
+ }
+ ],
+ "emitters": [
+ {
+ "fs": {
+ "defaultEmitter": {
+ "basePath": "/tmp/output"
+ }
+ }
+ }
+ ]
+}