[
https://issues.apache.org/jira/browse/TIKA-4547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Nicholas DiPiazza updated TIKA-4547:
------------------------------------
Description:
Plan: Enable Distributed State Management for Tika Pipes Clustering
The current Tika Pipes architecture stores Fetcher/Emitter/PipesIterator
configurations in local memory (ExpiringFetcherStore using synchronized
HashMaps), making it impossible to create a fetcher on one server and use it on
another. This plan introduces a pluggable distributed state abstraction to
enable true clustering for both gRPC and REST servers.
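A minimal sketch of the limitation described above, assuming each server keeps
its fetcher configurations in its own synchronized map. LocalOnlyServer is a
hypothetical stand-in for one server process, not an actual Tika class.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for one server process; not a Tika class.
class LocalOnlyServer {
    // Each instance owns its own map, just as each JVM owns its own heap.
    private final Map<String, String> fetcherConfigs =
            Collections.synchronizedMap(new HashMap<>());

    void saveFetcher(String id, String configJson) {
        fetcherConfigs.put(id, configJson);
    }

    boolean hasFetcher(String id) {
        return fetcherConfigs.containsKey(id);
    }
}

class LocalStateDemo {
    public static void main(String[] args) {
        LocalOnlyServer serverA = new LocalOnlyServer();
        LocalOnlyServer serverB = new LocalOnlyServer();

        // The fetcher is registered on server A only.
        serverA.saveFetcher("fs-fetcher", "{\"basePath\":\"/data\"}");

        System.out.println("A sees fetcher: " + serverA.hasFetcher("fs-fetcher")); // true
        System.out.println("B sees fetcher: " + serverB.hasFetcher("fs-fetcher")); // false
    }
}
```

Thread-safety of the map does not help here: the state is correct within one
process but invisible to every other process.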
* Create a StateStore abstraction in tika-pipes-api: an interface with the
methods put(String key, byte[] value), get(String key), delete(String key), and
list(), plus lifecycle operations, allowing pluggable implementations
(in-memory, Apache Ignite, Redis, Hazelcast, etc.).
* Refactor ExpiringFetcherStore to use StateStore in TikaGrpcServerImpl.java,
replacing Collections.synchronizedMap with StateStore API calls for the
fetchers, fetcherConfigs, and fetcherLastAccessed maps to enable cross-server
state sharing.
* Create parallel EmitterStore and PipesIteratorStore abstractions in
tika-pipes-core that mirror the ExpiringFetcherStore pattern, applying the same
StateStore-backed approach to Emitters and PipesIterators so that all three
component types are distributed.
* Add a StateStoreFactory plugin system in tika-pipes-core using the PF4J
pattern (similar to FetcherManager and EmitterManager), loading implementations
from the stateStore section of the Tika config, with a default in-memory
implementation.
* Update PipesConfig (PipesConfig.java) to include state store configuration,
with fields such as stateStoreClass and stateStoreParams, and keep backward
compatibility with local-only deployments through sensible defaults.
* Make PipesClient and PipesServer state-aware by injecting StateStore
references in PipesClient.java and PipesServer.java, enabling forked processes
to retrieve fetcher/emitter configs from the distributed store rather than
requiring XML rewrites.
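The steps above could be sketched roughly as follows. The method names put,
get, delete, and list come from the plan. Everything else is an assumption made
for illustration: the return types, the InMemoryStateStore class, the
FetcherConfigStore wrapper, and the "fetcher:" key prefix. This is not the
final Tika API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the proposed abstraction. Return types are assumptions.
interface StateStore extends AutoCloseable {
    void put(String key, byte[] value);
    Optional<byte[]> get(String key);
    boolean delete(String key);
    Set<String> list();
    @Override
    void close(); // narrowed so callers do not have to handle a checked exception
}

// Assumed default implementation for local-only deployments.
class InMemoryStateStore implements StateStore {
    private final ConcurrentHashMap<String, byte[]> data = new ConcurrentHashMap<>();

    @Override
    public void put(String key, byte[] value) {
        data.put(key, value.clone()); // defensive copy: callers cannot mutate stored bytes
    }

    @Override
    public Optional<byte[]> get(String key) {
        byte[] value = data.get(key);
        return value == null ? Optional.empty() : Optional.of(value.clone());
    }

    @Override
    public boolean delete(String key) {
        return data.remove(key) != null;
    }

    @Override
    public Set<String> list() {
        return Collections.unmodifiableSet(new HashSet<>(data.keySet()));
    }

    @Override
    public void close() {
        data.clear();
    }
}

// Hypothetical wrapper that keeps fetcher configs in a StateStore as UTF-8 JSON.
class FetcherConfigStore {
    private static final String PREFIX = "fetcher:"; // assumed key namespace
    private final StateStore store;

    FetcherConfigStore(StateStore store) {
        this.store = store;
    }

    void save(String fetcherId, String configJson) {
        store.put(PREFIX + fetcherId, configJson.getBytes(StandardCharsets.UTF_8));
    }

    Optional<String> load(String fetcherId) {
        return store.get(PREFIX + fetcherId)
                .map(bytes -> new String(bytes, StandardCharsets.UTF_8));
    }
}

class StateStoreDemo {
    public static void main(String[] args) {
        // One shared store stands in for a distributed backend such as Ignite or Redis.
        StateStore shared = new InMemoryStateStore();
        FetcherConfigStore serverA = new FetcherConfigStore(shared);
        FetcherConfigStore serverB = new FetcherConfigStore(shared);

        serverA.save("fs-fetcher", "{\"basePath\":\"/data\"}");
        // Server B can read what server A wrote because both use the same store.
        System.out.println(serverB.load("fs-fetcher").orElse("<missing>"));
        shared.close();
    }
}
```

Storing values as byte[] keeps the interface independent of any serialization
format, so a backend only needs to support string keys and binary values. The
in-memory store shares state only within one JVM, so actual cross-server
sharing would require a networked implementation behind the same interface.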
was:
See [https://github.com/nddipiazza/tika-pipes/tree/main/tika-pipes-grpc].
This project implements tika-grpc with Apache Ignite to make clustering
possible. The same change is needed in tika-grpc; otherwise tika-grpc must
always run as a single process.
> Update tika pipes so that it can be properly clustered
> ------------------------------------------------------
>
> Key: TIKA-4547
> URL: https://issues.apache.org/jira/browse/TIKA-4547
> Project: Tika
> Issue Type: Task
> Reporter: Nicholas DiPiazza
> Priority: Major
>