[ 
https://issues.apache.org/jira/browse/TIKA-4547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza updated TIKA-4547:
------------------------------------
    Description: 
Plan: Enable Distributed State Management for Tika Pipes Clustering

The current Tika Pipes architecture stores Fetcher/Emitter/PipesIterator 
configurations in local memory (ExpiringFetcherStore using synchronized 
HashMaps), making it impossible to create a fetcher on one server and use it on 
another. This plan introduces a pluggable distributed state abstraction to 
enable true clustering for both gRPC and REST servers.
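
To make the limitation concrete, here is a minimal sketch. It assumes a simplified per-process store: LocalFetcherStore and its methods are hypothetical stand-ins for illustration, not the actual ExpiringFetcherStore code. Two instances play the role of two servers.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a per-process store backed by a synchronized HashMap.
final class LocalFetcherStore {
    private final Map<String, String> fetcherConfigs =
            Collections.synchronizedMap(new HashMap<>());

    void createFetcher(String id, String configJson) {
        fetcherConfigs.put(id, configJson);
    }

    boolean hasFetcher(String id) {
        return fetcherConfigs.containsKey(id);
    }
}

final class LocalStateDemo {
    public static void main(String[] args) {
        // Each "server" owns its own map, so state never leaves the process.
        LocalFetcherStore serverA = new LocalFetcherStore();
        LocalFetcherStore serverB = new LocalFetcherStore();

        serverA.createFetcher("fs-1", "{\"basePath\":\"/data\"}");

        System.out.println(serverA.hasFetcher("fs-1")); // true
        System.out.println(serverB.hasFetcher("fs-1")); // false
    }
}
```

A request that is load-balanced to serverB after the fetcher was created on serverA would fail to find it, which is the gap the plan addresses.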

* Create a StateStore abstraction in tika-pipes-api: an interface with the
methods put(String key, byte[] value), get(String key), delete(String key), and
list(), plus lifecycle operations, allowing pluggable implementations
(in-memory, Apache Ignite, Redis, Hazelcast, etc.).

* Refactor ExpiringFetcherStore to use StateStore in TikaGrpcServerImpl.java,
replacing Collections.synchronizedMap with StateStore API calls for the
fetchers, fetcherConfigs, and fetcherLastAccessed maps to enable cross-server
state sharing.

* Create parallel EmitterStore and PipesIteratorStore abstractions in
tika-pipes-core that mirror the ExpiringFetcherStore pattern, applying the same
StateStore-backed approach to Emitters and PipesIterators so that all three
component types are distributed.

* Add a StateStoreFactory plugin system in tika-pipes-core using the PF4J
pattern (similar to FetcherManager and EmitterManager), loading implementations
from the stateStore section of the Tika config, with a default in-memory
implementation.

* Update PipesConfig (PipesConfig.java) to include state store configuration,
with fields such as stateStoreClass and stateStoreParams, and sensible defaults
that keep local-only deployments backward compatible.

* Make PipesClient and PipesServer state-aware by injecting StateStore
references in PipesClient.java and PipesServer.java, enabling forked processes
to retrieve fetcher/emitter configs from the distributed store rather than
requiring XML rewrites.
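
The StateStore idea in the plan above can be sketched roughly as follows. This is an illustration under stated assumptions, not the proposed implementation: the plan names only put, get, delete, list, and "lifecycle operations", so the return types, the null-on-miss behaviour, the use of close() as the lifecycle hook, the "fetcher:" key prefix, and the FetcherConfigRegistry helper are all hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Assumed shape of the interface; only the method names come from the plan.
interface StateStore extends AutoCloseable {
    void put(String key, byte[] value);
    byte[] get(String key);      // assumption: returns null when the key is absent
    boolean delete(String key);  // assumption: true if an entry was removed
    Set<String> list();          // assumption: snapshot of all keys
    @Override
    void close();                // stands in for the "lifecycle operations"
}

// Default local-only implementation, comparable in scope to today's in-memory maps.
final class InMemoryStateStore implements StateStore {
    private final Map<String, byte[]> entries = new ConcurrentHashMap<>();

    @Override
    public void put(String key, byte[] value) {
        entries.put(key, value.clone()); // defensive copy of the caller's array
    }

    @Override
    public byte[] get(String key) {
        byte[] value = entries.get(key);
        return value == null ? null : value.clone();
    }

    @Override
    public boolean delete(String key) {
        return entries.remove(key) != null;
    }

    @Override
    public Set<String> list() {
        return Set.copyOf(entries.keySet());
    }

    @Override
    public void close() {
        entries.clear();
    }
}

// Hypothetical helper: a server-side view that keeps fetcher configs in a StateStore.
final class FetcherConfigRegistry {
    private static final String PREFIX = "fetcher:";
    private final StateStore store;

    FetcherConfigRegistry(StateStore store) {
        this.store = store;
    }

    void saveFetcherConfig(String fetcherId, String configJson) {
        store.put(PREFIX + fetcherId, configJson.getBytes(StandardCharsets.UTF_8));
    }

    String getFetcherConfig(String fetcherId) {
        byte[] raw = store.get(PREFIX + fetcherId);
        return raw == null ? null : new String(raw, StandardCharsets.UTF_8);
    }
}

final class StateStoreDemo {
    public static void main(String[] args) {
        // One shared store stands in for a distributed backend such as Ignite or Redis.
        try (StateStore shared = new InMemoryStateStore()) {
            FetcherConfigRegistry serverA = new FetcherConfigRegistry(shared);
            FetcherConfigRegistry serverB = new FetcherConfigRegistry(shared);

            serverA.saveFetcherConfig("fs-1", "{\"basePath\":\"/data\"}");

            // A config saved through serverA is readable through serverB.
            System.out.println(serverB.getFetcherConfig("fs-1"));
        }
    }
}
```

Keeping values as byte[] leaves serialization to the caller, so a distributed backend would not need Tika classes on its classpath. A real backend would replace InMemoryStateStore without changes to FetcherConfigRegistry.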

 

  was:
See [https://github.com/nddipiazza/tika-pipes/tree/main/tika-pipes-grpc].

This project runs tika-grpc with Apache Ignite to make clustering possible.

The same change is needed in tika-grpc; otherwise tika-grpc must always run as
a single process.


> Update tika pipes so that it can be properly clustered
> ------------------------------------------------------
>
>                 Key: TIKA-4547
>                 URL: https://issues.apache.org/jira/browse/TIKA-4547
>             Project: Tika
>          Issue Type: Task
>            Reporter: Nicholas DiPiazza
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
