[ 
https://issues.apache.org/jira/browse/TIKA-4595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18047940#comment-18047940
 ] 

ASF GitHub Bot commented on TIKA-4595:
--------------------------------------

nddipiazza opened a new pull request, #2489:
URL: https://github.com/apache/tika/pull/2489

   ## Overview
   
   This PR implements dynamic fetcher and emitter management for Apache Tika Pipes through a new ConfigStore abstraction, enabling runtime configuration changes without restarts.
   
   ## Key Features
   
   ### 1. Dynamic Configuration Management API
   - **New Methods in PipesClient:**
     - `saveFetcher(FetcherConfig)` - Save fetcher at runtime
     - `saveFetcher(String fetcherId, byte[] config)` - Save from serialized config
     - `deleteFetcher(String fetcherId)` - Remove fetcher
     - `updateFetcher(String fetcherId, byte[] config)` - Update existing fetcher
     
   ### 2. ConfigStore Abstraction
   **Built-in Implementations:**
   - **InMemoryConfigStore** - Fast, ephemeral (default)
   - **FileBasedConfigStore** - JSON persistence, cross-JVM sharing (new)
   - **IgniteConfigStore** - Distributed cache for multi-instance (enhanced)
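   
   As a rough illustration of what such an abstraction might look like, the sketch below defines a hypothetical `SketchConfigStore` interface and an in-memory implementation backed by a `ConcurrentHashMap`. The interface name, the method names, and the choice of `byte[]` values are assumptions made for illustration only; they are not taken from this PR, and the actual Tika `ConfigStore` API may differ.
   
   ```java
   import java.util.Map;
   import java.util.Optional;
   import java.util.Set;
   import java.util.concurrent.ConcurrentHashMap;

   // Hypothetical interface for illustration; not the actual Tika ConfigStore API.
   interface SketchConfigStore {
       void put(String id, byte[] config);

       Optional<byte[]> get(String id);

       boolean remove(String id);

       Set<String> ids();
   }

   // Ephemeral store: contents are lost when the JVM exits.
   class InMemorySketchStore implements SketchConfigStore {
       private final Map<String, byte[]> configs = new ConcurrentHashMap<>();

       @Override
       public void put(String id, byte[] config) {
           // Defensive copy so later changes to the caller's array do not leak in.
           configs.put(id, config.clone());
       }

       @Override
       public Optional<byte[]> get(String id) {
           byte[] stored = configs.get(id);
           return stored == null ? Optional.empty() : Optional.of(stored.clone());
       }

       @Override
       public boolean remove(String id) {
           // True only if an entry with this id was actually present.
           return configs.remove(id) != null;
       }

       @Override
       public Set<String> ids() {
           // Immutable snapshot of the ids known at the time of the call.
           return Set.copyOf(configs.keySet());
       }
   }
   ```
   
   A file-backed or distributed variant could implement the same interface, which is the kind of swap the `configStoreType` setting selects between.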
   
   ### 3. Cross-JVM Configuration Sharing
   - PipesClient → gRPC Server → forked PipesServer
   - Fetchers saved via gRPC are available to all workers
   - Configuration survives process restarts (file/Ignite modes)
   
   ### 4. gRPC API Integration
   - `SaveFetcher` RPC endpoint
   - `DeleteFetcher` RPC endpoint  
   - `UpdateFetcher` RPC endpoint
   - Fully integrated with TikaGrpcServer
   
   ## Implementation Details
   
   ### FileBasedConfigStore
   - Thread-safe JSON file persistence
   - Location: configurable via `path` parameter
   - Atomic writes with temp file + rename
   - Automatic initialization from tika-config.json
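   
   The temp file + rename pattern mentioned above can be sketched with the standard `java.nio.file` API. This is a minimal sketch under stated assumptions, not the PR's `FileBasedConfigStore` code: the `AtomicJsonFile` class and its `write`/`read` methods are hypothetical names, and the fallback to a non-atomic move is an illustrative choice.
   
   ```java
   import java.io.IOException;
   import java.io.UncheckedIOException;
   import java.nio.charset.StandardCharsets;
   import java.nio.file.AtomicMoveNotSupportedException;
   import java.nio.file.Files;
   import java.nio.file.Path;
   import java.nio.file.StandardCopyOption;

   // Hypothetical helper for illustration; not the PR's FileBasedConfigStore code.
   final class AtomicJsonFile {
       private AtomicJsonFile() {
       }

       // Writes json to a sibling temp file, then renames it over the target.
       // Assumes the target has a parent directory once made absolute.
       static synchronized void write(Path target, String json) {
           Path dir = target.toAbsolutePath().getParent();
           try {
               // Same directory as the target so the rename stays on one filesystem.
               Path tmp = Files.createTempFile(dir, "config-store", ".tmp");
               try {
                   Files.write(tmp, json.getBytes(StandardCharsets.UTF_8));
                   try {
                       // Whether an existing target is replaced is platform specific.
                       Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
                   } catch (AtomicMoveNotSupportedException e) {
                       // Fallback is not atomic; some filesystems cannot do better.
                       Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
                   }
               } finally {
                   // No-op after a successful move; cleans up after a failure.
                   Files.deleteIfExists(tmp);
               }
           } catch (IOException e) {
               throw new UncheckedIOException(e);
           }
       }

       // Reads the whole file back as UTF-8 text.
       static String read(Path target) {
           try {
               return new String(Files.readAllBytes(target), StandardCharsets.UTF_8);
           } catch (IOException e) {
               throw new UncheckedIOException(e);
           }
       }
   }
   ```
   
   Readers in other JVMs then see either the old file or the new one, never a partially written one, provided the filesystem supports atomic rename. The `synchronized` keyword only serializes writers inside a single JVM; coordinating writers across processes would need something more, such as a file lock.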
   
   ### IgniteConfigStore Enhancements
   - Embedded server architecture (no external Ignite needed)
   - Client-only mode for workers
   - Configurable cache mode (REPLICATED/PARTITIONED)
   - JVM arguments for Java module access
   
   ### Configuration Examples
   
   **File-Based:**
   ```json
   {
     "pipes": {
       "configStoreType": "file",
       "configStoreParams": "{\"path\": \"/tmp/tika-config-store.json\"}"
     }
   }
   ```
   
   **Ignite:**
   ```json
   {
     "pipes": {
       "configStoreType": "ignite",
       "configStoreParams": "{\"cacheName\": \"tika-config\", \"cacheMode\": \"REPLICATED\"}",
       "forkedJvmArgs": [
         "--add-opens=java.base/java.nio=ALL-UNNAMED",
         "--add-opens=java.base/java.util=ALL-UNNAMED"
       ]
     }
   }
   ```
   
   ## Testing
   
   ### E2E Tests Added
   - **FileSystemFetcherTest** - File-based ConfigStore with dynamic fetcher management ✅
   - **IgniteConfigStoreTest** - Ignite ConfigStore with embedded server ✅
   - Document limit feature: `-Dcorpa.numdocs=N`
   
   Both tests verify:
   - Dynamic fetcher creation via gRPC
   - Cross-JVM config propagation
   - Successful document processing
   - Proper cleanup
   
   ## Backward Compatibility
   
   ✅ **Fully backward compatible**
   - Default behavior unchanged (InMemoryConfigStore)
   - Existing configs work without modification
   - Optional feature - enabled via `configStoreType`
   
   ## Related Issues
   
   Fixes: TIKA-4595
   
   ## Migration Guide
   
   For users wanting dynamic fetcher management:
   
   1. **File-based (recommended for single-instance):**
      - Add `configStoreType: file` to pipes config
      - Fetchers persist across restarts
   
   2. **Ignite (for multi-instance/distributed):**
      - Add `configStoreType: ignite` 
      - Add required JVM arguments
      - Ideal for Kubernetes/multi-pod deployments
   
   ## Files Changed
   
   - **Core:** PipesClient, PipesServer, ConfigStore abstraction
   - **gRPC:** TikaGrpcServerImpl with new RPCs
   - **Stores:** FileBasedConfigStore, IgniteConfigStore enhancements
   - **Plugins:** ExtensionConfig made Serializable
   - **Tests:** E2E tests with document limit feature
   
   ## Performance Impact
   
   - Minimal overhead (lazy initialization)
   - File-based: ~1-2ms per save operation
   - Ignite: sub-millisecond after warm-up
   - No impact on existing in-memory mode




> Add dynamic fetcher management API to PipesClient
> -------------------------------------------------
>
>                 Key: TIKA-4595
>                 URL: https://issues.apache.org/jira/browse/TIKA-4595
>             Project: Tika
>          Issue Type: New Feature
>          Components: tika-pipes
>            Reporter: Nicholas DiPiazza
>            Assignee: Nicholas DiPiazza
>            Priority: Major
>
> h2. Overview
> Add API to PipesClient for dynamically creating, updating, and deleting 
> fetchers at runtime through PipesServer's ConfigStore.
> h2. Current State
> * PipesServer already has ConfigStore infrastructure
> * FetcherManager and EmitterManager support runtime modifications
> * But PipesClient has no API to expose these capabilities to users
> h2. Desired Architecture
> {noformat}
> PipesClient API
>     ↓
> PipesServer (forked process)
>     ↓
> ConfigStore (memory, Ignite, etc.)
> {noformat}
> h2. Requirements
> # PipesClient provides public API for fetcher CRUD operations
> # All operations are sent to PipesServer via socket protocol  
> # PipesServer handles requests and updates ConfigStore
> # Static fetchers from tika-config.xml/json loaded at startup
> # Dynamic fetchers managed through ConfigStore
> # Both static and dynamic fetchers available for use
> h2. Benefits
> * Users can add/modify fetchers without restarting
> * Supports multi-tenant scenarios with isolated fetcher configs
> * Enables programmatic fetcher configuration
> * Maintains backwards compatibility with static config
> h2. Implementation Tasks
> See linked sub-tasks for detailed implementation steps.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
