[
https://issues.apache.org/jira/browse/TIKA-4595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18047940#comment-18047940
]
ASF GitHub Bot commented on TIKA-4595:
--------------------------------------
nddipiazza opened a new pull request, #2489:
URL: https://github.com/apache/tika/pull/2489
## Overview
This PR implements dynamic fetcher and emitter management for Apache Tika
Pipes through a new ConfigStore abstraction, enabling runtime configuration
changes without restarts.
## Key Features
### 1. Dynamic Configuration Management API
- **New Methods in PipesClient:**
  - `saveFetcher(FetcherConfig)` - Save fetcher at runtime
  - `saveFetcher(String fetcherId, byte[] config)` - Save from serialized config
  - `deleteFetcher(String fetcherId)` - Remove fetcher
  - `updateFetcher(String fetcherId, byte[] config)` - Update existing fetcher
### 2. ConfigStore Abstraction
**Built-in Implementations:**
- **InMemoryConfigStore** - Fast, ephemeral (default)
- **FileBasedConfigStore** - JSON persistence, cross-JVM sharing (new)
- **IgniteConfigStore** - Distributed cache for multi-instance (enhanced)
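To make the relationship between the store abstraction and the save/update/delete operations concrete, here is a minimal sketch. Everything in it is hypothetical: `SketchConfigStore`, `InMemorySketchStore`, and `FetcherAdminSketch` are illustrative names, not the actual Tika interfaces, and the semantics (save as upsert, update failing on an unknown id) are assumptions rather than behavior confirmed by this PR.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical store contract; NOT the actual Tika ConfigStore interface. */
interface SketchConfigStore {
    void put(String id, byte[] config);

    Optional<byte[]> get(String id);

    boolean remove(String id);
}

/** Ephemeral in-memory variant backed by a concurrent map. */
final class InMemorySketchStore implements SketchConfigStore {
    private final Map<String, byte[]> entries = new ConcurrentHashMap<>();

    @Override
    public void put(String id, byte[] config) {
        // Defensive copy so later changes to the caller's array are not visible here.
        entries.put(id, config.clone());
    }

    @Override
    public Optional<byte[]> get(String id) {
        byte[] value = entries.get(id);
        return value == null ? Optional.empty() : Optional.of(value.clone());
    }

    @Override
    public boolean remove(String id) {
        return entries.remove(id) != null;
    }
}

/** Hypothetical facade mirroring the save/update/delete operations. */
final class FetcherAdminSketch {
    private final SketchConfigStore store;

    FetcherAdminSketch(SketchConfigStore store) {
        this.store = store;
    }

    /** Save: create or overwrite (assumed upsert semantics). */
    void saveFetcher(String fetcherId, byte[] config) {
        store.put(fetcherId, config);
    }

    /**
     * Update: assumed to fail when the id is unknown.
     * The check-then-put below is not atomic; a real store would need
     * a compare-and-set style operation to be safe under concurrency.
     */
    void updateFetcher(String fetcherId, byte[] config) {
        if (!store.get(fetcherId).isPresent()) {
            throw new IllegalArgumentException("Unknown fetcher: " + fetcherId);
        }
        store.put(fetcherId, config);
    }

    /** Delete: returns true when an entry was actually removed. */
    boolean deleteFetcher(String fetcherId) {
        return store.remove(fetcherId);
    }
}
```

Because the facade only depends on the interface, a file-backed or distributed implementation could be swapped in without touching the calling code, which is the point of the abstraction.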
### 3. Cross-JVM Configuration Sharing
- PipesClient → gRPC Server → forked PipesServer
- Fetchers saved via gRPC are available to all workers
- Configuration survives process restarts (file/Ignite modes)
### 4. gRPC API Integration
- `SaveFetcher` RPC endpoint
- `DeleteFetcher` RPC endpoint
- `UpdateFetcher` RPC endpoint
- Fully integrated with TikaGrpcServer
## Implementation Details
### FileBasedConfigStore
- Thread-safe JSON file persistence
- Location: configurable via `path` parameter
- Atomic writes with temp file + rename
- Automatic initialization from tika-config.json
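
The "temp file + rename" pattern can be sketched with standard `java.nio.file` calls. This is an illustration under stated assumptions, not the code in this PR: the class name `AtomicFileWriteSketch` and the method `writeAtomically` are hypothetical, the parent directory is assumed to exist, and the `synchronized` keyword only guards writers inside one JVM.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper; not the actual FileBasedConfigStore implementation. */
final class AtomicFileWriteSketch {

    private AtomicFileWriteSketch() {
    }

    /**
     * Writes content to a temp file in the target's directory, then renames
     * it over the target so readers see either the old or the new file,
     * never a partially written one.
     */
    static synchronized void writeAtomically(Path target, String content) throws IOException {
        // Assumption: the parent directory already exists.
        Path dir = target.toAbsolutePath().getParent();
        // Same directory as the target, so the rename stays on one file system.
        Path tmp = Files.createTempFile(dir, target.getFileName().toString(), ".tmp");
        try {
            Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
            try {
                // Whether an existing target is replaced is implementation
                // specific per the Files.move Javadoc; POSIX rename replaces it.
                Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
            } catch (AtomicMoveNotSupportedException e) {
                // Fallback loses atomicity but still replaces the target.
                Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
            }
        } finally {
            // No-op after a successful move; cleans up if the write or move failed.
            Files.deleteIfExists(tmp);
        }
    }
}
```

A call such as `AtomicFileWriteSketch.writeAtomically(Paths.get("/tmp/tika-config-store.json"), json)` would then replace the store file in one step. Coordinating writers in different JVMs would need something extra, such as a file lock, which this sketch leaves out.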
### IgniteConfigStore Enhancements
- Embedded server architecture (no external Ignite needed)
- Client-only mode for workers
- Configurable cache mode (REPLICATED/PARTITIONED)
- JVM arguments for Java module access
### Configuration Examples
**File-Based:**
```json
{
  "pipes": {
    "configStoreType": "file",
    "configStoreParams": "{\"path\": \"/tmp/tika-config-store.json\"}"
  }
}
```
**Ignite:**
```json
{
  "pipes": {
    "configStoreType": "ignite",
    "configStoreParams": "{\"cacheName\": \"tika-config\", \"cacheMode\": \"REPLICATED\"}",
    "forkedJvmArgs": [
      "--add-opens=java.base/java.nio=ALL-UNNAMED",
      "--add-opens=java.base/java.util=ALL-UNNAMED"
    ]
  }
}
```
## Testing
### E2E Tests Added
- **FileSystemFetcherTest** - File-based ConfigStore with dynamic fetcher management ✅
- **IgniteConfigStoreTest** - Ignite ConfigStore with embedded server ✅
- Document limit feature: `-Dcorpa.numdocs=N`
Both tests verify:
- Dynamic fetcher creation via gRPC
- Cross-JVM config propagation
- Successful document processing
- Proper cleanup
## Backward Compatibility
✅ **Fully backward compatible**
- Default behavior unchanged (InMemoryConfigStore)
- Existing configs work without modification
- Optional feature - enabled via `configStoreType`
## Related Issues
Fixes: TIKA-4595
## Migration Guide
For users wanting dynamic fetcher management:
1. **File-based (recommended for single-instance):**
   - Set `"configStoreType": "file"` in the `pipes` config, with the store location given by `path` in `configStoreParams`
   - Fetchers persist across restarts
2. **Ignite (for multi-instance/distributed):**
   - Set `"configStoreType": "ignite"` in the `pipes` config
   - Add the required `--add-opens` JVM arguments via `forkedJvmArgs`
   - Ideal for Kubernetes/multi-pod deployments
## Files Changed
- **Core:** PipesClient, PipesServer, ConfigStore abstraction
- **gRPC:** TikaGrpcServerImpl with new RPCs
- **Stores:** FileBasedConfigStore, IgniteConfigStore enhancements
- **Plugins:** ExtensionConfig made Serializable
- **Tests:** E2E tests with document limit feature
## Performance Impact
- Minimal overhead (lazy initialization)
- File-based: ~1-2ms per save operation
- Ignite: sub-millisecond after warm-up
- No impact on existing in-memory mode
> Add dynamic fetcher management API to PipesClient
> -------------------------------------------------
>
> Key: TIKA-4595
> URL: https://issues.apache.org/jira/browse/TIKA-4595
> Project: Tika
> Issue Type: New Feature
> Components: tika-pipes
> Reporter: Nicholas DiPiazza
> Assignee: Nicholas DiPiazza
> Priority: Major
>
> h2. Overview
> Add API to PipesClient for dynamically creating, updating, and deleting
> fetchers at runtime through PipesServer's ConfigStore.
> h2. Current State
> * PipesServer already has ConfigStore infrastructure
> * FetcherManager and EmitterManager support runtime modifications
> * But PipesClient has no API to expose these capabilities to users
> h2. Desired Architecture
> {noformat}
> PipesClient API
> ↓
> PipesServer (forked process)
> ↓
> ConfigStore (memory, Ignite, etc.)
> {noformat}
> h2. Requirements
> # PipesClient provides public API for fetcher CRUD operations
> # All operations are sent to PipesServer via socket protocol
> # PipesServer handles requests and updates ConfigStore
> # Static fetchers from tika-config.xml/json loaded at startup
> # Dynamic fetchers managed through ConfigStore
> # Both static and dynamic fetchers available for use
> h2. Benefits
> * Users can add/modify fetchers without restarting
> * Supports multi-tenant scenarios with isolated fetcher configs
> * Enables programmatic fetcher configuration
> * Maintains backwards compatibility with static config
> h2. Implementation Tasks
> See linked sub-tasks for detailed implementation steps.