tika endpoints in 4.x

Tim Allison (Jira) Tue, 20 Jan 2026 09:03:16 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-4626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18053093#comment-18053093
 ]


Tim Allison commented on TIKA-4626:
-----------------------------------

Enriched the benchmarks to cover both short and long parse times and both 
small and large input+output. All runs are single-threaded.

 

short-10ms = parse time of 10 ms

long-500ms = parse time of 500ms

small-10 = 10 reps of a small string

large-10000 = 10k reps of that string = 1MB

*legacy*

==========================================================================================
  SUMMARY
==========================================================================================

  THROUGHPUT (req/s):
                    small-10    large-10000
  short-10ms           63.54          45.01
  long-500ms            1.98           1.96

  AVG LATENCY (ms):
                    small-10    large-10000
  short-10ms           15.16          21.64
  long-500ms          503.46         508.38

  P95 LATENCY (ms):
                    small-10    large-10000
  short-10ms           17.00          26.00
  long-500ms          505.00         510.00

==========================================================================================

*pipes*

==========================================================================================
  SUMMARY
==========================================================================================

  THROUGHPUT (req/s):
                    small-10    large-10000
  short-10ms           50.01          32.55
  long-500ms            1.97           1.94

  AVG LATENCY (ms):
                    small-10    large-10000
  short-10ms           19.39          30.16
  long-500ms          508.23         515.87

  P95 LATENCY (ms):
                    small-10    large-10000
  short-10ms           23.00          37.00
  long-500ms          513.00         519.00

Claude's analysis based on the above:
{noformat}
Pipes Overhead:
  ┌─────────────┬──────────────┬────────────────────┐
  │             │ small output │ large output (1MB) │
  ├─────────────┼──────────────┼────────────────────┤
  │ short parse │ +4.2ms       │ +8.5ms             │
  ├─────────────┼──────────────┼────────────────────┤
  │ long parse  │ +4.8ms       │ +7.5ms             │
  └─────────────┴──────────────┴────────────────────┘
  Breakdown:
  - Fixed overhead: ~4-5ms (IPC, temp file I/O, thread coordination)
  - Size-dependent: ~3-4ms extra for 1MB (serialization of larger 
    metadata/content)

  Throughput impact:
  - Short/small: 63→50 req/s (21% reduction)
  - Short/large: 45→33 req/s (28% reduction)
  - Long parse: essentially identical (~2 req/s both)

  Bottom line: For real documents that take 100ms+ to parse, the 4-8ms 
  overhead is under 10% and falls to about 1-2% at 500ms. The overhead only 
  matters for trivially fast parses. The robustness gain (server survives 
  crashes/OOMs) is worth it.
{noformat}
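
For anyone who wants to re-derive the figures in the analysis, here is a minimal sketch that recomputes the per-cell overhead and throughput drop from the two summaries. The numbers are copied from the tables above; the dictionary and function names are only illustrative and are not part of the benchmark code.

```python
# Average latency (ms) and throughput (req/s) copied from the legacy and pipes
# summaries. Keys are (parse time, payload size). All names are illustrative.
legacy_latency = {("short", "small"): 15.16, ("short", "large"): 21.64,
                  ("long", "small"): 503.46, ("long", "large"): 508.38}
pipes_latency = {("short", "small"): 19.39, ("short", "large"): 30.16,
                 ("long", "small"): 508.23, ("long", "large"): 515.87}
legacy_rps = {("short", "small"): 63.54, ("short", "large"): 45.01,
              ("long", "small"): 1.98, ("long", "large"): 1.96}
pipes_rps = {("short", "small"): 50.01, ("short", "large"): 32.55,
             ("long", "small"): 1.97, ("long", "large"): 1.94}


def overhead_ms(key):
    """Extra average latency of pipes over legacy for one benchmark cell."""
    return pipes_latency[key] - legacy_latency[key]


def throughput_drop_pct(key):
    """Throughput reduction of pipes relative to legacy, in percent."""
    return 100.0 * (legacy_rps[key] - pipes_rps[key]) / legacy_rps[key]


def relative_overhead_pct(key, parse_ms):
    """Measured overhead as a share of a hypothetical parse time, in percent."""
    return 100.0 * overhead_ms(key) / parse_ms


if __name__ == "__main__":
    for key in legacy_latency:
        print(key,
              f"overhead +{overhead_ms(key):.2f} ms,",
              f"throughput -{throughput_drop_pct(key):.1f}%")
```

`relative_overhead_pct(key, 100.0)` gives the share of a 100 ms parse that the measured overhead would represent, which is a quick way to sanity-check the bottom line for other parse times.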

> Consider using tika-pipes in the backend for /rmeta and /tika endpoints in 4.x
> ------------------------------------------------------------------------------
>
>                 Key: TIKA-4626
>                 URL: https://issues.apache.org/jira/browse/TIKA-4626
>             Project: Tika
>          Issue Type: Task
>          Components: tika-server
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: tika-pipes-integration-plan.md
>
>
> In 4.x, we're consolidating the forking options into the pipes parser. We've 
> removed the "fork the entire server" option in main. We should consider 
> swapping in tika-pipes, writing to a tmp file, for /rmeta and /tika.
> This will prevent the entire server from going down on OOM, etc.
> If users want crashability, perhaps we add back in a /tika-legacy endpoint?
> I'm attaching the plan that I worked out with Claude.
> We can do the same for /meta and /unpack on a separate ticket.
> Any concerns?
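
To make the idea in the description concrete, here is a rough sketch of the general pattern: run the parse in a child process, have it write its result to a temp file, and let the parent read the file only if the child exited cleanly. This is not tika-pipes and does not use any Tika API. The worker script, `parse_isolated`, and the `"crash"` mode are all hypothetical stand-ins for illustration.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for a parser worker. It writes its result to the path
# given in argv[1]. With argv[2] == "crash" it exits abnormally instead,
# simulating a parser crash or OOM kill.
WORKER = """
import os, sys
out_path, mode = sys.argv[1], sys.argv[2]
if mode == "crash":
    os._exit(137)
with open(out_path, "w", encoding="utf-8") as f:
    f.write("extracted text")
"""


def parse_isolated(mode, timeout_s=10.0):
    """Run the worker in a child process and return (ok, text_or_error)."""
    with tempfile.TemporaryDirectory() as tmp:
        out_path = Path(tmp) / "result.txt"
        try:
            proc = subprocess.run(
                [sys.executable, "-c", WORKER, str(out_path), mode],
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before raising on timeout.
            return False, "worker timed out"
        if proc.returncode != 0:
            # The child died, but the parent (the "server") keeps running.
            return False, f"worker exited with code {proc.returncode}"
        return True, out_path.read_text(encoding="utf-8")


if __name__ == "__main__":
    print(parse_isolated("ok"))
    print(parse_isolated("crash"))
```

The cost of this design is the per-request process coordination and temp-file I/O; the benefit is that a failing parse is reported as an error to the caller instead of taking down the process that serves requests.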



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
