[
https://issues.apache.org/jira/browse/TIKA-4626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18053093#comment-18053093
]
Tim Allison commented on TIKA-4626:
-----------------------------------
Enriched the benchmarks to cover short and long parse times and small and
large input+output. All runs use a single thread.
short-10ms = parse time of 10 ms
long-500ms = parse time of 500 ms
small-10 = 10 reps of a small string
large-10000 = 10,000 reps of that string (about 1 MB)
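For readers who want to reproduce this kind of measurement, here is a minimal sketch of a single-threaded harness that reports the same three metrics. It is not the harness used for the numbers in this comment: `run_benchmark`, `summarize`, and `fake_parse` are hypothetical names, the nearest-rank P95 is an assumed definition, and `fake_parse` only sleeps to stand in for a parse.

```python
import time


def summarize(latencies_ms, elapsed_s):
    """Turn per-request latencies and total wall time into summary metrics."""
    ordered = sorted(latencies_ms)
    n = len(ordered)
    # Nearest-rank P95 (assumed definition): ceil(0.95 * n), in integer math.
    rank = max(1, (95 * n + 99) // 100)
    return {
        "throughput_rps": n / elapsed_s,
        "avg_ms": sum(ordered) / n,
        "p95_ms": ordered[rank - 1],
    }


def run_benchmark(handler, requests):
    """Call handler once per request, strictly sequentially, and time each call."""
    latencies_ms = []
    start = time.perf_counter()
    for req in requests:
        t0 = time.perf_counter()
        handler(req)
        latencies_ms.append((time.perf_counter() - t0) * 1000.0)
    return summarize(latencies_ms, time.perf_counter() - start)


def fake_parse(payload, parse_seconds=0.010):
    """Hypothetical stand-in for a parse: sleep, then echo the payload."""
    time.sleep(parse_seconds)
    return payload
```

Calling `run_benchmark(fake_parse, ["x" * 100] * 50)` would approximate a short-parse run; the payload contents and request count here are placeholders.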
*legacy*
==========================================================================================
SUMMARY
==========================================================================================
THROUGHPUT (req/s):
              small-10   large-10000
short-10ms       63.54         45.01
long-500ms        1.98          1.96
AVG LATENCY (ms):
              small-10   large-10000
short-10ms       15.16         21.64
long-500ms      503.46        508.38
P95 LATENCY (ms):
              small-10   large-10000
short-10ms       17.00         26.00
long-500ms      505.00        510.00
==========================================================================================
*pipes*
==========================================================================================
SUMMARY
==========================================================================================
THROUGHPUT (req/s):
              small-10   large-10000
short-10ms       50.01         32.55
long-500ms        1.97          1.94
AVG LATENCY (ms):
              small-10   large-10000
short-10ms       19.39         30.16
long-500ms      508.23        515.87
P95 LATENCY (ms):
              small-10   large-10000
short-10ms       23.00         37.00
long-500ms      513.00        519.00
Claude's analysis based on the above:
{noformat}
Pipes Overhead:
┌─────────────┬──────────────┬────────────────────┐
│             │ small output │ large output (1MB) │
├─────────────┼──────────────┼────────────────────┤
│ short parse │ +4.2ms       │ +8.5ms             │
├─────────────┼──────────────┼────────────────────┤
│ long parse  │ +4.8ms       │ +7.5ms             │
└─────────────┴──────────────┴────────────────────┘
Breakdown:
- Fixed overhead: ~4-5ms (IPC, temp file I/O, thread coordination)
- Size-dependent: ~3-4ms extra for 1MB (serialization of larger
  metadata/content)

Throughput impact:
- Short/small: 63→50 req/s (21% reduction)
- Short/large: 45→33 req/s (28% reduction)
- Long parse: essentially identical (~2 req/s both)

Bottom line: For real documents that take 100ms+ to parse, the 4-8ms
overhead is at most about 8%, and it is under 2% for the 500ms parses
measured here. The overhead only matters for trivially fast parses. The
robustness gain (server survives crashes/OOMs) is worth it.
{noformat}
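The overhead and throughput figures in the analysis can be rechecked directly from the two summaries. The sketch below copies the throughput and average-latency values from the legacy and pipes tables; the dictionary layout and the `compare` helper are illustrative names only and are not part of the benchmark code.

```python
# Values copied from the legacy and pipes summaries in this comment.
# Keys: (parse time, output size). Values: (throughput req/s, avg latency ms).
legacy = {
    ("short-10ms", "small-10"): (63.54, 15.16),
    ("short-10ms", "large-10000"): (45.01, 21.64),
    ("long-500ms", "small-10"): (1.98, 503.46),
    ("long-500ms", "large-10000"): (1.96, 508.38),
}
pipes = {
    ("short-10ms", "small-10"): (50.01, 19.39),
    ("short-10ms", "large-10000"): (32.55, 30.16),
    ("long-500ms", "small-10"): (1.97, 508.23),
    ("long-500ms", "large-10000"): (1.94, 515.87),
}


def compare(legacy, pipes):
    """Per scenario: added latency (ms and % of legacy) and throughput drop (%)."""
    rows = {}
    for key, (l_tput, l_lat) in legacy.items():
        p_tput, p_lat = pipes[key]
        overhead_ms = p_lat - l_lat
        rows[key] = {
            "overhead_ms": round(overhead_ms, 2),
            "overhead_pct": round(100 * overhead_ms / l_lat, 1),
            "throughput_drop_pct": round(100 * (l_tput - p_tput) / l_tput, 1),
        }
    return rows


results = compare(legacy, pipes)
for (parse, size), row in results.items():
    print(f"{parse:<11} {size:<12} {row}")
```

The added latency stays in the 4-9 ms range in every scenario, so its relative cost depends almost entirely on how long the parse itself takes.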
> Consider using tika-pipes in the backend for /rmeta and /tika endpoints in 4.x
> ------------------------------------------------------------------------------
>
> Key: TIKA-4626
> URL: https://issues.apache.org/jira/browse/TIKA-4626
> Project: Tika
> Issue Type: Task
> Components: tika-server
> Reporter: Tim Allison
> Priority: Major
> Attachments: tika-pipes-integration-plan.md
>
>
> In 4.x, we're consolidating the forking options to pipes parser. We've
> removed the "fork the entire server" option in main. We should consider
> swapping in tika pipes, writing to a tmp file, for /rmeta and /tika.
> This will prevent the entire server from going down on OOM, etc.
> If users want crashability, perhaps we add back in a /tika-legacy endpoint?
> I'm attaching the plan that I worked out with Claude.
> We can do the same for /meta and /unpack on a separate ticket.
> Any concerns?
