[ https://issues.apache.org/jira/browse/TIKA-4626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18053053#comment-18053053 ]
Tim Allison commented on TIKA-4626:
-----------------------------------
Claude benchmarked the diff and found that we should turn off Nagle's
algorithm on the TCP socket (socket.setTcpNoDelay(true)). This brought the
overhead down to ~6ms per file.
I think this is acceptable, and, frankly, really surprising... in a good way.
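
For anyone who has not used the option, here is a minimal, self-contained sketch of what disabling Nagle's algorithm looks like on a loopback connection. It is not the tika-pipes code: the class name NoDelayDemo and the echo exchange are made up for illustration, and only the socket.setTcpNoDelay(true) call comes from the change described above.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical demo class; not part of Tika.
public class NoDelayDemo {

    // Sends one small message over a loopback TCP connection with Nagle's
    // algorithm disabled on both ends, and returns the echoed reply.
    public static String roundTrip(String message) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback)) {
            // Echo side: accept one connection and write back what it reads.
            Thread echo = new Thread(() -> {
                try (Socket s = server.accept()) {
                    s.setTcpNoDelay(true); // send small writes immediately
                    DataInputStream in = new DataInputStream(s.getInputStream());
                    DataOutputStream out = new DataOutputStream(s.getOutputStream());
                    out.writeUTF(in.readUTF());
                    out.flush();
                } catch (IOException e) {
                    throw new IllegalStateException(e);
                }
            });
            echo.start();

            try (Socket client = new Socket(loopback, server.getLocalPort())) {
                client.setTcpNoDelay(true); // the option discussed in this comment
                DataOutputStream out = new DataOutputStream(client.getOutputStream());
                DataInputStream in = new DataInputStream(client.getInputStream());
                out.writeUTF(message);
                out.flush();
                String reply = in.readUTF();
                echo.join();
                return reply;
            }
        } catch (IOException | InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("ping"));
    }
}
```

With the option off (the default), small writes can be held back while earlier data is still unacknowledged, which is the kind of per-message delay the benchmark attributes to the socket.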
From Claude:
{noformat}
Overhead Analysis:
┌────────────────────────┬──────────────┬──────────────┬───────────┐
│ Metric                 │ Legacy       │ Pipes        │ Overhead  │
├────────────────────────┼──────────────┼──────────────┼───────────┤
│ Short-sleep latency    │ 15.93ms      │ 22.22ms      │ ~6ms      │
├────────────────────────┼──────────────┼──────────────┼───────────┤
│ Long-sleep latency     │ 504.12ms     │ 509.61ms     │ ~5.5ms    │
├────────────────────────┼──────────────┼──────────────┼───────────┤
│ Short-sleep throughput │ 241.65 req/s │ 170.83 req/s │ 29% lower │
├────────────────────────┼──────────────┼──────────────┼───────────┤
│ Long-sleep throughput  │ 7.93 req/s   │ 7.83 req/s   │ ~1% lower │
└────────────────────────┴──────────────┴──────────────┴───────────┘
Before vs After Nagle Fix:
- Before: ~120ms overhead (3 × 40ms socket delays)
- After: ~6ms overhead
- Improvement: 20x reduction in IPC overhead

Interpretation:
- For short operations (10ms): 29% throughput reduction - noticeable but
acceptable
- For long operations (500ms): ~1% throughput reduction - negligible
- For real-world parsing (typically 100ms-10s): overhead becomes
insignificant

The ~6ms remaining overhead is from:
- Serialization: ~1.2ms
- Temp file I/O: ~0.5ms
- Socket I/O: ~0.2ms
- Thread/process coordination: ~4ms
{noformat}
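
As a sanity check on the figures in the block above, here is a small sketch that recomputes the derived numbers from the raw ones. The class and method names are hypothetical, and the inputs are copied from the table and the component list; nothing here is measured independently.

```java
// Hypothetical helper for re-deriving the reported overhead figures.
public class OverheadMath {

    // Absolute latency overhead in milliseconds (pipes minus legacy).
    public static double latencyOverheadMs(double legacyMs, double pipesMs) {
        return pipesMs - legacyMs;
    }

    // Throughput loss as a percentage of the legacy throughput.
    public static double throughputDropPercent(double legacyReqPerSec, double pipesReqPerSec) {
        return 100.0 * (legacyReqPerSec - pipesReqPerSec) / legacyReqPerSec;
    }

    // Sum of per-component overhead estimates in milliseconds.
    public static double sumMs(double... partsMs) {
        double total = 0.0;
        for (double part : partsMs) {
            total += part;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.printf("short-sleep latency overhead: %.2f ms%n",
                latencyOverheadMs(15.93, 22.22));
        System.out.printf("long-sleep latency overhead:  %.2f ms%n",
                latencyOverheadMs(504.12, 509.61));
        System.out.printf("short-sleep throughput drop:  %.1f%%%n",
                throughputDropPercent(241.65, 170.83));
        System.out.printf("long-sleep throughput drop:   %.1f%%%n",
                throughputDropPercent(7.93, 7.83));
        // Serialization + temp file I/O + socket I/O + coordination.
        System.out.printf("component sum:                %.1f ms%n",
                sumMs(1.2, 0.5, 0.2, 4.0));
        // Before/after ratio for the Nagle fix.
        System.out.printf("nagle fix improvement:        %.0fx%n", 120.0 / 6.0);
    }
}
```

The reported values hold up: the latency differences are about 6.29ms and 5.49ms, the throughput drops are about 29.3% and 1.3%, the four components add up to about 5.9ms, and 120ms / 6ms gives the 20x figure.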
> Consider using tika-pipes in the backend for /rmeta and /tika endpoints in 4.x
> ------------------------------------------------------------------------------
>
> Key: TIKA-4626
> URL: https://issues.apache.org/jira/browse/TIKA-4626
> Project: Tika
> Issue Type: Task
> Components: tika-server
> Reporter: Tim Allison
> Priority: Major
> Attachments: tika-pipes-integration-plan.md
>
>
> In 4.x, we're consolidating the forking options to pipes parser. We've
> removed the "fork the entire server" option in main. We should consider
> swapping in tika pipes, writing to a tmp file, for /rmeta and /tika.
> This will prevent the entire server from going down on OOM, etc.
> If users want crashability, perhaps we add back in a /tika-legacy endpoint?
> I'm attaching the plan that I worked out with Claude.
> We can do the same for /meta and /unpack on a separate ticket.
> Any concerns?
--
This message was sent by Atlassian Jira
(v8.20.10#820010)