[
https://issues.apache.org/jira/browse/TIKA-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237679#comment-17237679
]
Tim Allison commented on TIKA-3221:
-----------------------------------
So, short of returning the text extracted so far, I think it would still be
useful to know which file caused a timeout or OOM. At this point, in a
multithreaded environment, I don't think there's a good way to know which file
caused the timeout.
We can add a timeout thread per parse, and the overhead is minimal. For a
"heavy hang" parse of 100 ms per file, with 10 threads and 100 calls per
thread, the total elapsed time without a timeout thread (the current behavior)
is 124644 ms, and with a timeout thread it is 125634 ms (averaged over 5 runs),
a difference of under 1%. The wall-clock times are indistinguishable at around
11700 ms.
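As a rough illustration of the per-parse timeout idea, here is a minimal sketch
in plain Java. It is not the tika-server code: TimedParse, Result, and
parseWithTimeout are hypothetical names, and the parse itself is stood in for
by a generic Callable. The point it shows is that a dedicated worker per parse
ties a timeout to exactly one file name.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Tika: runs one parse under a deadline and
// reports which file was being parsed if the deadline passes.
final class TimedParse {

    /** Outcome of one parse attempt; all names here are illustrative. */
    static final class Result {
        final String fileName;
        final String text;      // null when the parse timed out
        final boolean timedOut;

        Result(String fileName, String text, boolean timedOut) {
            this.fileName = fileName;
            this.text = text;
            this.timedOut = timedOut;
        }
    }

    static Result parseWithTimeout(String fileName, Callable<String> parse, long timeoutMs)
            throws ExecutionException, InterruptedException {
        // One dedicated daemon worker per parse, so a timeout maps to one file.
        ExecutorService worker = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parse-" + fileName);
            t.setDaemon(true);
            return t;
        });
        Future<String> future = worker.submit(parse);
        try {
            String text = future.get(timeoutMs, TimeUnit.MILLISECONDS);
            return new Result(fileName, text, false);
        } catch (TimeoutException e) {
            // Interrupts the worker; a parser that ignores interrupts keeps running.
            future.cancel(true);
            return new Result(fileName, null, true);
        } finally {
            worker.shutdownNow();
        }
    }
}
```

A caller would log result.fileName whenever result.timedOut is true. Note that
cancelling only interrupts the worker thread, so this identifies the offending
file but does not by itself reclaim a parser that is truly hung.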
> /rmeta/text endpoint - allow a "max parse time" parameter where after
> exceeded, return bytes/metadata managed to get up to that point
> -------------------------------------------------------------------------------------------------------------------------------------
>
> Key: TIKA-3221
> URL: https://issues.apache.org/jira/browse/TIKA-3221
> Project: Tika
> Issue Type: Bug
> Reporter: Nicholas DiPiazza
> Priority: Major
>
> Can we make a change to the
> {code}
> /rmeta/text
> {code}
> endpoint to allow a "max parse time" parameter where after exceeded, return
> bytes/metadata managed to get up to that point.
> Motivation:
> I have a massive number of documents that I need to fetch through Apache
> Tika Server.
> Prior to making the switch to Tika Server, I used a project I created myself,
> https://github.com/nddipiazza/tika-fork, which created forked Tika VMs and
> sent work to the VMs directly through sockets.
> This was OK but super complicated, so I chose to switch to the Tika Jetty
> server for simplicity's sake.
> Tika Server works great for the most part for this use case... But one
> feature I had before was that I could say "If I don't get a result within
> MAX_PARSE_TIMEOUT_MS, then stop parsing at that moment and return the bytes
> we managed to get up to that point."
> This is because with the massive number of documents I need to parse, I
> cannot afford to have any parse hang longer than a certain amount of time.
> But conversely, if I make the timeout 20 seconds, then I suffer massive gaps
> with *no* content at all.
> With the rmeta/text method, we recently added the ability to send a
> writeLimit, where we stop parsing after we reach that number of bytes.
> I'm hoping we can do the same for parse time. Perhaps when checking the byte
> size, periodically check the time and quit the parser in the same way.
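
The suggestion above, checking elapsed time at the same point where the write
limit is checked, could be sketched roughly as follows. This is plain Java and
not Tika's writeLimit implementation: BoundedWriter, LimitReachedException, and
textSoFar are hypothetical names, and the limit here counts characters rather
than bytes.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical sketch, not Tika code: a writer that stops accepting output
// once a character limit or a wall-clock deadline is reached, while keeping
// whatever text was written before that point.
final class BoundedWriter extends Writer {

    /** Signals that a limit was hit; the caller can still read textSoFar(). */
    static final class LimitReachedException extends IOException {
        LimitReachedException(String message) {
            super(message);
        }
    }

    private final StringWriter out = new StringWriter();
    private final int writeLimit;      // maximum number of characters to keep
    private final long deadlineNanos;  // System.nanoTime() value of the deadline
    private int written;

    BoundedWriter(int writeLimit, long maxParseTimeMs) {
        this.writeLimit = writeLimit;
        this.deadlineNanos = System.nanoTime() + maxParseTimeMs * 1_000_000L;
    }

    @Override
    public void write(char[] cbuf, int off, int len) throws IOException {
        // Time check happens on every write, alongside the size check.
        if (System.nanoTime() - deadlineNanos >= 0) {
            throw new LimitReachedException("max parse time exceeded");
        }
        int room = writeLimit - written;
        if (len > room) {
            // Keep the part that still fits, then stop.
            out.write(cbuf, off, room);
            written += room;
            throw new LimitReachedException("write limit reached");
        }
        out.write(cbuf, off, len);
        written += len;
    }

    @Override
    public void flush() {
        // Nothing to flush: output is buffered in memory.
    }

    @Override
    public void close() {
        // Nothing to release.
    }

    /** Text accepted before any limit was hit. */
    String textSoFar() {
        return out.toString();
    }
}
```

One caveat with this approach: the deadline is only checked when the parser
emits output, so a parser that hangs without writing anything would never
trigger it. That case still needs an external timeout.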