[ 
https://issues.apache.org/jira/browse/TIKA-4186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17811962#comment-17811962
 ] 

Tim Allison commented on TIKA-4186:
-----------------------------------

Thank you, [~itais], for opening this issue. Unless I misunderstand, this is 
exactly how tika-server is designed to handle OOMs, crashes, etc. The parsing 
happens in the same JVM and process as tika-server -- so if one thread hangs 
or hits an OOM, the full server process is terminated, along with all existing 
parsing threads and connections.

It is not ideal. If you are running multithreaded against a single server, you 
cannot know which call to tika-server caused the thing to crash, so your client 
has to have some retry logic.
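
A minimal sketch of what such client-side retry logic might look like, in 
Python. The helper names (call_with_retry, put_to_tika), the attempt count and 
the backoff are assumptions for illustration and are not part of Tika; only 
the URL and the headers come from the reproduction steps in this issue.

```python
import http.client
import time
import urllib.error
import urllib.request


def call_with_retry(send, max_attempts=3, backoff_seconds=2.0, sleep=time.sleep):
    """Call send() until it returns or max_attempts is exhausted.

    Only dropped or refused connections are retried, since that is what a
    client sees when the server process is restarted mid-request.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except urllib.error.HTTPError:
            # The server answered with an HTTP status; not a dropped connection.
            raise
        except (OSError, http.client.HTTPException) as err:
            last_error = err
            if attempt < max_attempts:
                # Linear backoff gives the server time to come back up.
                sleep(backoff_seconds * attempt)
    raise last_error


def put_to_tika(path, url="http://localhost:9998/tika", timeout_millis=None):
    """PUT a file to tika-server and return the extracted plain text."""
    with open(path, "rb") as fh:
        data = fh.read()
    headers = {"Accept": "text/plain"}
    if timeout_millis is not None:
        headers["X-Tika-Timeout-Millis"] = str(timeout_millis)
    req = urllib.request.Request(url, data=data, headers=headers, method="PUT")
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")
```

A caller would wrap each request, for example 
call_with_retry(lambda: put_to_tika("sample.pdf")). Because the client cannot 
tell which file triggered the restart, the file that caused it is retried as 
well, so the attempt count should stay small.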

While not ideal, the current behavior is slightly better than the default 
behavior in Tika 1.x, which was to hit an OOM and silently ignore it or to 
allow zombie threads/infinite loops or to simply crash without a restart. So, 
the tika 2.x behaviour is a stepwise improvement.

The tika /pipes and /async endpoints are the "next gen" alternative, and they 
run the parse in a forked process outside of the tika-server process so the 
server should always remain "on". 

If I've misunderstood the behavior you describe, please let me know. If you 
need help migrating to tika /pipes, please ask on the user list. The beginning 
of documentation is here: 
https://cwiki.apache.org/confluence/display/tika/tika-pipes

> tika server shut down innocent connections
> ------------------------------------------
>
>                 Key: TIKA-4186
>                 URL: https://issues.apache.org/jira/browse/TIKA-4186
>             Project: Tika
>          Issue Type: Bug
>          Components: tika-server
>    Affects Versions: 2.9.1
>         Environment: macOS running tika-server-standard-2.9.1.jar
>            Reporter: Itai
>            Priority: Major
>
> The Tika server shuts down and restarts in case of an issue (OOM, crash, 
> timeout).
> When the Tika server shuts down, all active connections are closed.
> A single connection can cause a side effect on other connections.
> This makes it hard to make parallel calls to a single server in a production 
> environment.
> How to reproduce?
>  - prepare a large sample.pdf file that takes more than 30 seconds to digest.
> run:
> java -jar ~/Downloads/tika-server-standard-2.9.1.jar
> —
> terminal 2 run:
> curl -v -T sample.pdf http://localhost:9998/tika --header "Accept: text/plain" --header "X-Tika-Timeout-Millis: 30001"
> —
> wait ~20-25 seconds
> —
> terminal 3 run:
> curl -v -T sample.pdf http://localhost:9998/tika --header "Accept: text/plain"
> Expected result:
>  - terminal 2 connection should timeout after 30 secs
>  - terminal 3 connection should not timeout and return successfully.
> Actual result:
>  - both curl commands fail after 30 secs.
> logs:
> ```
> INFO  [qtp486662053-44] 11:57:30,251 
> org.apache.tika.server.core.resource.TikaResource /tika (autodetecting type)
> WARN  [qtp486662053-44] 11:57:30,278 org.apache.pdfbox.pdfparser.BaseParser 
> Empty COSName at offset 628452
> ERROR [Thread-21] 11:57:37,566 
> org.apache.tika.server.core.ServerStatusWatcher Timeout task PARSE, millis 
> elapsed 30014; consider increasing the allowable time with the 
> <taskTimeoutMillis/> parameter or the X-Tika-Timeout-Millis header
> WARN  [Thread-21] 11:57:37,573 
> org.apache.tika.server.core.ServerStatusWatcher forked process observed 
> TIMEOUT and is shutting down.
> INFO  [Thread-21] 11:57:37,613 
> org.apache.tika.server.core.ServerStatusWatcher Shutting down forked process 
> with status: TIMEOUT
> INFO  [pool-2-thread-1] 11:57:38,039 
> org.apache.tika.server.core.TikaServerWatchDog forked process exited with 
> exit value 3
> INFO  [main] 11:57:39,340 org.apache.tika.server.core.TikaServerProcess 
> Starting Apache Tika 2.9.1 server
> INFO  [main] 11:57:39,564 org.apache.tika.server.core.TikaServerProcess 
> loading resource from SPI: class 
> org.apache.tika.server.standard.resource.XMPMetadataResource
> Jan 29, 2024 11:57:39 AM org.apache.cxf.endpoint.ServerImpl initDestination
> INFO: Setting the server's publish address to be http://localhost:9998/
> INFO  [main] 11:57:39,747 org.eclipse.jetty.util.log Logging initialized 
> @1640ms to org.eclipse.jetty.util.log.Slf4jLog
> INFO  [main] 11:57:39,790 org.eclipse.jetty.server.Server 
> jetty-9.4.53.v20231009; built: 2023-10-09T12:29:09.265Z; git: 
> 27bde00a0b95a1d5bbee0eae7984f891d2d0f8c9; jvm 21.0.1
> INFO  [main] 11:57:39,833 org.eclipse.jetty.server.AbstractConnector Started 
> ServerConnector@48bfb884{HTTP/1.1, (http/1.1)}{localhost:9998}
> INFO  [main] 11:57:39,833 org.eclipse.jetty.server.Server Started @1729ms
> ```
> —
> ```
>  *   Trying 127.0.0.1:9998...
>  * Connected to localhost (127.0.0.1) port 9998 (#0)
> > PUT /tika HTTP/1.1
> > Host: localhost:9998
> > User-Agent: curl/7.85.0
> > Accept: text/plain
> > Content-Length: 636978
> > Expect: 100-continue
> >
>  * Mark bundle as not supporting multiuse
> < HTTP/1.1 100 Continue
>  * We are completely uploaded and fine
>  * Empty reply from server
>  * Closing connection 0
> curl: (52) Empty reply from server
> ```
>  
>  


