[
https://issues.apache.org/jira/browse/TIKA-4186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Itai updated TIKA-4186:
---
Description:
The Tika server shuts down and restarts in case of an issue (OOM, crash,
timeout).
When the Tika server shuts down, all active connections are closed.
As a result, a single problematic request can break other, unrelated
connections.
This makes it hard to make parallel calls to a single server in a production
environment.
How to reproduce?
- prepare a large sample.pdf file that takes more than 30 seconds to digest.
terminal 1 run:
java -jar ~/Downloads/tika-server-standard-2.9.1.jar
---
terminal 2 run:
curl -v -T sample.pdf http://localhost:9998/tika --header "Accept: text/plain" \
  --header "X-Tika-Timeout-Millis: 30001"
---
wait ~20-25 seconds
---
terminal 3 run:
curl -v -T sample.pdf http://localhost:9998/tika --header "Accept: text/plain"
Expected result:
- terminal 2 connection should time out after 30 secs
- terminal 3 connection should not time out and should return successfully.
Actual result:
- both curl commands fail after 30 secs.
logs:
```
INFO [qtp486662053-44] 11:57:30,251
org.apache.tika.server.core.resource.TikaResource /tika (autodetecting type)
WARN [qtp486662053-44] 11:57:30,278 org.apache.pdfbox.pdfparser.BaseParser
Empty COSName at offset 628452
ERROR [Thread-21] 11:57:37,566 org.apache.tika.server.core.ServerStatusWatcher
Timeout task PARSE, millis elapsed 30014; consider increasing the allowable
time with the parameter or the X-Tika-Timeout-Millis header
WARN [Thread-21] 11:57:37,573 org.apache.tika.server.core.ServerStatusWatcher
forked process observed TIMEOUT and is shutting down.
INFO [Thread-21] 11:57:37,613 org.apache.tika.server.core.ServerStatusWatcher
Shutting down forked process with status: TIMEOUT
INFO [pool-2-thread-1] 11:57:38,039
org.apache.tika.server.core.TikaServerWatchDog forked process exited with exit
value 3
INFO [main] 11:57:39,340 org.apache.tika.server.core.TikaServerProcess
Starting Apache Tika 2.9.1 server
INFO [main] 11:57:39,564 org.apache.tika.server.core.TikaServerProcess loading
resource from SPI: class
org.apache.tika.server.standard.resource.XMPMetadataResource
Jan 29, 2024 11:57:39 AM org.apache.cxf.endpoint.ServerImpl initDestination
INFO: Setting the server's publish address to be [http://localhost:9998/]
INFO [main] 11:57:39,747 org.eclipse.jetty.util.log Logging initialized
@1640ms to org.eclipse.jetty.util.log.Slf4jLog
INFO [main] 11:57:39,790 org.eclipse.jetty.server.Server
jetty-9.4.53.v20231009; built: 2023-10-09T12:29:09.265Z; git:
27bde00a0b95a1d5bbee0eae7984f891d2d0f8c9; jvm 21.0.1
INFO [main] 11:57:39,833 org.eclipse.jetty.server.AbstractConnector Started
ServerConnector@48bfb884{HTTP/1.1, (http/1.1)}{localhost:9998}
INFO [main] 11:57:39,833 org.eclipse.jetty.server.Server Started @1729ms
```
---
```
* Trying 127.0.0.1:9998...
* Connected to localhost (127.0.0.1) port 9998 (#0)
> PUT /tika HTTP/1.1
> Host: localhost:9998
> User-Agent: curl/7.85.0
> Accept: text/plain
> Content-Length: 636978
> Expect: 100-continue
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 100 Continue
* We are completely uploaded and fine
* Empty reply from server
* Closing connection 0
curl: (52) Empty reply from server
```
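The reproduction steps above could also be scripted so that the timing between the two requests is repeatable. The following is only a minimal sketch, not part of Tika: the helper names `put_file` and `reproduce` are hypothetical, and it assumes a server already running at http://localhost:9998 and the same sample.pdf as in the steps. It sends the first PUT with the X-Tika-Timeout-Millis header, waits, sends a second PUT without it, and records how each request ended.

```python
import http.client
import threading
import time
import urllib.error
import urllib.request

# Talk to the server directly, ignoring any proxy settings in the environment.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def put_file(url, data, headers, label, results):
    """PUT `data` to `url` and record the HTTP status or the failure type."""
    request = urllib.request.Request(url, data=data, headers=headers, method="PUT")
    started = time.monotonic()
    try:
        with _OPENER.open(request) as response:
            response.read()
            outcome = f"HTTP {response.status}"
    except urllib.error.HTTPError as exc:
        outcome = f"HTTP {exc.code}"
    except (http.client.HTTPException, OSError) as exc:
        # A connection dropped without a response (curl's "Empty reply from
        # server") is expected to end up here as a connection-level error.
        outcome = f"failed: {type(exc).__name__}"
    results[label] = (outcome, round(time.monotonic() - started, 1))


def reproduce(url, data, delay_seconds=20.0, timeout_millis=30001):
    """Send two staggered PUT requests, mirroring terminals 2 and 3."""
    results = {}
    plain = {"Accept": "text/plain"}
    with_timeout = {**plain, "X-Tika-Timeout-Millis": str(timeout_millis)}
    first = threading.Thread(
        target=put_file, args=(url, data, with_timeout, "terminal 2", results))
    second = threading.Thread(
        target=put_file, args=(url, data, plain, "terminal 3", results))
    first.start()
    time.sleep(delay_seconds)  # the "wait ~20-25 seconds" step
    second.start()
    first.join()
    second.join()
    return results
```

To use it, read sample.pdf as bytes and call `reproduce("http://localhost:9998/tika", payload)`. The returned dictionary maps each label to an (outcome, elapsed seconds) pair, so if the behaviour described here occurs, both entries should show a failure rather than only the "terminal 2" entry.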
was:
The Tika server shuts down and restarts in case of an issue (OOM, crash,
timeout).
When tika server shut down, all active connections are being closed.
A single connection can cause a side effect on other connections.
This makes it hard to make parallel calls to a single server in a production
environment.
How to reproduce?
- prepare a large sample.pdf file that takes more then 30secs to digest.
run:
java -jar ~/Downloads/tika-server-standard-2.9.1.jar
---
terminal 2 run:
curl -v -T sample.pdf http://localhost:9998/tika --header "Accept: text/plain"
--header "X-Tika-Timeout-Millis: 30001"
---
wait ~20-25 seconds
---
terminal 3 run:
curl -v -T sample.pdf http://localhost:9998/tika --header "Accept: text/plain"
Expected result:
- terminal 2 connection should timeout after 30 secs
- terminal 3 connection should not timeout and return successfuly.
Actual result:
- both curl commends fails after 30 secs.
logs:
```
INFO [qtp486662053-44] 11:57:30,251
org.apache.tika.server.core.resource.TikaResource /tika (autodetecting type)
WARN [qtp486662053-44] 11:57:30,278 org.apache.pdfbox.pdfparser.BaseParser
Empty COSName at offset 628452
ERROR [Thread-21] 11:57:37,566 org.apache.tika.server.core.ServerStatusWatcher
Timeout task PARSE, millis elapsed 30014; consider increasing the allowable
time with the parameter or the X-Tika-Timeout-Millis header
WARN [Thread-21] 11:57:37,573 org.apache.tika.server.core.ServerStatusWatcher
forked process observed TIMEOUT and is shutting down.
INFO [Thread-21] 11:57:37,613 org.apache.tika.server.core.ServerStatusWatcher
Shutting down forked