sebastian-nagel opened a new pull request, #1900: URL: https://github.com/apache/stormcrawler/pull/1900
To force cancellation of the request:
- set the [OkHttp call timeout](https://square.github.io/okhttp/5.x/okhttp/okhttp3/-ok-http-client/-builder/call-timeout.html) to `topology.message.timeout.secs` (if not -1)

Additional changes:
- set the TrimmedReason to `TIME` if OkHttp throws an `InterruptedIOException`
- log the reason why the response is trimmed
- add a type parameter to `MutableObject` instances
- replace calls to the deprecated method `getValue()`

So far, the solution has only been verified using the Protocol main method:

```
$> java -cp .../stormcrawler-core-3.5.2-SNAPSHOT.jar:... \
     org.apache.stormcrawler.protocol.okhttp.HttpProtocol \
     -f /tmp/crawler-conf-test.yaml http://cbhjhlccfkqdpknyu.org/
...
[main] INFO org.apache.stormcrawler.protocol.okhttp.HttpProtocol - Using protocol versions: [h2, http/1.1]
[main] INFO org.apache.stormcrawler.protocol.okhttp.HttpProtocol - Using connection pool with max. 5 idle connections and 300 sec. connection keep-alive time
[Thread-0] WARN org.apache.stormcrawler.protocol.okhttp.HttpProtocol - HTTP content trimmed to 10 (reason: TIME)
[Thread-0] WARN crawlercommons.robots.SimpleRobotRulesParser - Problem processing robots.txt for http://cbhjhlccfkqdpknyu.org/
[Thread-0] WARN crawlercommons.robots.SimpleRobotRulesParser - Unknown line in robots.txt file (size 10): DQEPigDriE
[Thread-0] WARN org.apache.stormcrawler.protocol.okhttp.HttpProtocol - HTTP content trimmed to 10 (reason: TIME)
http://cbhjhlccfkqdpknyu.org/
robots allowed: true
robots requests: 1
sitemaps identified: 0
date: Thu, 07 May 2026 14:02:24 GMT
server: nginx/1.21.6
transfer-encoding: chunked
_protocol_versions_: http/1.1
metrics.dns.resolution.msec: 4
http.trimmed.reason: time
keep-alive: timeout=20
_request.headers_: GET / HTTP/1.1
User-Agent: MyTestBot/3.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en-gb,en;q=0.7,*;q=0.3
Accept-Encoding: zstd, br, gzip
Host: cbhjhlccfkqdpknyu.org
Connection: Keep-Alive

http.trimmed: true
_request.time_: 1778162544171
content-type: application/octet-stream
connection: keep-alive
_response.ip_: 216.218.185.162
_response.headers_: HTTP/1.1 200 OK
Server: nginx/1.21.6
Date: Thu, 07 May 2026 14:02:24 GMT
Content-Type: application/octet-stream
Transfer-Encoding: chunked
Connection: keep-alive
Keep-Alive: timeout=20

status code: 200
content length: 10
fetched in : 60002 msec
```
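
For readers who have not looked at the diff, the two timeout-related rules from the description can be sketched roughly as follows. This is a minimal illustration, not the code in this pull request: the class `CallTimeoutSketch`, the enum `Reason`, and both method names are hypothetical, and the sketch uses only the JDK so that it runs without OkHttp on the classpath. In the actual protocol implementation, the resulting duration would presumably be handed to `OkHttpClient.Builder.callTimeout(...)`.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.time.Duration;
import java.util.Optional;

/** Hypothetical sketch of the timeout rules described in the PR text. */
final class CallTimeoutSketch {

    /** Hypothetical stand-in for the trimmed-reason value named in the PR. */
    enum Reason { TIME }

    private CallTimeoutSketch() {}

    /**
     * Maps the value of topology.message.timeout.secs to a call timeout.
     * The PR applies the timeout only "if not -1", so -1 yields no timeout.
     */
    static Optional<Duration> callTimeout(int messageTimeoutSecs) {
        if (messageTimeoutSecs == -1) {
            return Optional.empty();
        }
        return Optional.of(Duration.ofSeconds(messageTimeoutSecs));
    }

    /**
     * Classifies an exception raised while reading the response body.
     * An InterruptedIOException is reported as TIME; any other exception
     * is left to whatever handling already exists (empty result here).
     */
    static Optional<Reason> trimmedReason(IOException e) {
        if (e instanceof InterruptedIOException) {
            return Optional.of(Reason.TIME);
        }
        return Optional.empty();
    }
}
```

Note that `java.net.SocketTimeoutException` is a subclass of `InterruptedIOException`, so under this sketch a socket read timeout would also be classified as `TIME`.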
