rzo1 commented on code in PR #1843:
URL: https://github.com/apache/stormcrawler/pull/1843#discussion_r2999827818
##########
docs/src/main/asciidoc/configuration.adoc:
##########
@@ -196,20 +196,26 @@ implementation.
 | http.proxy.pass | - | Proxy password.
 | http.proxy.port | 8080 | Proxy port.
 | http.proxy.user | - | Proxy username.
-| http.robots.403.allow | true | Defines behavior when robots.txt returns HTTP 403.
+| http.retry.on.connection.failure | true | Retry fetching on connection failure.
+| http.robots.403.allow | true | Allow crawling when robots.txt returns HTTP 403.
+| http.robots.5xx.allow | false | Allow crawling when robots.txt returns a server error (5xx).
 | http.robots.agents | '' | Additional user-agent strings for interpreting robots.txt.
-| http.robots.file.skip | false | Ignore robots.txt rules (1.17+).
+| http.robots.content.limit | -1 | Maximum bytes to fetch for robots.txt. -1 uses http.content.limit.
+| http.robots.file.skip | false | Ignore robots.txt rules entirely.
+| http.robots.headers.skip | false | Ignore robots directives from HTTP headers.
+| http.robots.meta.skip | false | Ignore robots directives from HTML meta tags.
 | http.skip.robots | false | Deprecated (replaced by http.robots.file.skip).
+| robots.noFollow.strict | true | If true, remove all outlinks from pages marked as noFollow.
 | http.store.headers | false | Whether to store response headers.
-| http.store.responsetime | true | Not yet implemented — store response time in Metadata.
 | http.timeout | 10000 | Connection timeout (ms).
 | http.use.cookies | false | Use cookies in subsequent requests.
 | https.protocol.implementation | org.apache.stormcrawler.protocol.httpclient.HttpProtocol | HTTPS Protocol implementation.
 | partition.url.mode | byHost | Defines how URLs are partitioned: byHost, byDomain, or byIP.
-| protocols | http,https | Supported protocols.
-| redirections.allowed | true | Allow URL redirects.
+| protocols | http,https,file | Supported protocols.
+| http.allow.redirects | false | Allow URL redirects.

Review Comment:
   Did that now.
