sebastian-nagel commented on code in PR #1843:
URL: https://github.com/apache/stormcrawler/pull/1843#discussion_r2999183668


##########
docs/src/main/asciidoc/configuration.adoc:
##########
@@ -180,6 +179,7 @@ is defined.
 | fetcher.server.min.delay | 0 | Delay between crawls for queues with >1 
thread. Ignores robots.txt.
 | fetcher.threads.number | 10 | Total concurrent threads fetching pages. 
Adjust carefully based on system capacity.
 | fetcher.threads.per.queue | 1 | Default number of threads per queue. Can be 
overridden.
+| fetcher.threads.start.delay | 10 | Delay (seconds) before starting fetcher 
threads.

Review Comment:
   (in milliseconds)
   
   > Delay (milliseconds) between starting each successive fetcher thread. 
Prevents DNS or network resources from being overloaded during fetcher startup, 
when all threads would otherwise start requesting their first pages 
simultaneously.



##########
docs/src/main/asciidoc/configuration.adoc:
##########
@@ -196,20 +196,26 @@ implementation.
 | http.proxy.pass | - | Proxy password.
 | http.proxy.port | 8080 | Proxy port.
 | http.proxy.user | - | Proxy username.
-| http.robots.403.allow | true | Defines behavior when robots.txt returns HTTP 
403.
+| http.retry.on.connection.failure | true | Retry fetching on connection 
failure.

Review Comment:
   `http.retry.on.connection.failure` is supported only by the OkHttp protocol. 
Maybe link to 
<https://square.github.io/okhttp/5.x/okhttp/okhttp3/-ok-http-client/-builder/retry-on-connection-failure.html>?
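   For context, the setting corresponds to OkHttp's `retryOnConnectionFailure` builder option linked above. A minimal, illustrative config fragment (the okhttp protocol class name is an assumption based on the package naming used elsewhere in this documentation, and the value shown is an example, not a recommendation):
   
   ```yaml
   # Illustrative StormCrawler config fragment -- example values only.
   # http.retry.on.connection.failure is honored only by the OkHttp protocol
   # and maps to OkHttpClient.Builder#retryOnConnectionFailure.
   # The protocol class name below is assumed, not taken from this PR.
   http.protocol.implementation: "org.apache.stormcrawler.protocol.okhttp.HttpProtocol"
   http.retry.on.connection.failure: false   # disable OkHttp's silent retries
   ```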



##########
docs/src/main/asciidoc/configuration.adoc:
##########
@@ -196,20 +196,26 @@ implementation.
 | http.proxy.pass | - | Proxy password.
 | http.proxy.port | 8080 | Proxy port.
 | http.proxy.user | - | Proxy username.
-| http.robots.403.allow | true | Defines behavior when robots.txt returns HTTP 
403.
+| http.retry.on.connection.failure | true | Retry fetching on connection 
failure.
+| http.robots.403.allow | true | Allow crawling when robots.txt returns HTTP 
403.
+| http.robots.5xx.allow | false | Allow crawling when robots.txt returns a 
server error (5xx).
 | http.robots.agents | '' | Additional user-agent strings for interpreting 
robots.txt.
-| http.robots.file.skip | false | Ignore robots.txt rules (1.17+).
+| http.robots.content.limit | -1 | Maximum bytes to fetch for robots.txt. -1 
uses http.content.limit.
+| http.robots.file.skip | false | Ignore robots.txt rules entirely.
+| http.robots.headers.skip | false | Ignore robots directives from HTTP 
headers.
+| http.robots.meta.skip | false | Ignore robots directives from HTML meta tags.
 | http.skip.robots | false | Deprecated (replaced by http.robots.file.skip).
+| robots.noFollow.strict | true | If true, remove all outlinks from pages 
marked as noFollow.
 | http.store.headers | false | Whether to store response headers.
-| http.store.responsetime | true | Not yet implemented — store response time 
in Metadata.
 | http.timeout | 10000 | Connection timeout (ms).
 | http.use.cookies | false | Use cookies in subsequent requests.
 | https.protocol.implementation | 
org.apache.stormcrawler.protocol.httpclient.HttpProtocol | HTTPS Protocol
 implementation.
 | partition.url.mode | byHost | Defines how URLs are partitioned: byHost, 
byDomain, or byIP.
-| protocols | http,https | Supported protocols.
-| redirections.allowed | true | Allow URL redirects.
+| protocols | http,https,file | Supported protocols.
+| http.allow.redirects | false | Allow URL redirects.

Review Comment:
   Maybe we should add a dedicated subsection to "protocols" that explains the 
behavior regarding redirects:
   
   ### Following Redirects
   When following HTTP redirects you have three options:
   1. By default, StormCrawler emits the redirect target URL to the status 
stream. URL filter and normalization rules are applied to the target URLs, the 
crawler verifies that the target URL is allowed per robots.txt, and it is 
ensured that the redirect target is not fetched multiple times (URLs are 
deduplicated in the status index).
   2. If `redirections.allowed` is false, the redirect target URLs are not sent 
to the status stream. That is, redirects are ignored.
   3. Redirects are followed immediately in the HTTP client and the target URLs 
are not emitted to the status stream. This is the default behavior for 
browser-based protocols (Selenium and Playwright), but it is also supported by 
the OkHttp protocol if `http.allow.redirects` is set to true.
   
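   The three options could be sketched as config fragments (keys taken from the table in this PR; values illustrative):
   
   ```yaml
   # Option 1 (default): emit redirect target URLs to the status stream,
   # where they are filtered, deduplicated and checked against robots.txt.
   redirections.allowed: true
   http.allow.redirects: false
   
   # Option 2: ignore redirects entirely
   # (target URLs are not sent to the status stream).
   # redirections.allowed: false
   
   # Option 3 (OkHttp and browser-based protocols): follow redirects
   # inside the HTTP client; target URLs bypass the status stream.
   # http.allow.redirects: true
   ```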



##########
docs/src/main/asciidoc/configuration.adoc:
##########
@@ -196,20 +196,26 @@ implementation.
 | http.proxy.pass | - | Proxy password.
 | http.proxy.port | 8080 | Proxy port.
 | http.proxy.user | - | Proxy username.
-| http.robots.403.allow | true | Defines behavior when robots.txt returns HTTP 
403.
+| http.retry.on.connection.failure | true | Retry fetching on connection 
failure.
+| http.robots.403.allow | true | Allow crawling when robots.txt returns HTTP 
403.
+| http.robots.5xx.allow | false | Allow crawling when robots.txt returns a 
server error (5xx).
 | http.robots.agents | '' | Additional user-agent strings for interpreting 
robots.txt.
-| http.robots.file.skip | false | Ignore robots.txt rules (1.17+).
+| http.robots.content.limit | -1 | Maximum bytes to fetch for robots.txt. -1 
uses http.content.limit.
+| http.robots.file.skip | false | Ignore robots.txt rules entirely.
+| http.robots.headers.skip | false | Ignore robots directives from HTTP 
headers.
+| http.robots.meta.skip | false | Ignore robots directives from HTML meta tags.
 | http.skip.robots | false | Deprecated (replaced by http.robots.file.skip).
+| robots.noFollow.strict | true | If true, remove all outlinks from pages 
marked as noFollow.
 | http.store.headers | false | Whether to store response headers.
-| http.store.responsetime | true | Not yet implemented — store response time 
in Metadata.
 | http.timeout | 10000 | Connection timeout (ms).
 | http.use.cookies | false | Use cookies in subsequent requests.
 | https.protocol.implementation | 
org.apache.stormcrawler.protocol.httpclient.HttpProtocol | HTTPS Protocol
 implementation.
 | partition.url.mode | byHost | Defines how URLs are partitioned: byHost, 
byDomain, or byIP.
-| protocols | http,https | Supported protocols.
-| redirections.allowed | true | Allow URL redirects.
+| protocols | http,https,file | Supported protocols.
+| http.allow.redirects | false | Allow URL redirects.

Review Comment:
   > (OkHttp only) Follow HTTP redirects immediately in the HTTP protocol 
client. Note: when redirects are followed immediately, the target URLs are not 
emitted to the status stream, not filtered, not deduplicated, and not checked 
against robots.txt.
   
   
   



##########
docs/src/main/asciidoc/configuration.adoc:
##########
@@ -196,20 +196,26 @@ implementation.
 | http.proxy.pass | - | Proxy password.
 | http.proxy.port | 8080 | Proxy port.
 | http.proxy.user | - | Proxy username.
-| http.robots.403.allow | true | Defines behavior when robots.txt returns HTTP 
403.
+| http.retry.on.connection.failure | true | Retry fetching on connection 
failure.
+| http.robots.403.allow | true | Allow crawling when robots.txt returns HTTP 
403.
+| http.robots.5xx.allow | false | Allow crawling when robots.txt returns a 
server error (5xx).
 | http.robots.agents | '' | Additional user-agent strings for interpreting 
robots.txt.
-| http.robots.file.skip | false | Ignore robots.txt rules (1.17+).
+| http.robots.content.limit | -1 | Maximum bytes to fetch for robots.txt. -1 
uses http.content.limit.
+| http.robots.file.skip | false | Ignore robots.txt rules entirely.
+| http.robots.headers.skip | false | Ignore robots directives from HTTP 
headers.
+| http.robots.meta.skip | false | Ignore robots directives from HTML meta tags.
 | http.skip.robots | false | Deprecated (replaced by http.robots.file.skip).
+| robots.noFollow.strict | true | If true, remove all outlinks from pages 
marked as noFollow.
 | http.store.headers | false | Whether to store response headers.
-| http.store.responsetime | true | Not yet implemented — store response time 
in Metadata.
 | http.timeout | 10000 | Connection timeout (ms).
 | http.use.cookies | false | Use cookies in subsequent requests.
 | https.protocol.implementation | 
org.apache.stormcrawler.protocol.httpclient.HttpProtocol | HTTPS Protocol
 implementation.
 | partition.url.mode | byHost | Defines how URLs are partitioned: byHost, 
byDomain, or byIP.
-| protocols | http,https | Supported protocols.

Review Comment:
   Defined in StatusEmitterBolt and used by derived classes (FetcherBolt, etc.):
   ```
   | redirections.allowed | true | If true, emit redirect target URLs as 
"outlinks" to the status stream. If false, redirects are not followed. See also 
`http.allow.redirects`.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
