sebastian-nagel commented on code in PR #919:
URL: https://github.com/apache/nutch/pull/919#discussion_r3369518641
##########
src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java:
##########
@@ -170,7 +170,8 @@ public BaseRobotRules getRobotRulesSet(Protocol http, URL
url,
LOG.debug("Following robots.txt redirect: {} -> {}",
robotsUrlRedir,
redirectionLocation);
try {
- robotsUrlRedir = new URL(robotsUrlRedir, redirectionLocation);
+ robotsUrlRedir = ((HttpBase) http).resolveUrl(robotsUrlRedir,
Review Comment:
Since the Protocol class provides a default implementation (a fall-back to
the Java URL class), there should be no cast to HttpBase necessary to call the
overridden method.
##########
src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java:
##########
@@ -121,6 +121,17 @@ public class HttpResponse implements Response {
client.getParams().setParameter("http.useragent", http.getUserAgent());
// NUTCH-1941
code = client.executeMethod(get);
+ // When followRedirects=true HC3 walks the redirect chain internally;
+ // getURI() returns the final URI. Capture it so getUrl() honors the
+ // contract — without this, robots.txt redirects via this plugin
+ // report the original URL even though a different URL was fetched.
+ try {
+ this.url = new URL(get.getURI().toString());
Review Comment:
Good catch. Of course, it would also affect other fetches, not only the
robots.txt - although the robots.txt fetch is the only occurrence where
`getResponse` is called with `followRedirects = true`.
Some protocol implementations, such as protocol-http or protocol-okhttp,
ignore the parameter `followRedirects` anyway, and delegate the control over
redirects to the caller.
##########
src/plugin/protocol-okhttp/src/java/org/apache/nutch/protocol/okhttp/OkHttpResponse.java:
##########
@@ -102,7 +102,14 @@ public OkHttpResponse(OkHttp okhttp, URL url, CrawlDatum
datum)
}
Request request = rb.build();
- okhttp3.Call call = okhttp.getClient(url).newCall(request);
+
+ // OkHttp parsed the URL via HttpUrl; that's the form actually going on
+ // the wire (IDN→punycode, repeated-slash repair, host lowercasing).
+ this.url = request.url().url();
+ if (LOG.isDebugEnabled() && !this.url.toString().equals(url.toString())) {
Review Comment:
Good not to call
[URL.equals(url)](https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/net/URL.html#equals(java.lang.Object))
which would trigger a DNS lookup.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]