[ 
https://issues.apache.org/jira/browse/NUTCH-3173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18086730#comment-18086730
 ] 

ASF GitHub Bot commented on NUTCH-3173:
---------------------------------------

sebastian-nagel commented on code in PR #919:
URL: https://github.com/apache/nutch/pull/919#discussion_r3369518641


##########
src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java:
##########
@@ -170,7 +170,8 @@ public BaseRobotRules getRobotRulesSet(Protocol http, URL 
url,
             LOG.debug("Following robots.txt redirect: {} -> {}", 
robotsUrlRedir,
                 redirectionLocation);
             try {
-              robotsUrlRedir = new URL(robotsUrlRedir, redirectionLocation);
+              robotsUrlRedir = ((HttpBase) http).resolveUrl(robotsUrlRedir,

Review Comment:
   Since the Protocol class provides a default implementation (a fall-back to 
the Java URL class), there should be no cast to HttpBase necessary to call the 
overridden method.



##########
src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java:
##########
@@ -121,6 +121,17 @@ public class HttpResponse implements Response {
       client.getParams().setParameter("http.useragent", http.getUserAgent()); 
// NUTCH-1941
       code = client.executeMethod(get);
 
+      // When followRedirects=true HC3 walks the redirect chain internally;
+      // getURI() returns the final URI. Capture it so getUrl() honors the
+      // contract — without this, robots.txt redirects via this plugin
+      // report the original URL even though a different URL was fetched.
+      try {
+        this.url = new URL(get.getURI().toString());

Review Comment:
   Good catch. Of course, it would also affect other fetches, not only the 
robots.txt - although the robots.txt fetch is the only occurrence where 
`getResponse` is called with `followRedirects = true`.
   
   Some protocol implementations, such as protocol-http or protocol-okhttp, 
ignore the parameter `followRedirects` anyway, and delegate the control over 
redirects to the caller.



##########
src/plugin/protocol-okhttp/src/java/org/apache/nutch/protocol/okhttp/OkHttpResponse.java:
##########
@@ -102,7 +102,14 @@ public OkHttpResponse(OkHttp okhttp, URL url, CrawlDatum 
datum)
     }
 
     Request request = rb.build();
-    okhttp3.Call call = okhttp.getClient(url).newCall(request);
+
+    // OkHttp parsed the URL via HttpUrl; that's the form actually going on
+    // the wire (IDN→punycode, repeated-slash repair, host lowercasing).
+    this.url = request.url().url();
+    if (LOG.isDebugEnabled() && !this.url.toString().equals(url.toString())) {

Review Comment:
   Good not to call 
[URL.equals(url)](https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/net/URL.html#equals(java.lang.Object))
 which would trigger a DNS lookup.





> protocol-okhttp: store OkHttp's internal URL in response metadata
> -----------------------------------------------------------------
>
>                 Key: NUTCH-3173
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3173
>             Project: Nutch
>          Issue Type: Improvement
>          Components: plugin, protocol
>    Affects Versions: 1.23
>            Reporter: Sebastian Nagel
>            Priority: Minor
>             Fix For: 1.23
>
>
> OkHttp uses its 
> [HttpUrl|https://square.github.io/okhttp/5.x/okhttp/okhttp3/-http-url/index.html]
>  for HTTP requests. There are some differences between HttpURl and 
> java.net.URL resp. java.net.URI. And the HttpUrl.parse may parse a URL string 
> differently than Java's URL class.
> It would be good to store the stringified HttpUrl in the response metadata, 
> at least, if it differs from the original URL string. The 
> [Request|https://square.github.io/okhttp/5.x/okhttp/okhttp3/-request/index.html]
>  holds the HttpUrl object.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to