[
https://issues.apache.org/jira/browse/NUTCH-3173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18086730#comment-18086730
]
ASF GitHub Bot commented on NUTCH-3173:
---------------------------------------
sebastian-nagel commented on code in PR #919:
URL: https://github.com/apache/nutch/pull/919#discussion_r3369518641
##########
src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java:
##########
@@ -170,7 +170,8 @@ public BaseRobotRules getRobotRulesSet(Protocol http, URL
url,
LOG.debug("Following robots.txt redirect: {} -> {}",
robotsUrlRedir,
redirectionLocation);
try {
- robotsUrlRedir = new URL(robotsUrlRedir, redirectionLocation);
+ robotsUrlRedir = ((HttpBase) http).resolveUrl(robotsUrlRedir,
Review Comment:
Since the Protocol class provides a default implementation (a fall-back to
the Java URL class), there should be no cast to HttpBase necessary to call the
overridden method.
##########
src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java:
##########
@@ -121,6 +121,17 @@ public class HttpResponse implements Response {
client.getParams().setParameter("http.useragent", http.getUserAgent());
// NUTCH-1941
code = client.executeMethod(get);
+ // When followRedirects=true HC3 walks the redirect chain internally;
+ // getURI() returns the final URI. Capture it so getUrl() honors the
+ // contract — without this, robots.txt redirects via this plugin
+ // report the original URL even though a different URL was fetched.
+ try {
+ this.url = new URL(get.getURI().toString());
Review Comment:
Good catch. Of course, it would also affect other fetches, not only the
robots.txt - although the robots.txt fetch is the only occurrence where
`getResponse` is called with `followRedirects = true`.
Some protocol implementations, such as protocol-http or protocol-okhttp,
ignore the parameter `followRedirects` anyway, and delegate the control over
redirects to the caller.
##########
src/plugin/protocol-okhttp/src/java/org/apache/nutch/protocol/okhttp/OkHttpResponse.java:
##########
@@ -102,7 +102,14 @@ public OkHttpResponse(OkHttp okhttp, URL url, CrawlDatum
datum)
}
Request request = rb.build();
- okhttp3.Call call = okhttp.getClient(url).newCall(request);
+
+ // OkHttp parsed the URL via HttpUrl; that's the form actually going on
+ // the wire (IDN→punycode, repeated-slash repair, host lowercasing).
+ this.url = request.url().url();
+ if (LOG.isDebugEnabled() && !this.url.toString().equals(url.toString())) {
Review Comment:
Good not to call
[URL.equals(url)](https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/net/URL.html#equals(java.lang.Object))
which would trigger a DNS lookup.
> protocol-okhttp: store OkHttp's internal URL in response metadata
> -----------------------------------------------------------------
>
> Key: NUTCH-3173
> URL: https://issues.apache.org/jira/browse/NUTCH-3173
> Project: Nutch
> Issue Type: Improvement
> Components: plugin, protocol
> Affects Versions: 1.23
> Reporter: Sebastian Nagel
> Priority: Minor
> Fix For: 1.23
>
>
> OkHttp uses its
> [HttpUrl|https://square.github.io/okhttp/5.x/okhttp/okhttp3/-http-url/index.html]
> for HTTP requests. There are some differences between HttpURl and
> java.net.URL resp. java.net.URI. And the HttpUrl.parse may parse a URL string
> differently than Java's URL class.
> It would be good to store the stringified HttpUrl in the response metadata,
> at least, if it differs from the original URL string. The
> [Request|https://square.github.io/okhttp/5.x/okhttp/okhttp3/-request/index.html]
> holds the HttpUrl object.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)