Github user jnioche commented on a diff in the pull request:

    https://github.com/apache/nutch/pull/55#discussion_r39863257
  
    --- Diff: src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java ---
    @@ -193,34 +197,54 @@ public HttpResponse(HttpBase http, URL url, CrawlDatum datum)
           reqStr.append("\r\n");
     
           if (http.isIfModifiedSinceEnabled() && datum.getModifiedTime() > 0) {
    -        reqStr.append("If-Modified-Since: "
    -            + HttpDateFormat.toString(datum.getModifiedTime()));
    +        reqStr.append("If-Modified-Since: " + HttpDateFormat
    +            .toString(datum.getModifiedTime()));
             reqStr.append("\r\n");
           }
           reqStr.append("\r\n");
     
    +      // store the request in the metadata?
    +      if (conf.getBoolean("store.http.request", false) == true) {
    +        headers.add("_request_", reqStr.toString());
    +      }
    +
           byte[] reqBytes = reqStr.toString().getBytes();
     
           req.write(reqBytes);
           req.flush();
     
           PushbackInputStream in = // process response
    -      new PushbackInputStream(new BufferedInputStream(socket.getInputStream(),
    -          Http.BUFFER_SIZE), Http.BUFFER_SIZE);
    +          new PushbackInputStream(
    +              new BufferedInputStream(socket.getInputStream(),
    +                  Http.BUFFER_SIZE), Http.BUFFER_SIZE);
     
           StringBuffer line = new StringBuffer();
     
    +      // store the http headers verbatim
    +      if (conf.getBoolean("store.http.headers", false) == true) {
    +        httpHeaders = new StringBuffer();
    +      }
    +
    +      headers.add("nutch.fetch.time", Long.toString(datum.getFetchTime()));
    --- End diff --
    
    The value of datum.getFetchTime() is correct in the output of the fetcher step, when you access the fetch datum, but I don't think that is the case at this point in the code.
    
    > it is executed in the HttpResponse class (right after the fetcher gets executed)
    
    Not after, but right in the middle of the fetcher's work.
    
    The fetch time is set to the right value in the output method of FetcherThread [https://github.com/apache/nutch/blob/8397611b49de4aac408806765191fc796ba4b15f/src/java/org/apache/nutch/fetcher/FetcherThread.java#L528], but that happens AFTER the protocol implementation has fetched the content.
    
    In short, it is not clear what the value is at this point in the code, but it's unlikely to be correct. Just use System.currentTimeMillis().
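
A rough sketch of that suggestion, not the actual patch: the class FetchTimeSketch, the helper buildHeaders and the plain Map standing in for the response metadata are all hypothetical. Only the header name nutch.fetch.time and the call to System.currentTimeMillis() come from the discussion above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FetchTimeSketch {

  // Hypothetical stand-in for the point in HttpResponse where the
  // "nutch.fetch.time" header is added. The datumFetchTime parameter
  // represents whatever datum.getFetchTime() returns at that moment;
  // it is deliberately ignored because it may not have been updated yet.
  static Map<String, String> buildHeaders(long datumFetchTime) {
    Map<String, String> headers = new LinkedHashMap<>();
    // Record the wall-clock time of the request instead of the datum value.
    headers.put("nutch.fetch.time", Long.toString(System.currentTimeMillis()));
    return headers;
  }

  public static void main(String[] args) {
    // 0L is an arbitrary placeholder for a possibly stale datum value.
    Map<String, String> headers = buildHeaders(0L);
    System.out.println(headers.get("nutch.fetch.time"));
  }
}
```

The header then reflects when the protocol implementation actually issued the request, independently of when the fetcher thread later updates the datum.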


