Github user jorgelbg commented on a diff in the pull request:
https://github.com/apache/nutch/pull/55#discussion_r39860853
--- Diff:
src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
---
@@ -193,34 +197,54 @@ public HttpResponse(HttpBase http, URL url, CrawlDatum datum)
reqStr.append("\r\n");
if (http.isIfModifiedSinceEnabled() && datum.getModifiedTime() > 0) {
- reqStr.append("If-Modified-Since: "
- + HttpDateFormat.toString(datum.getModifiedTime()));
+ reqStr.append("If-Modified-Since: " + HttpDateFormat
+ .toString(datum.getModifiedTime()));
reqStr.append("\r\n");
}
reqStr.append("\r\n");
+ // store the request in the metadata?
+ if (conf.getBoolean("store.http.request", false) == true) {
+ headers.add("_request_", reqStr.toString());
+ }
+
byte[] reqBytes = reqStr.toString().getBytes();
req.write(reqBytes);
req.flush();
PushbackInputStream in = // process response
- new PushbackInputStream(new BufferedInputStream(socket.getInputStream(),
- Http.BUFFER_SIZE), Http.BUFFER_SIZE);
+ new PushbackInputStream(
+ new BufferedInputStream(socket.getInputStream(),
+ Http.BUFFER_SIZE), Http.BUFFER_SIZE);
StringBuffer line = new StringBuffer();
+ // store the http headers verbatim
+ if (conf.getBoolean("store.http.headers", false) == true) {
+ httpHeaders = new StringBuffer();
+ }
+
+ headers.add("nutch.fetch.time", Long.toString(datum.getFetchTime()));
--- End diff --
I thought of using the `System.currentTimeMillis()` method, but after reviewing
this comment https://github.com/apache/nutch/pull/55#issuecomment-140663159 I
thought that `getFetchTime()` would have the right value.
The comment on the `getFetchTime()` method says:
> Returns either the time of the last fetch, or the next fetch time,
> depending on whether Fetcher or CrawlDbReducer set the time.
Since this is executed in the HttpResponse class (right after the fetcher
gets executed), I thought it would be safe to assume that the date would be
accurate. If this is wrong, the fix is easy enough.
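For illustration, here is a rough sketch of the two options. It uses a plain `Map` as a stand-in for the response headers, and the class name, the `buildHeaders` helper and its parameters are hypothetical, not part of Nutch. It only shows how the stored value would differ if the datum happened to carry the next scheduled fetch time rather than the time of the last fetch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in types: this is not the Nutch HttpResponse or Metadata class.
class FetchTimeSketch {

  // Builds the metadata entry either from the datum's fetch time or from the wall clock.
  // datumFetchTime plays the role of datum.getFetchTime(); now plays the role of
  // System.currentTimeMillis() and is passed in so the behavior is easy to check.
  static Map<String, String> buildHeaders(long datumFetchTime, boolean useWallClock, long now) {
    Map<String, String> headers = new LinkedHashMap<>();
    long fetchTime = useWallClock ? now : datumFetchTime;
    headers.put("nutch.fetch.time", Long.toString(fetchTime));
    return headers;
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    // Assumed scenario: the datum holds the NEXT fetch time, 30 days in the future.
    long scheduled = now + 30L * 24 * 60 * 60 * 1000;

    // Current approach: trust the datum, which here would store a future timestamp.
    System.out.println(buildHeaders(scheduled, false, now));
    // Alternative: record the wall-clock time at the moment the request is made.
    System.out.println(buildHeaders(scheduled, true, now));
  }
}
```

If the datum's value turns out to be the next fetch time at this point, switching to the wall-clock variant would be the one-line change mentioned above.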