The file src/java/org/apache/nutch/fetcher/Fetcher.java has the following lines (starting at line 260):

-------------------------------------------------------------------
      if (status.isSuccess()) {
        outputPage(new FetcherOutput(fle, hash, protocolStatus),
                   content, new ParseText(parse.getText()), parse.getData());
      }
-------------------------------------------------------------------

where hash is computed as follows (starting at line 233):

-------------------------------------------------------------------
      Content content = output.getContent();
      MD5Hash hash = null;
      String url = fle.getPage().getURL().toString();
      if (content == null) {
        content = new Content(url, url, new byte[0], "", new Properties());
        hash = MD5Hash.digest(url);
      } else {
        hash = MD5Hash.digest(content.getContent());
      }
-------------------------------------------------------------------

It's a little late right now and perhaps I'm asking a naive question, but if the parse is successful on non-null content, what would be the by-product of changing the content hash from

hash = MD5Hash.digest(content.getContent());

to the MD5 digest of parse.getText()?
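To make the question concrete, here is a minimal sketch of how the two choices differ. It is not Nutch code. It uses the JDK's java.security.MessageDigest instead of MD5Hash, and extractText is a hypothetical, very crude stand-in for whatever parse.getText() returns. The sample pages are made up.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class HashComparison {

    // MD5 of a byte array as a lowercase hex string (JDK only, no Nutch classes).
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Hypothetical stand-in for parse.getText(): drops tags, collapses whitespace.
    // A real parser does far more; this only illustrates the idea.
    static String extractText(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Two made-up pages: same visible text, different markup.
        String pageA = "<html><body><p>Hello Nutch</p></body></html>";
        String pageB = "<html><body><div class=\"x\">Hello   Nutch</div></body></html>";

        // Hash of the raw fetched bytes, analogous to digest(content.getContent()).
        String rawA = md5Hex(pageA.getBytes(StandardCharsets.UTF_8));
        String rawB = md5Hex(pageB.getBytes(StandardCharsets.UTF_8));

        // Hash of the extracted text, analogous to a digest of parse.getText().
        String textA = md5Hex(extractText(pageA).getBytes(StandardCharsets.UTF_8));
        String textB = md5Hex(extractText(pageB).getBytes(StandardCharsets.UTF_8));

        System.out.println("raw hashes equal:  " + rawA.equals(rawB));
        System.out.println("text hashes equal: " + textA.equals(textB));
    }
}
```

In this sketch the raw hashes differ while the text hashes match, so anything that compares hashes would treat the two pages as identical under the text-based scheme. It would also give every page whose extracted text is empty the same hash.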

Thoughts?



