Hi, We are working on a project in which the actual text content of pages is used to decide their topical relevance (implementing "focused crawling" in Nutch). Most of the methods in the code return Page objects. But looking at the Page class definition, I found no fields that would give me access to the actual HTML source or to the parsed data inside the HTML page.
Is there a place in the Nutch source code where we can get the HTML source (or maybe just the textual content of a page), given the URL or maybe the Page object? Thanks for any forthcoming help! -Rajat http://www-scf.usc.edu/~swarup/
