On 23/04/05, rajat swarup <[EMAIL PROTECTED]> wrote:
> Most of the methods in the code return Page objects in the code. But
> looking at the Page class definitions I found that there were no
> fields in the Page class that would give me access to the actual HTML
> source code or the parsed data inside the HTML page.
>
> Is there a place in the Nutch source code where we can get the HTML
> source code (or maybe just the textual content of the pages) given the
> URL or may be the Page object?

org.apache.nutch.db.Page.write writes all of the Page's fields out to a
DataOutputStream. The Page object also has accessors, but I do not see
one that returns the page source. It looks like there is also a getPage
method in org.apache.nutch.db.DBSectionReader.

Hope this helps...
-- 
Cheers,
Hasan Diwan <[EMAIL PROTECTED]>
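
For readers unfamiliar with the pattern mentioned above, here is a minimal sketch of what "writing a record out to a DataOutputStream" looks like in plain Java. It does not use Nutch at all: PageRecordSketch, its url and score fields, and the helper methods are hypothetical stand-ins invented for illustration, and the real Page class may store different fields in a different layout. The point is only that such a write method emits the fields the object holds, so anything the object does not hold (such as HTML source) cannot come back out of it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical stand-in for a record with a write(...) method.
// This is NOT org.apache.nutch.db.Page; the fields are assumptions.
class PageRecordSketch {
    final String url;
    final float score;

    PageRecordSketch(String url, float score) {
        this.url = url;
        this.score = score;
    }

    // Serialize every field, in a fixed order, to the stream.
    void write(DataOutputStream out) throws IOException {
        out.writeUTF(url);
        out.writeFloat(score);
    }

    // Read the fields back in the same order they were written.
    static PageRecordSketch read(DataInputStream in) throws IOException {
        String url = in.readUTF();
        float score = in.readFloat();
        return new PageRecordSketch(url, score);
    }

    // Capture the serialized form in memory instead of in a file.
    static byte[] toBytes(PageRecordSketch page) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            page.write(out);
        }
        return buffer.toByteArray();
    }

    // Rebuild a record from bytes produced by toBytes.
    static PageRecordSketch fromBytes(byte[] bytes) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            return read(in);
        }
    }

    public static void main(String[] args) throws IOException {
        PageRecordSketch original = new PageRecordSketch("http://example.com/", 1.0f);
        byte[] bytes = toBytes(original);
        PageRecordSketch copy = fromBytes(bytes);
        System.out.println(copy.url + " " + copy.score + " (" + bytes.length + " bytes)");
    }
}
```

Dumping the serialized bytes this way is a quick check of which fields a record actually carries, which is one way to confirm whether page content is stored in the object or has to be looked up elsewhere.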
