Re: [Nutch-dev] Getting HTML source

Hasan Diwan Mon, 25 Apr 2005 09:10:57 -0700

On 23/04/05, rajat swarup <[EMAIL PROTECTED]> wrote:
> Most of the methods in the code return Page objects. But looking at
> the Page class definition, I found no fields that would give me
> access to the actual HTML source code or the parsed data inside the
> HTML page.
> 
> Is there a place in the Nutch source code where we can get the HTML
> source code (or maybe just the textual content of the pages) given the
> URL or maybe the Page object?


org.apache.nutch.db.Page.write writes everything out to a
DataOutputStream. The Page object also has accessors, but I do not see
one that returns the page source. There also appears to be a getPage
method in org.apache.nutch.db.DBSectionReader. Hope this helps...
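
To illustrate what a write method of that kind does, here is a minimal
sketch. It is not the Nutch Page class: PageRecordSketch, its two
fields, and the roundTrip helper are all hypothetical stand-ins. The
point is only that such a method serializes the fields the object
holds, so if no field carries the HTML, the written bytes will not
contain it either.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in, NOT org.apache.nutch.db.Page.
// The fields (url, score) are assumptions chosen for illustration.
public class PageRecordSketch {
    private final String url;
    private final float score;

    public PageRecordSketch(String url, float score) {
        this.url = url;
        this.score = score;
    }

    public String getUrl() {
        return url;
    }

    public float getScore() {
        return score;
    }

    // Writes every field of this record, in a fixed order.
    public void write(DataOutputStream out) throws IOException {
        out.writeUTF(url);
        out.writeFloat(score);
    }

    // Reads the fields back in the same order they were written.
    public static PageRecordSketch read(DataInputStream in) throws IOException {
        String url = in.readUTF();
        float score = in.readFloat();
        return new PageRecordSketch(url, score);
    }

    // Serializes a record to an in-memory buffer and reads it back.
    public static PageRecordSketch roundTrip(PageRecordSketch page) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(buffer)) {
                page.write(out);
            }
            try (DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(buffer.toByteArray()))) {
                return read(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        PageRecordSketch copy = roundTrip(new PageRecordSketch("http://example.com/", 1.0f));
        System.out.println(copy.getUrl() + " " + copy.getScore());
    }
}
```

Only the URL and score survive the round trip here, because those are
the only fields the record has.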
-- 
Cheers,
Hasan Diwan <[EMAIL PROTECTED]>
