Re: [Nutch-dev] Getting HTML source

Hasan Diwan Mon, 25 Apr 2005 09:13:45 -0700

On 23/04/05, rajat swarup <[EMAIL PROTECTED]> wrote:
> Most of the methods in the code return Page objects. But
> looking at the Page class definitions I found that there were no
> fields in the Page class that would give me access to the actual HTML
> source code or the parsed data inside the HTML page.
> 
> Is there a place in the Nutch source code where we can get the HTML
> source code (or maybe just the textual content of the pages) given the
> URL or maybe the Page object?


org.apache.nutch.db.Page.write writes everything out to a
DataOutputStream. The Page object also has accessors, but I do not see
one that returns the page source. There also appears to be a getPage
method in org.apache.nutch.db.DBSectionReader. Hope this helps...
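
To see what a write method of this kind actually emits, one option is to
point it at an in-memory stream and inspect the bytes. The sketch below
is only an illustration of that idea: StubPage, its url and score
fields, and the capture/readUrl helpers are hypothetical stand-ins and
are not Nutch classes. It assumes the real method accepts something
compatible with java.io.DataOutputStream, as described above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

public class PageWriteSketch {

    // Hypothetical stand-in for a record with a write(DataOutput) method.
    // This is NOT the Nutch Page class; the fields are invented for illustration.
    static class StubPage {
        final String url;
        final float score;

        StubPage(String url, float score) {
            this.url = url;
            this.score = score;
        }

        void write(DataOutput out) throws IOException {
            out.writeUTF(url);
            out.writeFloat(score);
        }
    }

    // Capture whatever write() emits into an in-memory byte array.
    static byte[] capture(StubPage page) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            page.write(out);
        }
        return buffer.toByteArray();
    }

    // Read the first UTF string back from the captured bytes.
    static String readUrl(byte[] bytes) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            return in.readUTF();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = capture(new StubPage("http://example.com/", 1.0f));
        System.out.println(bytes.length + " bytes, url=" + readUrl(bytes));
    }
}
```

With a real Page in place of the stub, the same capture step would show
whether the serialized record holds anything beyond the fields the
accessors already expose.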
-- 
Cheers,
Hasan Diwan <[EMAIL PROTECTED]>


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
