Re: [Nutch-dev] Getting HTML source

Piotr Kosiorowski Tue, 26 Apr 2005 05:59:54 -0700

Hello,

The Page object does not contain the HTML page content. To access the content of a fetched page, you have to iterate over the segment data and extract it from there. Please have a look at the SegmentReader class; it gives you a simple API for accessing all segment data.

Regards,
Piotr
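
The idea of iterating over segment data to find one page's content can be sketched roughly as follows. This is only an illustration: FetchedEntry and findHtml are hypothetical stand-ins invented for this sketch, not Nutch classes or methods, and the in-memory list stands in for a real segment on disk. The actual SegmentReader API should be checked in the Nutch source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

public class SegmentScanSketch {

    /** Hypothetical stand-in for one fetched entry in a segment (not a Nutch class). */
    static final class FetchedEntry {
        final String url;
        final String rawHtml;

        FetchedEntry(String url, String rawHtml) {
            this.url = url;
            this.rawHtml = rawHtml;
        }
    }

    /** Scans the entries in order and returns the raw HTML of the first one whose URL matches. */
    static Optional<String> findHtml(Iterable<FetchedEntry> segment, String url) {
        for (FetchedEntry entry : segment) {
            if (entry.url.equals(url)) {
                return Optional.of(entry.rawHtml);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Assumption: a small in-memory list plays the role of the segment data.
        List<FetchedEntry> segment = new ArrayList<>();
        segment.add(new FetchedEntry("http://example.com/a", "<html><body>A</body></html>"));
        segment.add(new FetchedEntry("http://example.com/b", "<html><body>B</body></html>"));

        // Look up the content by URL; fall back to a marker if the page was never fetched.
        System.out.println(findHtml(segment, "http://example.com/b").orElse("(not fetched)"));
    }
}
```

The lookup is a linear scan, which mirrors the "iterate and extract" approach described above; it makes no claim about how Nutch indexes segment entries internally.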

Hasan Diwan wrote:

On 23/04/05, rajat swarup <[EMAIL PROTECTED]> wrote:

Most of the methods in the code return Page objects. But looking at
the Page class definition, I found no fields in the Page class that
would give me access to the actual HTML source code or the parsed
data inside the HTML page.

Is there a place in the Nutch source code where we can get the HTML
source code (or maybe just the textual content of the pages) given the
URL or maybe the Page object?

org.apache.nutch.db.Page.write writes everything out to a
DataOutputStream. Also, the Page object has accessors. I do not see a
method to get the page source. It looks like there is also a getPage
method in org.apache.nutch.db.DBSectionReader. Hope this helps...


