On 23/04/05, rajat swarup <[EMAIL PROTECTED]> wrote:
> Most of the methods in the code return Page objects in the code. But
> looking at the Page class definitions I found that there were no
> fields in the Page class that would give me access to the actual HTML
> source code or the parsed data inside the HTML page.
>
> Is there a place in the Nutch source code where we can get the HTML
> source code (or maybe just the textual content of the pages) given the
> URL or may be the Page object?

org.apache.nutch.db.Page.write writes everything out to a
DataOutputStream. The Page object also has accessors, but I do not see
a method among them that returns the page source. It looks like there
is also a getPage method in org.apache.nutch.db.DBSectionReader.

Hope this helps...
-- 
Cheers,
Hasan Diwan <[EMAIL PROTECTED]>

_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
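
For anyone unfamiliar with the write-to-a-DataOutputStream pattern mentioned in the reply, here is a minimal sketch of it. Note that PageRecord, its three fields, and roundTrip are all made up for illustration; this is not the real org.apache.nutch.db.Page, and I have not checked which fields the real class serializes. Only the java.io classes are real.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical stand-in for a database record.
// This is NOT org.apache.nutch.db.Page; the fields are assumptions.
final class PageRecord {
    final String url;      // assumed field
    final float score;     // assumed field
    final long nextFetch;  // assumed field

    PageRecord(String url, float score, long nextFetch) {
        this.url = url;
        this.score = score;
        this.nextFetch = nextFetch;
    }

    // Write each field in a fixed order; a reader must use the same order.
    void write(DataOutputStream out) throws IOException {
        out.writeUTF(url);
        out.writeFloat(score);
        out.writeLong(nextFetch);
    }

    // Read the fields back in the order write() emitted them.
    static PageRecord read(DataInputStream in) throws IOException {
        return new PageRecord(in.readUTF(), in.readFloat(), in.readLong());
    }
}

final class PageRecordDemo {
    // Serialize a record to an in-memory buffer and read it back.
    static PageRecord roundTrip(PageRecord p) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            p.write(out);
        }
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(buf.toByteArray()))) {
            return PageRecord.read(in);
        }
    }

    public static void main(String[] args) throws IOException {
        PageRecord copy = roundTrip(new PageRecord("http://example.com/", 1.0f, 0L));
        System.out.println(copy.url + " score=" + copy.score);
    }
}
```

The point of the sketch is only that a record written this way contains exactly the fields its write method emits, so reading the real write method should show quickly whether any page content is stored in a Page at all.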
