Simon Detri wrote:

Hello,

After the crawl is done, I would like to query
the webdb for pages (by URL), and I would like
to access the content of these pages.

I see that there is a method
WebDBReader.getUrl(String url) which returns a Page.
Is there a way to get the recno of this Page so
that I can retrieve the Content by doing something
like this:

// code from net.nutch.protocol.Content.java
File file = new File(segment, DIR_NAME);
ArrayFile.Reader contents = new ArrayFile.Reader(file.toString());
Content content = new Content();
contents.get(recno, content);

The quickest way is to build an index with the IndexSegment tool; you can then find the recno by searching that index for the URL. That is how NutchBean retrieves copies of Content. Note, however, that the same URL can appear in many segments, pointing to the same or different content (e.g. different versions). Ordinarily, after creating the segment indexes you would run DeleteDuplicates to prune the segments' data.
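
To illustrate the point about duplicates, here is a minimal sketch in plain Java. It is not Nutch code and does not use the Nutch or Lucene APIs: SegmentLookupSketch, Hit, add, find and newest are hypothetical names, and an in-memory map stands in for the segment indexes. It only shows that one URL can yield several (segment, recno) pairs, and it assumes you want the most recently fetched copy.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory stand-in for segment indexes; not part of Nutch. */
public class SegmentLookupSketch {

    /** One indexed record: which segment holds the content, and at which recno. */
    public static final class Hit {
        public final String segment;  // segment name (illustrative)
        public final int recno;       // record number inside that segment
        public final long fetchTime;  // used only to choose among duplicates

        public Hit(String segment, int recno, long fetchTime) {
            this.segment = segment;
            this.recno = recno;
            this.fetchTime = fetchTime;
        }
    }

    // URL -> every place that URL was recorded, across all segments.
    private final Map<String, List<Hit>> byUrl = new HashMap<>();

    /** Record that a URL was stored in a segment at the given recno. */
    public void add(String url, String segment, int recno, long fetchTime) {
        byUrl.computeIfAbsent(url, k -> new ArrayList<>())
             .add(new Hit(segment, recno, fetchTime));
    }

    /** All hits for a URL; there may be several, one per segment. */
    public List<Hit> find(String url) {
        return byUrl.getOrDefault(url, new ArrayList<>());
    }

    /** The most recently fetched hit for a URL, or null if the URL is unknown. */
    public Hit newest(String url) {
        Hit best = null;
        for (Hit h : find(url)) {
            if (best == null || h.fetchTime > best.fetchTime) {
                best = h;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        SegmentLookupSketch index = new SegmentLookupSketch();
        index.add("http://example.com/", "segA", 3, 100L);
        index.add("http://example.com/", "segB", 7, 200L);
        Hit h = index.newest("http://example.com/");
        System.out.println(h.segment + " recno=" + h.recno);
    }
}
```

With the real tools, the segment and recno found this way are what you would pass to the ArrayFile.Reader code quoted above.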


--
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)



_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
