Re: API for injecting content into Nutch?

Matt Kangas Mon, 26 Sep 2005 12:48:42 -0700

Dave, you don't want to "inject" anything per se, at least according to Nutch terminology. Instead, you'll want to create your own synthetic crawler. Nutch's crawler outputs one "segment file" (a directory of files, actually) per crawler pass. It is this segment that is processed by the "nutch index" stage.

So, create a program that iterates through your content and writes it to a segment file, simulating the crawler's output. Just read the source for Fetcher.java to see how it uses org.apache.nutch.segment.SegmentWriter and mimic that. Then follow the rest of the tutorial as if your segment files had fallen out of the real crawler.
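
The loop described above might look roughly like the following. This is only a sketch of the shape of such a program, not Nutch code: Page, SegmentSink, InMemorySink and writeSegment are hypothetical names made up for illustration, and SegmentSink merely stands in for the real org.apache.nutch.segment.SegmentWriter, whose actual constructor and methods you should take from Fetcher.java.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SyntheticCrawler {

    /** One item of already-fetched content: URL, body, metadata, outlinks. */
    static final class Page {
        final String url;
        final String content;
        final Map<String, String> metadata;
        final List<String> outlinks;

        Page(String url, String content,
             Map<String, String> metadata, List<String> outlinks) {
            this.url = url;
            this.content = content;
            this.metadata = metadata;
            this.outlinks = outlinks;
        }
    }

    /**
     * Hypothetical stand-in for the segment writer. This is NOT the
     * Nutch API; a real implementation would delegate to SegmentWriter
     * the same way Fetcher.java does.
     */
    interface SegmentSink {
        void append(Page page);
        void close();
    }

    /** Writes every page to the sink, always closes it, returns the count. */
    static int writeSegment(Iterable<Page> pages, SegmentSink sink) {
        int written = 0;
        try {
            for (Page page : pages) {
                sink.append(page);
                written++;
            }
        } finally {
            sink.close();
        }
        return written;
    }

    /** In-memory sink so the sketch runs without Nutch on the classpath. */
    static final class InMemorySink implements SegmentSink {
        final List<String> urls = new ArrayList<>();
        boolean closed = false;

        @Override
        public void append(Page page) {
            if (closed) {
                throw new IllegalStateException("sink already closed");
            }
            urls.add(page.url);
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    public static void main(String[] args) {
        Map<String, String> meta = new HashMap<>();
        meta.put("title", "Example page");
        List<Page> pages = Arrays.asList(
            new Page("http://example.com/a", "first body", meta,
                     Arrays.asList("http://example.com/b")),
            new Page("http://example.com/b", "second body", meta,
                     new ArrayList<String>()));

        InMemorySink sink = new InMemorySink();
        int count = writeSegment(pages, sink);
        System.out.println("pages written: " + count);
    }
}
```

Swapping InMemorySink for a class that wraps the real segment writer would leave the loop itself unchanged, which is the point of keeping the sink behind an interface here.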


--Matt

On Sep 26, 2005, at 2:32 PM, Goldschmidt, Dave wrote:

Hello,
Is there an API of some sort for injecting content into Nutch *without* using Nutch's crawler? Or does anyone have ideas as to how to approach
this problem?  I.e. given a URL, a page of content, metadata about the
page, links, etc., how can I inject this into Nutch without Nutch
performing the crawl?

Thanks in advance for your ideas and insights,


DaveG


--
Matt Kangas / [EMAIL PROTECTED]
