Dave, you don't want to "inject" anything per se, at least not in Nutch's terminology. Instead, you'll want to create your own synthetic crawler. Nutch's crawler outputs one "segment file" (actually a directory of files) per crawler pass. It is this segment that the "nutch index" stage processes.

So, create a program that iterates through your content and writes it to a segment file, simulating the crawler's output. Just read the source for Fetcher.java to see how it uses org.apache.nutch.segment.SegmentWriter and mimic that. Then follow the rest of the tutorial as if your segment files had fallen out of the real crawler.
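
As a rough sketch of the shape such a program might take (not tested against Nutch itself): the loop below walks a list of pages you already have and hands each one to a writer, then closes the writer. SegmentSink, InMemorySink, Page and writeSegment are hypothetical names made up for illustration. The real SegmentWriter has its own constructor and append signature, which you'd need to take from Fetcher.java rather than from this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class SyntheticCrawler {

    /** One page you already hold: URL, content, metadata, and outlinks. */
    static final class Page {
        final String url;
        final String content;
        final Map<String, String> metadata;
        final List<String> outlinks;

        Page(String url, String content,
             Map<String, String> metadata, List<String> outlinks) {
            this.url = url;
            this.content = content;
            this.metadata = metadata;
            this.outlinks = outlinks;
        }
    }

    /**
     * Hypothetical stand-in for Nutch's SegmentWriter. This is NOT the
     * real API; a real implementation would wrap SegmentWriter and
     * convert a Page into whatever records it expects.
     */
    interface SegmentSink {
        void append(Page page);
        void close();
    }

    /** In-memory sink, present only so the sketch runs on its own. */
    static final class InMemorySink implements SegmentSink {
        final List<Page> written = new ArrayList<>();
        boolean closed = false;

        @Override
        public void append(Page page) {
            if (closed) {
                throw new IllegalStateException("segment already closed");
            }
            written.add(page);
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Writes every page to the sink, always closes it, returns the count. */
    static int writeSegment(Iterable<Page> pages, SegmentSink sink) {
        int count = 0;
        try {
            for (Page page : pages) {
                sink.append(page);
                count++;
            }
        } finally {
            sink.close();
        }
        return count;
    }

    public static void main(String[] args) {
        // Placeholder data; in practice this comes from your own content store.
        List<Page> pages = Arrays.asList(
            new Page("http://example.com/a", "<html>page A</html>",
                     Collections.singletonMap("Content-Type", "text/html"),
                     Collections.singletonList("http://example.com/b")),
            new Page("http://example.com/b", "<html>page B</html>",
                     Collections.singletonMap("Content-Type", "text/html"),
                     Collections.<String>emptyList()));

        InMemorySink sink = new InMemorySink();
        int n = writeSegment(pages, sink);
        System.out.println("wrote " + n + " pages");
    }
}
```

Keeping the writer behind a small interface lets you exercise the iteration logic without a Nutch install, then swap in an implementation backed by the real SegmentWriter once you've read how Fetcher.java drives it.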

--Matt

On Sep 26, 2005, at 2:32 PM, Goldschmidt, Dave wrote:

Hello,

Is there an API of some sort for injecting content into Nutch *without* using Nutch's crawler? Or does anyone have ideas as to how to approach this problem? I.e., given a URL, a page of content, metadata about the page, links, etc., how can I inject this into Nutch without Nutch performing the crawl?

Thanks in advance for your ideas and insights,


DaveG


--
Matt Kangas / [EMAIL PROTECTED]

