Hi Jason,

That looks correct: Fetcher.outputPage(...) writes the FetcherOutput to disk via an ArrayFile.Writer instance.
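
Since outputPage is the single point where fetched content gets written, that is the natural place for a local change to redirect it. Below is a rough, self-contained sketch of the idea only. It does not use any Nutch classes: PageSink, RecordingSink and SketchFetcher are hypothetical names invented for illustration, and the real Fetcher.outputPage signature and FetcherOutput contents should be checked against the Nutch source before adapting anything.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical abstraction of the "write the fetched page somewhere" step.
// This is NOT a Nutch interface; it only illustrates the interception point.
interface PageSink {
    void write(String url, byte[] content);
}

// Hypothetical sink standing in for a proprietary datastore.
// It only records pages in memory so that the sketch runs on its own.
class RecordingSink implements PageSink {
    final List<String> urls = new ArrayList<>();
    long totalBytes = 0;

    @Override
    public void write(String url, byte[] content) {
        urls.add(url);
        totalBytes += content.length;
    }
}

// Hypothetical stand-in for the fetcher. In Nutch, the analogous step is
// Fetcher.outputPage(...), which writes to disk via ArrayFile.Writer.
class SketchFetcher {
    private final PageSink sink;

    SketchFetcher(PageSink sink) {
        this.sink = sink;
    }

    // Every fetched page passes through here, so swapping the sink
    // redirects all output without touching the fetch logic itself.
    void outputPage(String url, String body) {
        sink.write(url, body.getBytes(StandardCharsets.UTF_8));
    }
}

class InterceptSketch {
    public static void main(String[] args) {
        RecordingSink sink = new RecordingSink();
        SketchFetcher fetcher = new SketchFetcher(sink);
        fetcher.outputPage("http://example.com/", "hello");
        System.out.println(sink.urls.size() + " page(s), " + sink.totalBytes + " bytes");
    }
}
```

In a locally patched Fetcher, the sink call would either replace the existing disk write or sit next to it, depending on whether you still want the segment files on disk.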

Otis
____________________________________________________________________
Simpy -- simpy.com -- tags, social bookmarks, personal search engine

--- Jason Manfield <[EMAIL PROTECTED]> wrote:

> Otis
>
> Thanks for the pointer.
>
> I suppose the Fetcher.java is the core guy reading contents from the
> URLs and dumping it to different directories in the filesystem (via
> Fetcher.outputPage), right? In that case, can this be intercepted
> (via my code changes locally) to dump the extracted contents into our
> proprietary system? Are the segments created as part of the Fetcher
> or before the call to the Fetcher?
>
> Thanks
>
> Jason
>
>
> [EMAIL PROTECTED] wrote:
> Jason - this is perfectly doable -- I do this for my social
> bookmarking project, Simpy.com
>
> I think people tend to run Nutch using the nutch shell script that
> comes with Nutch, but you can really call the Fetcher Java class
> directly and programmatically yourself, as it has the main method.
> You can do the same with the SegmentMergeTool. So, if you can write a
> Java app, just call Nutch's Java classes the same way that the shell
> script does.
>
> I can't help you with reading Nutch's files with C#, but the source
> is there, so you should be able to write file readers in C#.
>
> Otis
> ____________________________________________________________________
> Simpy -- simpy.com -- tags, social bookmarks, personal search engine
>
>
> --- Jason Manfield wrote:
> > We would like to use nutch just for crawling, and then index the
> > crawled database into our proprietary datastore/index. How do we go
> > about this? I see that nutch is a shell script, so it is possible to
> > just crawl. Once it crawls, I suppose the crawled data is dumped into
> > webdb. Are there exposed APIs to extract the data from webdb?
> >
> > One more catch -- our company is a .NET shop :((, so we would like to
> > use C# to read the data of the fetched/crawled pages for further
> > indexing.
> >
> > Ideas/suggestions?
> >
> > Any plans to have nutch for .NET (like dotLucene)?
