Re: where nutch store crawled data

Marcus Herou Tue, 17 Jun 2008 11:01:01 -0700

Oh sorry, I just saw CrawlDbReader, which has several methods, one in
particular for retrieving content based on a URL.
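
For anyone who just wants to look at the files on disk, here is a rough
sketch in plain Java (no Nutch or Hadoop on the classpath). It walks the
segments directory David describes below and peeks at the header of each
data file to confirm it is a SequenceFile. The layout
segments/<segment>/content/part-*/data is an assumption based on a local
crawl, the class name SegmentPeek is made up for illustration, and the header
parsing only handles SequenceFile versions 4 and up with class names shorter
than 128 bytes. To read the actual records you would still want Hadoop's
SequenceFile.Reader or the readseg command.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SegmentPeek {

    /** Version and key/value class names taken from a SequenceFile header. */
    public static final class Header {
        public final int version;
        public final String keyClass;
        public final String valueClass;

        Header(int version, String keyClass, String valueClass) {
            this.version = version;
            this.keyClass = keyClass;
            this.valueClass = valueClass;
        }
    }

    /** Collects crawlDir/segments/SEGMENT/content/part-N/data files (assumed layout). */
    public static List<Path> findContentDataFiles(Path crawlDir) throws IOException {
        List<Path> result = new ArrayList<>();
        Path segments = crawlDir.resolve("segments");
        if (!Files.isDirectory(segments)) {
            return result;
        }
        try (DirectoryStream<Path> segs = Files.newDirectoryStream(segments)) {
            for (Path seg : segs) {
                Path content = seg.resolve("content");
                if (!Files.isDirectory(content)) {
                    continue;
                }
                try (DirectoryStream<Path> parts = Files.newDirectoryStream(content, "part-*")) {
                    for (Path part : parts) {
                        Path data = part.resolve("data");
                        if (Files.isRegularFile(data)) {
                            result.add(data);
                        }
                    }
                }
            }
        }
        Collections.sort(result);
        return result;
    }

    /** Reads the "SEQ" magic, the version byte, and the key/value class names. */
    public static Header readHeader(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        byte[] magic = new byte[4];
        in.readFully(magic);
        if (magic[0] != 'S' || magic[1] != 'E' || magic[2] != 'Q') {
            throw new IOException("not a SequenceFile");
        }
        int version = magic[3];
        if (version < 4) {
            // Older files store class names in a different string format.
            throw new IOException("header version " + version + " not handled in this sketch");
        }
        return new Header(version, readShortString(in), readShortString(in));
    }

    /** Reads a string whose length prefix fits in a single byte (0..127). */
    private static String readShortString(DataInputStream in) throws IOException {
        int len = in.readByte();
        if (len < 0) {
            throw new IOException("multi-byte length prefix not handled in this sketch");
        }
        byte[] buf = new byte[len];
        in.readFully(buf);
        return new String(buf, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the crawl directory, e.g. crawls/crawl-xyz
        for (Path data : findContentDataFiles(Paths.get(args[0]))) {
            try (InputStream in = Files.newInputStream(data)) {
                Header h = readHeader(in);
                System.out.println(data + ": key=" + h.keyClass + " value=" + h.valueClass);
            }
        }
    }
}
```

Run it as "java SegmentPeek crawls/crawl-xyz". It prints one line per data
file it finds, with whatever key and value class names the header declares.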


//Marcus

On Tue, Jun 17, 2008 at 7:57 PM, Marcus Herou <[EMAIL PROTECTED]>
wrote:

> You can fetch it but it is not pretty.
>
> It is just a SequenceFileInputFormat:
> http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/SequenceFileInputFormat.html
>
> Look in the org.apache.nutch.crawl.Crawl class and specifically how it uses
> the Indexer.
>
> Kindly
>
> //Marcus
>
>
> On Tue, Jun 17, 2008 at 3:57 PM, beansproud <[EMAIL PROTECTED]>
> wrote:
>
>>
>> oh, you are right.
>> thanks
>>
>>
>> POIRIER David wrote:
>> >
>> > When executing a crawl, Nutch creates segments, based on the crawl
>> > depth if I'm not mistaken, in which the fetched content is stored. For
>> > example, if crawling a web site named site-xyz, into the directory
>> > $nutch_home/crawls/crawl-xyz, you will find the segments in the
>> > following directory: $nutch_home/crawls/crawl-xyz/segments. For each
>> > segment directory you will find a content directory.
>> >
>> > To be honest, I don't think you can directly access the stored content
>> > found in those directories, the idea being to index it and not
>> > necessarily store it.
>> >
>> > David
>> >
>> >
>> >
>> > -----Original Message-----
>> > From: beansproud [mailto:[EMAIL PROTECTED]
>> > Sent: Monday, 16 June 2008 16:42
>> > To: [email protected]
>> > Subject: where nutch store crawled data
>> >
>> >
>> > Hi,
>> >     I'm new to Nutch. When I use Nutch to crawl pages, I can get the
>> > crawled data by using the command: nutch readseg.
>> >     My question is: can I get the data directly? I just can't find where
>> > Nutch puts it.
>> >     Can anybody tell me?
>> >     Thanks very much!
>> > --
>> > View this message in context:
>> > http://www.nabble.com/where-nutch-store-crawled-data-tp17865961p17865961.html
>> > Sent from the Nutch - User mailing list archive at Nabble.com.
>> >
>> >
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/where-nutch-store-crawled-data-tp17865961p17905486.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>>
>
>
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> [EMAIL PROTECTED]
> http://www.tailsweep.com/
> http://blogg.tailsweep.com/




