Re: how to parse html files while crawling

Ankit Dangi Wed, 21 Apr 2010 04:34:26 -0700

To convert Nutch's crawled data, which is stored in segments, into a
human-readable form, look at the 'readseg' command (called 'segread' in
earlier releases). It reads the segment data and exports it as plain
text.


Details at Nutch Wiki:
http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_segread
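
As a rough illustration, here is a minimal Python sketch that builds and
runs a segment dump. It assumes a Nutch 1.x layout where the command is
'bin/nutch readseg' and accepts the -dump option with the -no* filter
flags; run 'bin/nutch readseg' without arguments to check the usage for
your version. The helper names and the paths are hypothetical.

```python
import subprocess
from pathlib import Path


def build_readseg_command(segment_dir, output_dir, text_only=True,
                          nutch_bin="bin/nutch"):
    """Build the argument list for a segment dump (hypothetical helper).

    Assumes the Nutch 1.x 'readseg -dump' syntax; older releases may name
    the command 'segread' and take different options.
    """
    cmd = [nutch_bin, "readseg", "-dump", str(segment_dir), str(output_dir)]
    if text_only:
        # Skip everything except the parsed text of each page.
        cmd += ["-nocontent", "-nofetch", "-nogenerate",
                "-noparse", "-noparsedata"]
    return cmd


def dump_segment(segment_dir, output_dir, **kwargs):
    """Run the dump and return the path where the text file is expected."""
    cmd = build_readseg_command(segment_dir, output_dir, **kwargs)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    # Assumption: Nutch writes a plain-text file named 'dump' in output_dir.
    return Path(output_dir) / "dump"


if __name__ == "__main__":
    # Example paths only; replace them with a real segment directory.
    print(dump_segment("crawl/segments/20100421043426", "segment_dump"))
```

Run it from the Nutch installation directory so that 'bin/nutch' resolves;
the resulting file can then be opened in any text editor.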

- Ankit Dangi


On Mon, Apr 19, 2010 at 9:15 PM, nachonieto3 <jinietosanc...@gmail.com> wrote:

>
> I have a question related to this topic (I guess). How are the final
> results of Nutch stored? I mean, in which format is the information
> contained in the analyzed links stored?
>
> I understand that Nutch needs the information in plain text to parse
> it, but in which format is it finally stored? I know it is stored in
> "segments", but how can I access this information in order to convert
> it to plain text? Is it possible?
>
> Thank you in advance
> --
> View this message in context:
> http://n3.nabble.com/how-to-parse-html-files-while-crawling-tp706816p729943.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>



-- 
Ankit Dangi
