Such data is called "Content" (a Java class). It carries the HTTP headers,
so you can check the MIME type ("text/html", "image/jpeg", etc.).
It is stored in the "content" folder of the segment.
You can extract it and save it to the hard drive as regular files, but you
need your own code for that (Nutch does not have it).
For a starting point, check the Fetcher class: "ArrayFile.Writer contentWriter;"
You also need to modify the URL filter files (crawl-urlfilter.txt or
regex-urlfilter.txt) so that image extensions are not blocked.
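To illustrate the URL filter change, here is a small stand-alone sketch of how such rules behave. It is not Nutch code. It assumes that each rule is a regex prefixed with '+' (accept) or '-' (reject), that the first matching rule decides, and that a URL matching no rule is rejected. The class name, the rules, and the example URL are all made up for the illustration.

```java
import java.util.List;
import java.util.regex.Pattern;

public class UrlFilterSketch {
    // One rule line: '+' accepts, '-' rejects, if the regex is found in the URL.
    public record Rule(boolean accept, Pattern pattern) {
        public static Rule parse(String line) {
            return new Rule(line.charAt(0) == '+', Pattern.compile(line.substring(1)));
        }
    }

    // The first matching rule decides; a URL that matches no rule is rejected
    // (an assumption of this sketch).
    public static boolean accepts(List<Rule> rules, String url) {
        for (Rule r : rules) {
            if (r.pattern().matcher(url).find()) {
                return r.accept();
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical rules, not copied from a real Nutch config file.
        List<Rule> blocking = List.of(Rule.parse("-\\.(gif|jpg|png)$"), Rule.parse("+."));
        // Same rules with jpg and png removed from the reject list.
        List<Rule> allowing = List.of(Rule.parse("-\\.(gif)$"), Rule.parse("+."));

        String url = "http://example.com/photo.jpg";
        System.out.println(accepts(blocking, url)); // false
        System.out.println(accepts(allowing, url)); // true
    }
}
```

So to let images through, the image extensions have to come out of the reject rule (or an accept rule has to appear before it).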
Content.getContentType() - MIME type
byte[] Content.getContent() - raw file content (JPG, HTML, DOC, PDF, ...)
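
A rough sketch of the "save as regular files" part, using only those two accessors. The FetchedContent interface below is a stand-in I made up so the example compiles on its own; it is not the Nutch Content class, and reading the records out of the segment is not shown. The MIME-to-extension table and the file names are also just assumptions.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;
import java.util.Map;

public class ContentSaver {
    // Stand-in for the two accessors named above; NOT the Nutch class itself.
    public interface FetchedContent {
        String getContentType();
        byte[] getContent();
    }

    // Hypothetical MIME-type-to-extension table; extend as needed.
    static final Map<String, String> EXT = Map.of(
            "text/html", ".html",
            "image/jpeg", ".jpg",
            "image/gif", ".gif",
            "application/pdf", ".pdf");

    // Pick a file extension from the content type, falling back to ".bin".
    public static String extensionFor(String contentType) {
        if (contentType == null) {
            return ".bin";
        }
        // Drop parameters such as "; charset=UTF-8" before the lookup.
        String mime = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return EXT.getOrDefault(mime, ".bin");
    }

    // Write the raw bytes to dir/baseName.<ext> and return the path written.
    public static Path save(FetchedContent c, Path dir, String baseName) throws IOException {
        Files.createDirectories(dir);
        Path out = dir.resolve(baseName + extensionFor(c.getContentType()));
        return Files.write(out, c.getContent());
    }

    public static void main(String[] args) throws IOException {
        // Fake record standing in for one entry read from a segment.
        FetchedContent fake = new FetchedContent() {
            public String getContentType() { return "image/jpeg"; }
            public byte[] getContent() { return new byte[] {(byte) 0xFF, (byte) 0xD8}; }
        };
        Path written = save(fake, Files.createTempDirectory("content"), "page-0001");
        System.out.println("wrote " + written);
    }
}
```

In real use you would call save() once per record while iterating over the segment's content data, with a base name derived from the record number or the URL.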
-----Original Message-----
From: Zhou LiBing [mailto:[EMAIL PROTECTED]
Sent: Saturday, August 20, 2005 7:59 PM
To: [EMAIL PROTECTED]
Subject: Re: [Nutch-general] Re: about the nutch function
Can I specify more than one URL to crawl the whole web? Are you sure? How
do I edit crawl-urlfilter.txt to fetch the images? How could I extract the
features of these segment images?
Thank you
2005/8/19, Piotr Kosiorowski <[EMAIL PROTECTED]>:
>
> Yes. Yes. :)
> You can specify more than one URL while injecting pages into the WebDB. You
> can fetch image files (you have to edit crawl-urlfilter.txt or
> regex-urlfilter.txt to allow the particular extensions, as the majority of
> image extensions are blocked by default). But such data would only be stored
> in the segment - I do not think it would be accessible by search.
> P.
>
>
> On 8/19/05, Zhou LiBing <[EMAIL PROTECTED]> wrote:
> > Can Nutch use one or more start URL to crawl the WEB?
> > Can Nutch fetch the IMAGE file?
> > thank you
> >
> >
> > --
> > ---Letter From your friend Blue at HUST CGCL---
> >
> >
>
>
>
--
---Letter From your friend Blue at HUST CGCL---