Hi,

I don't think Nutch can be configured to store each downloaded file as a
separate file on your local disk (one file downloaded, one file written).
However, the byte array called "content" can probably be stored directly;
I think that's worth a try. The fetcher uses (binary) streams to handle
the downloaded content, so it *should* be okay.
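As a rough sketch of what "storing the byte array directly" could look like
inside a parse plugin (the class and path here are hypothetical, not Nutch
API; the byte[] stands in for the "content" the parser receives):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ContentSaver {

    // Writes the raw fetched bytes to a local file. Because we stay at the
    // byte level, no character decoding can corrupt binary formats like ZIP.
    public static void save(byte[] content, String target) throws IOException {
        Path path = Paths.get(target);
        if (path.getParent() != null) {
            Files.createDirectories(path.getParent());
        }
        Files.write(path, content); // binary-safe, unlike a PrintWriter
    }

    public static void main(String[] args) throws IOException {
        // Fake payload starting with the ZIP magic bytes, just for the demo.
        byte[] fake = {0x50, 0x4B, 0x03, 0x04};
        save(fake, "downloads/example.zip");
        System.out.println(Files.size(Paths.get("downloads/example.zip"))); // prints 4
    }
}
```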

Another approach (my two cents):
 1. Run the fetcher with the -noParse option (most likely not even
necessary).
 2. Check whether the fetcher is configured to store the content (there is a
property for this in nutch-default.xml).
 3. Create a dump with the "readseg" command and its "-dump" option.
 4. Process the dump file and cut out what you need.
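On the command line, the steps above might look roughly like this (the
segment path is made up, and flag spellings vary between Nutch versions, so
check the usage output of bin/nutch first):

```shell
# 1. Fetch without parsing (check "bin/nutch fetch" usage for the exact flag)
bin/nutch fetch crawl/segments/20070911120000 -noParse

# 2. Make sure content is stored; in nutch-site.xml override:
#    <property>
#      <name>fetcher.store.content</name>
#      <value>true</value>
#    </property>

# 3. Dump the segment to a plain-text file
bin/nutch readseg -dump crawl/segments/20070911120000 dumpdir

# 4. Cut out what you need from the dump, e.g. locate zip responses
grep -n "application/zip" dumpdir/dump
```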

Just curious whether that would work . . . however:
I had a look at the class implementing the readseg command and found that
the dump file is created with a "PrintWriter". I think that will cause
trouble with binary content. Maybe you can modify the SegmentReader to use
an OutputStream instead.
Regarding the fetcher - it already uses a binary stream to store the content
(FSDataOutputStream).
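To illustrate why a PrintWriter is risky for binary data: a writer goes
bytes -> String -> bytes, and any byte sequence that is invalid in the
chosen charset gets replaced, while a plain OutputStream passes the bytes
through untouched. A small self-contained demo (UTF-8 chosen explicitly
here so the result is deterministic; SegmentReader would use whatever the
platform default is):

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BinaryDumpDemo {
    public static void main(String[] args) throws Exception {
        // 0x89 on its own is not valid UTF-8 (it is a lone continuation byte).
        byte[] raw = {(byte) 0x89, 0x50, 0x4E, 0x47};

        // Writer-style route: decode to a String, encode back to bytes.
        // The invalid byte is replaced with U+FFFD (3 bytes in UTF-8).
        byte[] viaWriter = new String(raw, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);

        // Stream route: the bytes are written exactly as received.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(raw);
        byte[] viaStream = out.toByteArray();

        System.out.println(Arrays.equals(raw, viaWriter)); // false: corrupted
        System.out.println(Arrays.equals(raw, viaStream)); // true: intact
    }
}
```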


Cheers,

Martin


On 9/11/07, eyal edri <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> I've asked this question before on a different mailing list, with no real
> response.
> I hope someone sees the need for this and can help.
>
> I'm trying to configure Nutch to download certain file types (exe/zip) to
> the file system while crawling.
> I know Nutch doesn't have a parse-exe plugin, so I'll focus on ZIP (once I
> understand the logic, I will write a parse-exe plugin).
>
> I want to know whether Nutch supports downloading files inherently (using
> only conf files) or, if not, how I can alter the parse-zip plugin to
> download the file.
> (I saw the parser gets a byte array called "content"; can I save this to
> the fs?)
>
> thanks,
>
>
> --
> Eyal Edri
>