Hi,

I don't think nutch can be configured (via the conf files alone) to store each downloaded file as a file on disk (one file downloaded - one file on your local disk). But the "byte array called content" could be written out directly from your plugin, I think - that's worth a try. The fetcher uses (binary) streams to handle the downloaded content, so the bytes *should* arrive at your plugin intact.
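Just to illustrate what I mean, something along these lines could write
that byte array out as a binary file from inside a parse plugin. It is
only a rough, untested sketch - the class name, the dump directory and
the URL-encoded file name are my own choices, not anything nutch gives
you:

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.URLEncoder;

// Rough sketch (untested): dump the raw "content" byte array to the
// local file system. One file per fetched URL is an assumption; adjust
// the naming scheme to whatever layout you need.
public class ContentDumper {

  public static void dump(String url, byte[] content, File dumpDir)
      throws IOException {
    if (!dumpDir.exists()) {
      dumpDir.mkdirs();
    }
    // Encode the URL so it can be used as a file name.
    String fileName = URLEncoder.encode(url, "UTF-8");
    FileOutputStream out = new FileOutputStream(new File(dumpDir, fileName));
    try {
      out.write(content);  // binary write, no character encoding involved
    } finally {
      out.close();
    }
  }
}

You would call something like ContentDumper.dump(content.getUrl(),
content.getContent(), new File("/tmp/nutch-files")) from your parse-zip
code, before or instead of the actual parsing.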
Another approach (my two cents):

1. Run the fetcher with the -noParse option (most likely not even necessary).
2. Check whether the fetcher is advised to store the content (there is a property for this in nutch-default.xml).
3. Create a dump with the "readseg" command and the "-dump" option.
4. Process the dump file and cut out what you need.

Just interested whether that could work . . . however, I had a look at the class implementing the readseg command and found that the dump file is written with a "PrintWriter", which will mangle binary content, I think. Maybe you can modify SegmentReader to use an OutputStream instead (a rough sketch is at the very end of this mail). The fetcher itself uses a binary stream to store the content (FSDataOutputStream), so the content should still be intact at that point.

Cheers,
Martin

On 9/11/07, eyal edri <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> I've asked this question before on a different mailing list, with no real
> response. I hope someone sees the need for this and can help.
>
> I'm trying to configure nutch to download certain file types (exe/zip) to
> the file system while crawling. I know nutch doesn't have a parse-exe
> plugin, so I'll focus on ZIP (once I understand the logic, I will write a
> parse-exe plugin).
>
> I want to know whether nutch supports downloading files out of the box
> (using only conf files) or, if not, how I can alter the parse-zip plugin
> to save the file.
> (I saw the parser gets a byte array called "content" - can I save this to
> the fs?)
>
> thanks,
>
>
> --
> Eyal Edri
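P.S. Here is the rough sketch I mentioned for dumping a segment's raw
content with a binary stream instead of the PrintWriter. It is untested
and makes a few assumptions - that the fetched content sits under
<segment>/content/part-00000/data as a sequence file keyed by URL, and
that one URL-encoded file per record is what you want. Treat it as a
starting point, not as how SegmentReader actually does it:

import java.io.File;
import java.io.FileOutputStream;
import java.net.URLEncoder;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

// Rough sketch (untested): read a segment's content data file record by
// record and write each record's raw bytes with a FileOutputStream.
// Assumed usage: java SegmentContentDumper <segment dir> <output dir>
public class SegmentContentDumper {

  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);

    // Assumption: the fetcher stored the content in a single part file.
    Path data = new Path(args[0], "content/part-00000/data");
    File outDir = new File(args[1]);
    outDir.mkdirs();

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    Text url = new Text();
    Content content = new Content();
    try {
      while (reader.next(url, content)) {
        String fileName = URLEncoder.encode(url.toString(), "UTF-8");
        FileOutputStream out =
            new FileOutputStream(new File(outDir, fileName));
        try {
          out.write(content.getContent());  // raw bytes, nothing mangled
        } finally {
          out.close();
        }
      }
    } finally {
      reader.close();
    }
  }
}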
