Hey Eyal,

There is no parser for "application/x-dosexec". You will have to write a
plugin to parse EXE files (have a look at the parse-zip plugin).
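
As a rough illustration of the first thing such a parser might check, here is
a minimal sketch that tests whether fetched bytes start with the DOS "MZ"
signature (0x4D 0x5A). The class and method names are made up for this
example and are not part of Nutch; wiring this into a real parse plugin is
left out.

```java
public class ExeSignature {

    // Returns true if the content begins with the two-byte DOS "MZ"
    // signature (0x4D 0x5A) that DOS/Windows executables start with.
    // Hypothetical helper for illustration; not a Nutch API.
    public static boolean looksLikeDosExecutable(byte[] content) {
        return content != null
            && content.length >= 2
            && content[0] == 0x4D
            && content[1] == 0x5A;
    }
}
```

A check like this only identifies the file type; what text or metadata to
extract from the executable is up to the plugin.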

Storing the unzipped contents:
   Option 1:
           I think you can modify the parse-zip plugin's ZipParser class to
           store the unzipped contents at some desired location.
   Option 2:
           Or write a separate job to read the parse-text contents and store
           them at some desired location.
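
For Option 1, the storing part could look roughly like the sketch below. It
uses only the standard java.util.zip and java.nio.file classes; the class
name UnzipToDirectory and the extract method are hypothetical, and how the
zip bytes are obtained inside ZipParser is an assumption I have not checked
against the plugin source.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class UnzipToDirectory {

    // Writes every file entry of the zip stream under targetDir and
    // returns the number of files written. Hypothetical helper, not Nutch code.
    public static int extract(InputStream zipBytes, Path targetDir) throws IOException {
        Path root = targetDir.toAbsolutePath().normalize();
        Files.createDirectories(root);
        int written = 0;
        try (ZipInputStream zin = new ZipInputStream(zipBytes)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                Path out = root.resolve(entry.getName()).normalize();
                // Refuse entries such as "../x" that would land outside targetDir.
                if (!out.startsWith(root)) {
                    throw new IOException("Entry escapes target directory: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(out);
                } else {
                    Files.createDirectories(out.getParent());
                    // Copies the bytes of the current entry only.
                    Files.copy(zin, out, StandardCopyOption.REPLACE_EXISTING);
                    written++;
                }
            }
        }
        return written;
    }
}
```

The path check matters here because the zip files come from arbitrary web
pages, so entry names cannot be trusted.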
- Sagar Naik


eyal edri wrote:
Hi,

I am trying to use the Nutch fetcher to download EXE/ZIP files from web pages.
I've removed those suffixes from regex-urlfilter &
automaton-urlfilter (the files are identical):


regex-urlfilter.txt:
--------------------------------------------------------------------------------------------------------
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|jpeg|JPEG|bmp|BMP|iso|ISO|bin|BIN)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept anything else
+.

------------------------------------------------------------------------------------------------------------------

When trying to download EXE:
http://www.xtodvd.com/apodvdcopy.exe

the fetch fails:
found segment crawl/segments/20070902084928
Fetching now the urls..
Fetcher: starting
Fetcher: segment: crawl/segments/20070902084928
Fetcher: threads: 1000
fetching http://www.xtodvd.com/apodvdcopy.exe
Error parsing: http://createdvd.net/apodvdcopy.exe: failed(2,200):
org.apache.nutch.parse.ParseException : parser not found for
contentType=application/x-dosexec url=http://createdvd.net/apodvdcopy.exe
Fetcher: done

When I try to fetch a ZIP file, it works, but how can I tell it to save the
zip to a directory on the file system? Do I need to write a plugin?

thanks!



