Hey Eyal,
There is no parser for "application/x-dosexec". You will have to write a
plugin to parse EXE files (have a look at the parse-zip plugin).
Storing the unzipped contents:
Option 1:
I think you can modify the parse-zip plugin's ZipParser class to store the
unzipped contents at a location of your choice.
Option 2:
Or write a separate job that reads the parse-text contents and stores them
at a location of your choice.
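For Option 1, the extraction step itself could look roughly like the sketch below. It uses only java.util.zip and java.nio.file (so it assumes Java 7 or later) and does not touch any Nutch API. The class name ZipDump, the method unzipTo and the target directory are hypothetical names for illustration, and wiring it into ZipParser is left out because that depends on your Nutch version.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of Nutch: writes the entries of a zip stream to disk.
public class ZipDump {

    /** Extracts every entry under targetDir and returns the number of files written. */
    public static int unzipTo(InputStream zipBytes, Path targetDir) throws IOException {
        Path root = targetDir.toAbsolutePath().normalize();
        Files.createDirectories(root);
        int written = 0;
        try (ZipInputStream zin = new ZipInputStream(zipBytes)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                Path out = root.resolve(entry.getName()).normalize();
                // Refuse entries such as "../../etc/passwd" that point outside the target.
                if (!out.startsWith(root)) {
                    throw new IOException("Entry escapes target directory: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(out);
                } else {
                    Files.createDirectories(out.getParent());
                    // Copies up to the end of the current entry; the stream stays open.
                    Files.copy(zin, out, StandardCopyOption.REPLACE_EXISTING);
                    written++;
                }
                zin.closeEntry();
            }
        }
        return written;
    }
}
```

You would pass it the fetched bytes wrapped in a ByteArrayInputStream, plus the directory you want the files in. The path check is there because the zip files come from arbitrary web pages.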
- Sagar Naik
eyal edri wrote:
Hi,
I am trying to use the Nutch fetcher to download EXE/ZIP files from web pages.
I've removed the suffixes from regex-urlfilter and
automaton-urlfilter (the files are identical):
regex-urlfilter.txt:
--------------------------------------------------------------------------------------------------------
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|jpeg|JPEG|bmp|BMP|iso|ISO|bin|BIN)$
# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/
# accept anything else
+.
------------------------------------------------------------------------------------------------------------------
When trying to download an EXE:
http://www.xtodvd.com/apodvdcopy.exe
the fetch fails:
found segment crawl/segments/20070902084928
Fetching now the urls..
Fetcher: starting
Fetcher: segment: crawl/segments/20070902084928
Fetcher: threads: 1000
fetching http://www.xtodvd.com/apodvdcopy.exe
Error parsing: http://createdvd.net/apodvdcopy.exe: failed(2,200):
org.apache.nutch.parse.ParseException : parser not found for
contentType=application/x-dosexec url=http://createdvd.net/apodvdcopy.exe
Fetcher: done
When I try to fetch a ZIP file it works, but how can I tell the fetcher to
save the ZIP to a directory on the file system? Do I need to write a
plugin?
Thanks!