Hi all.

I have a problem configuring nutch-default.xml. I am in China, and most of the
FTP sites I want to crawl use a Chinese encoding. When Nutch crawls these
sites it cannot detect the correct charset, so the parse results are garbled
and useless. So I changed the value of this property:

 <property>
 <name>parser.character.encoding.default</name>
 <value>windows-1252</value>
 </property>

to <value>gb2312</value> and got a very interesting result. Nutch can now crawl
the files and directories in the root directory of Chinese FTP sites without
any garbled characters, but it can NOT crawl any files in subdirectories. For
those it just returns "404 not found".
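
To show the kind of garbling I mean, here is a small Python sketch. It is an
illustration only, not Nutch code, and the sample text is a made-up example
rather than a name from a real site. It assumes the server sends GB2312 bytes
and decodes them once with the old default charset and once with the new one:

```python
# Illustration only, not Nutch code. The sample text is a made-up example.
sample = "中文"                        # hypothetical name in an FTP listing
raw = sample.encode("gb2312")          # bytes as a GB2312 server would send them

garbled = raw.decode("windows-1252")   # decoded with the old default charset
correct = raw.decode("gb2312")         # decoded with the charset I set

print(garbled)   # unreadable Latin characters
print(correct)   # the original Chinese text
```

The first line printed is the sort of output I was getting before the change.
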
I know there must be something wrong in my config files, but how and where do
I configure Nutch to crawl a Chinese FTP site?
I have been working on this problem for half a month and have found no way to
solve it. Could anyone help me?

Thanks.

 
