Related issue?
http://www.mail-archive.com/[email protected]/msg06135.html
[EMAIL PROTECTED] wrote:
Hi all.
I have a problem configuring nutch-default.xml. I am in China, and most of the FTP sites
I want to crawl use a Chinese encoding. When Nutch crawls these sites, it cannot
determine the correct charset, so the parse results are garbled and
useless. So I changed
<property>
<name>parser.character.encoding.default</name>
<value>windows-1252</value>
</property>
to <value>gb2312</value> and got a very interesting result: Nutch can now crawl
the files and directories in the root directory of Chinese FTP sites without any garbled
characters, but it can NOT crawl any files in subdirectories; it just returns a 404 "not found" result.
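For reference, here is a sketch of what the changed setting looks like as a complete config file. It assumes the override goes in conf/nutch-site.xml, the usual place for local overrides of nutch-default.xml, rather than being edited in nutch-default.xml itself as described above. The property name and value are the ones from this message. This only restates the change; it is not a fix for the subdirectory problem.

```xml
<?xml version="1.0"?>
<!-- Sketch of conf/nutch-site.xml (assumed location for local overrides). -->
<configuration>
  <property>
    <!-- Fallback charset used when the parser cannot detect one. -->
    <name>parser.character.encoding.default</name>
    <!-- Changed from the shipped default, windows-1252. -->
    <value>gb2312</value>
  </property>
</configuration>
```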
I know there must be something wrong in the config files, but how and where can I configure Nutch to crawl a Chinese FTP site?
I have been working on this problem for half a month and have found no way to solve it. Could anyone help me?
Thanks.