Crawling - File Error 404 when fetching a file with a Chinese word in the file
name
-----------------------------------------------------------------------------------
Key: NUTCH-968
URL: https://issues.apache.org/jira/browse/NUTCH-968
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.2
Environment: CentOS 5.4 with zh_CN.UTF8
Reporter: Dominic Xu
I am crawling a local file system.
The problem: no file whose name contains Chinese characters gets
crawled.
Example:
fetching /mnt/中文.txt
This produces the error: org.apache.nutch.protocol.file.FileError: File Error: 404.
I read NUTCH-824 (https://issues.apache.org/jira/browse/NUTCH-824)
and applied the patch from trunk (committed revision 1056394),
but the bug is still not fixed.
I fixed the problem by modifying the file:
src/plugin/protocol-file/src/java/org/apache/nutch/protocol/file/FileResponse.java
   for (int i=0; i<list.length; i++) {
     f = list[i];
     String name = f.getName();
+    try {
+      // specify the encoding via the config later?
+      name = java.net.URLEncoder.encode(name, "UTF-8");
+    } catch (UnsupportedEncodingException ex) {
+      // cannot happen: every Java platform is required to support UTF-8
+    }
+
     String time = HttpDateFormat.toString(f.lastModified());
The file name must be URL-encoded as UTF-8.
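To illustrate what the encoding step does to such a name, here is a minimal standalone sketch. The class and method names (FileNameEncodingSketch, encodeName, decodeName) are made up for this example and are not part of Nutch; only java.net.URLEncoder and java.net.URLDecoder are real APIs.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper class, for illustration only (not Nutch code).
final class FileNameEncodingSketch {

    // Percent-encode a single file name using its UTF-8 bytes.
    static String encodeName(String name) {
        try {
            return URLEncoder.encode(name, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a required charset on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Reverse of encodeName: turn the percent-encoded form back into the name.
    static String decodeName(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Unicode escapes for the name from the example above, so this
        // source file does not depend on the compiler's file encoding.
        String name = "\u4e2d\u6587.txt";
        String encoded = encodeName(name);
        System.out.println(encoded);                           // %E4%B8%AD%E6%96%87.txt
        System.out.println(decodeName(encoded).equals(name));  // true
    }
}
```

Note that URLEncoder implements HTML form encoding, so a space in a name becomes "+" rather than "%20". Whether that matters for the file protocol plugin is not something I have checked.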
I also modified the generated HTML content to include a meta tag that declares the charset:
- StringBuffer x = new StringBuffer("<html><head>");
+ StringBuffer x = new StringBuffer("<html><head><meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />");
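Putting the two changes together, a generated directory listing might look roughly like the output of the sketch below. This is not the actual FileResponse code: DirListingSketch and listingHtml are hypothetical names, and the markup is simplified to show only the charset meta tag and the encoded link targets.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical, simplified stand-in for a directory-listing generator.
final class DirListingSketch {

    // Build an HTML index page whose links use UTF-8 percent-encoded names.
    static String listingHtml(String dirPath, String[] names) {
        StringBuilder x = new StringBuilder("<html><head>");
        // Declare the charset so that a parser does not have to guess it.
        x.append("<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />");
        x.append("<title>Index of ").append(dirPath).append("</title></head><body>\n");
        for (String name : names) {
            String href;
            try {
                href = URLEncoder.encode(name, "UTF-8");
            } catch (UnsupportedEncodingException e) {
                // UTF-8 is a required charset on every Java platform.
                throw new IllegalStateException(e);
            }
            // The link target is encoded; the visible text keeps the raw name
            // (HTML escaping of the visible text is omitted for brevity).
            x.append("<a href='").append(href).append("'>")
             .append(name).append("</a><br>\n");
        }
        x.append("</body></html>\n");
        return x.toString();
    }

    public static void main(String[] args) {
        System.out.println(listingHtml("/mnt/", new String[] {"\u4e2d\u6587.txt"}));
    }
}
```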