[jira] Resolved: (NUTCH-824) Crawling - File Error 404 when fetching file with a hexadecimal character in the file name.
[ https://issues.apache.org/jira/browse/NUTCH-824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michela Becchi resolved NUTCH-824.
----------------------------------
    Fix Version/s: 1.0.0
       Resolution: Fixed

Hi,

I fixed (or at least circumvented) this by modifying the org.apache.nutch.protocol.file.FileResponse class belonging to the protocol-file plugin. In particular, at line 120 I added:

  120    String path = "".equals(url.getPath()) ? "/" : url.getPath();
  121   +String decoded_path = path; //@Michela
  122
  123   +try {
  124   +  decoded_path = java.net.URLDecoder.decode(path, "UTF-8");
  125   +} catch (Exception ex) {
  126   +}

Then, rather than

  - java.io.File f = new java.io.File(path);

I have

  + java.io.File f = new java.io.File(decoded_path);

Thanks,
Michela


Crawling - File Error 404 when fetching file with a hexadecimal character in the file name.
-------------------------------------------------------------------------------------------
                 Key: NUTCH-824
                 URL: https://issues.apache.org/jira/browse/NUTCH-824
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 1.0.0
         Environment: Linux nube 2.6.31-20-server #58-Ubuntu SMP x86_64 GNU/Linux
            Reporter: Michela Becchi
             Fix For: 1.0.0

Hello,

I am performing a local file system crawl. My problem is the following: no file whose name contains percent-encoded (hexadecimal) characters gets crawled. For example, I see the following error:

fetching file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html
org.apache.nutch.protocol.file.FileError: File Error: 404
        at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:92)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:535)
fetch of file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html failed with: org.apache.nutch.protocol.file.FileError: File Error: 404

I am using nutch-1.0.
Among other standard settings, I configured nutch-site.xml as follows:

<property>
  <name>plugin.includes</name>
  <value>protocol-file|protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins. In order to use HTTPS please enable protocol-httpclient, but be aware of possible intermittent problems with the underlying commons-httpclient library.</description>
</property>

<property>
  <name>file.content.limit</name>
  <value>-1</value>
</property>

Moreover, crawl-urlfilter.txt looks like:

# skip http:, ftp:, and mailto: urls
-^(http|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[...@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

# accept everything else
+.*

Thanks,
Michela

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
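The decoding workaround described in the message above can be sketched in isolation. This is a minimal, hypothetical illustration of what java.net.URLDecoder does to the path that failed in the report; the demo class and its harness are not part of Nutch:

```java
import java.io.File;
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class DecodePathDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Percent-encoded path exactly as it appears in the failing fetch log.
        String path = "/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html";

        // URLDecoder turns %28 / %29 back into the literal '(' and ')'
        // that actually appear in the on-disk file name.
        String decoded = URLDecoder.decode(path, "UTF-8");
        System.out.println(decoded);
        // -> /nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._(album)_8a09.html

        // A File built from the raw, still-encoded path names a file that
        // does not exist on disk, which is why protocol-file reported a 404.
        File raw = new File(path);
        File fixed = new File(decoded);
        System.out.println(raw.getName());   // A.M._%28album%29_8a09.html
        System.out.println(fixed.getName()); // A.M._(album)_8a09.html

        // Caveat: URLDecoder also maps '+' to a space, so a file name
        // containing a literal '+' would still break under this workaround.
    }
}
```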
Re: Crawling - File Error 404 when fetching file with a hexadecimal character in the file name.
Hi,

I circumvented this problem by modifying the org.apache.nutch.protocol.file.FileResponse class belonging to the protocol-file plugin. In particular, at line 120 I added:

   String path = "".equals(url.getPath()) ? "/" : url.getPath();
  +String decoded_path = path;
  +try {
  +  decoded_path = java.net.URLDecoder.decode(path, "UTF-8");
  +} catch (Exception ex) {}

Then, rather than

  - java.io.File f = new java.io.File(path);

I have

  + java.io.File f = new java.io.File(decoded_path);

Thanks,
Michela

--
View this message in context: http://lucene.472066.n3.nabble.com/Crawling-File-Error-404-when-fetching-file-with-an-hexadecimal-character-in-the-file-name-tp826407p848871.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.
[jira] Created: (NUTCH-824) Crawling - File Error 404 when fetching file with a hexadecimal character in the file name.
Crawling - File Error 404 when fetching file with a hexadecimal character in the file name.
-------------------------------------------------------------------------------------------
                 Key: NUTCH-824
                 URL: https://issues.apache.org/jira/browse/NUTCH-824
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 1.0.0
         Environment: Linux nube 2.6.31-20-server #58-Ubuntu SMP x86_64 GNU/Linux
            Reporter: Michela Becchi
            Priority: Blocker

Hello,

I am performing a local file system crawl. My problem is the following: no file whose name contains percent-encoded (hexadecimal) characters gets crawled. For example, I see the following error:

fetching file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html
org.apache.nutch.protocol.file.FileError: File Error: 404
        at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:92)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:535)
fetch of file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html failed with: org.apache.nutch.protocol.file.FileError: File Error: 404

I am using nutch-1.0. Among other standard settings, I configured nutch-site.xml as follows:

<property>
  <name>plugin.includes</name>
  <value>protocol-file|protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins. In order to use HTTPS please enable protocol-httpclient, but be aware of possible intermittent problems with the underlying commons-httpclient library.</description>
</property>

<property>
  <name>file.content.limit</name>
  <value>-1</value>
</property>

Moreover, crawl-urlfilter.txt looks like:

# skip http:, ftp:, and mailto: urls
-^(http|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[...@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

# accept everything else
+.*

Thanks,
Michela
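As an aside, the loop-breaking skip rule quoted from crawl-urlfilter.txt can be exercised directly with java.util.regex. A small sketch, with made-up sample URLs chosen purely for illustration:

```java
import java.util.regex.Pattern;

public class UrlFilterDemo {
    public static void main(String[] args) {
        // Skip rule from crawl-urlfilter.txt: a slash-delimited segment
        // repeated three or more times usually indicates a crawler trap.
        Pattern loop = Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

        // Hypothetical sample URLs, not from the report.
        String trapped = "http://host/a/x/a/y/a/z/";  // "/a" occurs 3 times
        String normal  = "http://host/a/b/c/d/";      // no repeated segment

        System.out.println(loop.matcher(trapped).find()); // true  -> URL skipped
        System.out.println(loop.matcher(normal).find());  // false -> URL kept
    }
}
```

The backreferences (\1) are what tie the three occurrences of the same segment together; a rule without them would reject any sufficiently deep path.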
Re: Crawling - File Error 404 when fetching file with a hexadecimal character in the file name.
Hi Julien,

Thanks a lot. I tried the same test you indicated (bin/nutch plugin protocol-file org.apache.nutch.protocol.file ...) and again got an Error 404. Of course, I don't get this error if, when issuing the command, I replace the hexadecimal representation with the literal character (e.g., %28 with '('). I opened an issue in JIRA, as you suggested.

Michela

--
View this message in context: http://lucene.472066.n3.nabble.com/Crawling-File-Error-404-when-fetching-file-with-an-hexadecimal-character-in-the-file-name-tp826407p832811.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.