[jira] Resolved: (NUTCH-824) Crawling - File Error 404 when fetching file with an hexadecimal character in the file name.

2010-05-27 Thread Michela Becchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michela Becchi resolved NUTCH-824.
--

Fix Version/s: 1.0.0
   Resolution: Fixed

Hi,

I fixed (or, at least, circumvented) this by modifying the 
org.apache.nutch.protocol.file.FileResponse class belonging to the 
protocol-file plugin.

In particular, at line 120, I added

120 String path = "".equals(url.getPath()) ? "/" : url.getPath();
121 +String decoded_path = path;  //@Michela 
122 
123 +try {
124 +  decoded_path = java.net.URLDecoder.decode(path, "UTF-8");
125 +} catch (Exception ex) {
126 +  // keep the undecoded path if decoding fails
127 +}

Then, rather than

- java.io.File f = new java.io.File(path);

I have

+ java.io.File f = new java.io.File(decoded_path);
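
To illustrate why the change above helps, here is a minimal, self-contained sketch (not the actual FileResponse code). It assumes only standard JDK classes; the class name FilePathDecodeSketch and the helper decodedPath are made up for this example. java.net.URL.getPath() returns the path with its percent-escapes intact, so the path has to be decoded before it is handed to java.io.File.

```java
import java.io.UnsupportedEncodingException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLDecoder;

public class FilePathDecodeSketch {

    // Hypothetical helper: decode percent-escapes in a URL path,
    // falling back to the raw path if decoding fails.
    static String decodedPath(String rawPath) {
        String path = "".equals(rawPath) ? "/" : rawPath;
        try {
            return URLDecoder.decode(path, "UTF-8");
        } catch (UnsupportedEncodingException | IllegalArgumentException ex) {
            // Unknown charset or malformed escape: keep the undecoded path.
            return path;
        }
    }

    public static void main(String[] args) throws MalformedURLException {
        URL url = new URL(
            "file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html");
        // Raw path, still containing %28 and %29.
        System.out.println(url.getPath());
        // Decoded path, containing the literal parentheses.
        System.out.println(decodedPath(url.getPath()));
    }
}
```

One caveat with this approach: URLDecoder implements form decoding, so it also turns a literal "+" in a file name into a space. That may or may not matter for a given crawl.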

Thanks,

Michela

 Crawling - File Error 404 when fetching file with an hexadecimal character in 
 the file name.
 

 Key: NUTCH-824
 URL: https://issues.apache.org/jira/browse/NUTCH-824
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
 Environment: Linux nube 2.6.31-20-server #58-Ubuntu SMP x86_64 
 GNU/Linux
Reporter: Michela Becchi
 Fix For: 1.0.0


 Hello,
 I am crawling a local file system.
 My problem is the following: files whose names contain percent-encoded 
 (hexadecimal) characters, such as %28, do not get crawled.
 For example, I will see the following error:
 fetching 
 file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html
 org.apache.nutch.protocol.file.FileError: File Error: 404
 at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:92)
 at 
 org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:535)
 fetch of 
 file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html
  failed with: org.apache.nutch.protocol.file.FileError: File Error: 404
 I am using nutch-1.0.
 Among other standard settings, I configured nutch-site.xml as follows:
 <property>
   <name>plugin.includes</name>
   <value>protocol-file|protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
   <description>Regular expression naming plugin directory names to
   include.  Any plugin not matching this expression is excluded.
   In any case you need at least include the nutch-extensionpoints plugin. By
   default Nutch includes crawling just HTML and plain text via HTTP,
   and basic indexing and search plugins. In order to use HTTPS please enable
   protocol-httpclient, but be aware of possible intermittent problems with the
   underlying commons-httpclient library.
   </description>
 </property>
 <property>
   <name>file.content.limit</name>
   <value>-1</value>
 </property>
 Moreover, crawl-urlfilter.txt   looks like:
 # skip http:, ftp:,  mailto: urls
 -^(http|ftp|mailto):
 # skip image and other suffixes we can't yet parse
 -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
 # skip URLs containing certain characters as probable queries, etc.
 -[...@=]
 # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
 -.*(/[^/]+)/[^/]+\1/[^/]+\1/
 # accept hosts in MY.DOMAIN.NAME
 #+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
 # accept everything else
 +.*
 ~
 ---
 Thanks,
 Michela

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Crawling - File Error 404 when fetching file with an hexadecimal character in the file name.

2010-05-27 Thread Michela Becchi

Hi,

I circumvented this problem by modifying the
org.apache.nutch.protocol.file.FileResponse class belonging to the
protocol-file plugin.

In particular, at line 120, I added

String path = "".equals(url.getPath()) ? "/" : url.getPath();
+String decoded_path = path;
+try {
+  decoded_path = java.net.URLDecoder.decode(path, "UTF-8");
+} catch (Exception ex) {
+  // keep the undecoded path if decoding fails
+}

Then, rather than

- java.io.File f = new java.io.File(path);

I have

+ java.io.File f = new java.io.File(decoded_path);
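
As a possible alternative to the URLDecoder workaround (this is a different technique, not what the patch above does), java.net.URI.getPath() returns the path with percent-escapes already decoded and leaves a literal "+" untouched. The sketch below assumes the file URL is a syntactically valid URI; the class and method names are hypothetical.

```java
import java.net.URI;

public class UriPathSketch {

    // Hypothetical helper: parse a file URL as a URI and return its decoded path.
    // URI.create throws IllegalArgumentException if the string is not a valid URI.
    static String decodedPath(String fileUrl) {
        String path = URI.create(fileUrl).getPath();
        return (path == null || path.isEmpty()) ? "/" : path;
    }

    public static void main(String[] args) {
        String url =
            "file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html";
        // Prints the path with literal parentheses instead of %28 and %29.
        System.out.println(decodedPath(url));
    }
}
```

Whether this is preferable depends on how strictly the crawled URLs follow URI syntax, since URI parsing rejects characters that java.net.URL would accept.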

Thanks,

Michela
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Crawling-File-Error-404-when-fetching-file-with-an-hexadecimal-character-in-the-file-name-tp826407p848871.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.


[jira] Created: (NUTCH-824) Crawling - File Error 404 when fetching file with an hexadecimal character in the file name.

2010-05-20 Thread Michela Becchi (JIRA)
Crawling - File Error 404 when fetching file with an hexadecimal character in 
the file name.


 Key: NUTCH-824
 URL: https://issues.apache.org/jira/browse/NUTCH-824
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
 Environment: Linux nube 2.6.31-20-server #58-Ubuntu SMP x86_64 
GNU/Linux
Reporter: Michela Becchi
Priority: Blocker


Hello,

I am crawling a local file system.
My problem is the following: files whose names contain percent-encoded 
(hexadecimal) characters, such as %28, do not get crawled.

For example, I will see the following error:

fetching 
file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html
org.apache.nutch.protocol.file.FileError: File Error: 404
at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:92)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:535)
fetch of 
file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html 
failed with: org.apache.nutch.protocol.file.FileError: File Error: 404

I am using nutch-1.0.

Among other standard settings, I configured nutch-site.xml as follows:

<property>
  <name>plugin.includes</name>
  <value>protocol-file|protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>

<property>
  <name>file.content.limit</name>
  <value>-1</value>
</property>

Moreover, crawl-urlfilter.txt   looks like:

# skip http:, ftp:,  mailto: urls
-^(http|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[...@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

# accept everything else
+.*
~

---

Thanks,

Michela





Re: Crawling - File Error 404 when fetching file with an hexadecimal character in the file name.

2010-05-20 Thread Michela Becchi

Hi Julien,

Thanks a lot.

I tried the same test you indicated (bin/nutch plugin protocol-file 
org.apache.nutch.protocol.file ...) and again got an Error 404. Of course,
I don't get this error if, when issuing the command, I replace the
hexadecimal representation with the literal character (e.g., %28 with "(").
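
The behaviour described above can be reproduced outside Nutch with a small sketch. It assumes only the JDK; the class name and the temporary directory prefix are invented for illustration. A file is created with literal parentheses in its name, and java.io.File is then asked whether the percent-encoded and the literal names exist.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class EncodedNameLookupSketch {

    // Returns { existsUnderEncodedName, existsUnderLiteralName }.
    static boolean[] probe() {
        try {
            Path dir = Files.createTempDirectory("nutch-824-sketch");
            Path real = Files.createFile(dir.resolve("A.M._(album)_8a09.html"));
            try {
                // java.io.File treats %28 and %29 as ordinary characters.
                File encoded = new File(dir.toFile(), "A.M._%28album%29_8a09.html");
                File literal = new File(dir.toFile(), "A.M._(album)_8a09.html");
                return new boolean[] { encoded.exists(), literal.exists() };
            } finally {
                // Clean up the temporary file and directory.
                Files.deleteIfExists(real);
                Files.deleteIfExists(dir);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] result = probe();
        System.out.println("encoded name exists: " + result[0]);
        System.out.println("literal name exists: " + result[1]);
    }
}
```

The lookup under the encoded name fails because the file system has no entry with that exact name, which is consistent with the 404 reported by the protocol-file plugin.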

I opened an issue in JIRA, as you suggested.

Michela
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Crawling-File-Error-404-when-fetching-file-with-an-hexadecimal-character-in-the-file-name-tp826407p832811.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.