Hi Michela,

I tried the following command on a dummy file:

    bin/nutch plugin protocol-file org.apache.nutch.protocol.file.File file:/tmp/A.M._%28album%29_8a09.html

and got the expected results:

    Content-Type: text/html
    Content-Length: 47067
    Last-Modified: Tue, 18 May 2010 16:05:46 GMT

I assume that your local file is named A.M._(album)_8a09.html, in which case
we do indeed get a 404: it seems protocol-file looks the path up exactly as
it appears in the URL, so "%28" and "%29" are never decoded back to "(" and ")".
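To make the mismatch concrete, here is a minimal sketch in plain Java (not
the plugin's actual code; the /tmp path is just my dummy file from above):

    import java.io.File;
    import java.net.URLDecoder;

    public class EncodedPathCheck {
        public static void main(String[] args) throws Exception {
            // Path component of the failing URL, parentheses percent-encoded
            String rawPath = "/tmp/A.M._%28album%29_8a09.html";

            // Used verbatim, this only matches a file literally named
            // "A.M._%28album%29_8a09.html" (which is how I named my dummy file)
            System.out.println("encoded lookup: " + new File(rawPath).exists());

            // Percent-decoding first matches a file named "A.M._(album)_8a09.html",
            // which is presumably what is actually on your disk
            String decodedPath = URLDecoder.decode(rawPath, "UTF-8");
            System.out.println("decoded lookup: " + new File(decodedPath).exists());
        }
    }

With a file named with literal parentheses on disk, the first lookup fails
the same way your fetch does and the second succeeds.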
Could you please describe the issue in JIRA?

Thanks

Julien

--
DigitalPebble Ltd
http://www.digitalpebble.com

On 18 May 2010 15:18, Michela Becchi <[email protected]> wrote:

> Hello,
>
> I am performing a local file system crawl.
> My problem is the following: no file whose name contains percent-encoded
> (hexadecimal) characters gets crawled.
>
> For example, I see the following error:
>
>     fetching file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html
>     org.apache.nutch.protocol.file.FileError: File Error: 404
>         at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:92)
>         at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:535)
>     fetch of file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html
>     failed with: org.apache.nutch.protocol.file.FileError: File Error: 404
>
> I am using nutch-1.0.
>
> Among other standard settings, I configured nutch-site.xml as follows:
>
>     <property>
>       <name>plugin.includes</name>
>       <value>protocol-file|protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>       <description>Regular expression naming plugin directory names to
>       include. Any plugin not matching this expression is excluded.
>       In any case you need at least include the nutch-extensionpoints plugin.
>       By default Nutch includes crawling just HTML and plain text via HTTP,
>       and basic indexing and search plugins. In order to use HTTPS please
>       enable protocol-httpclient, but be aware of possible intermittent
>       problems with the underlying commons-httpclient library.
>       </description>
>     </property>
>
>     <property>
>       <name>file.content.limit</name>
>       <value>-1</value>
>     </property>
>
> Moreover, crawl-urlfilter.txt looks like:
>
>     # skip http:, ftp:, & mailto: urls
>     -^(http|ftp|mailto):
>
>     # skip image and other suffixes we can't yet parse
>     -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
>     # skip URLs containing certain characters as probable queries, etc.
>     -[?*!@=]
>
>     # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
>     -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>
>     # accept hosts in MY.DOMAIN.NAME
>     #+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
>
>     # accept everything else
>     +.*
>
> ---
>
> Thanks,
>
> Michela
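PS: a quick check that the crawl-urlfilter.txt rules quoted above are not
what rejects this URL. This is a rough sketch, not Nutch's actual filter
code: it assumes the stock -[?*!@=] rule and that urlfilter-regex treats a
"skip" rule as matching when its pattern is found anywhere in the URL:

    import java.util.regex.Pattern;

    public class UrlFilterCheck {
        public static void main(String[] args) {
            String url = "file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/"
                       + "A.M._%28album%29_8a09.html";
            // The four "skip" rules from the crawl-urlfilter.txt quoted above
            String[] skipRules = {
                "^(http|ftp|mailto):",
                "\\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt"
                    + "|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$",
                "[?*!@=]",
                ".*(/[^/]+)/[^/]+\\1/[^/]+\\1/"
            };
            for (String rule : skipRules) {
                // A rule rejects the URL when its pattern occurs anywhere in it
                boolean rejected = Pattern.compile(rule).matcher(url).find();
                System.out.println((rejected ? "SKIP " : "PASS ") + rule);
            }
            // All four rules print PASS: the URL survives filtering, so the
            // 404 comes from protocol-file's lookup, not from the filters.
        }
    }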