Hi Michela,

I tried the following command on a dummy file:

    bin/nutch plugin protocol-file org.apache.nutch.protocol.file.File file:/tmp/A.M._%28album%29_8a09.html

and got the expected results:

    Content-Type: text/html
    Content-Length: 47067
    Last-Modified: Tue, 18 May 2010 16:05:46 GMT

I assume that your local file is named A.M._(album)_8a09.html, in which case
we do indeed get a 404: it seems protocol-file looks the path up exactly as
it appears in the URL, so "%28" and "%29" are never decoded back to "(" and ")".
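To make the mismatch concrete, here is a minimal sketch in plain Java (not
the plugin's actual code; the /tmp path is just my dummy file from above):

    import java.io.File;
    import java.net.URLDecoder;

    public class EncodedPathCheck {
        public static void main(String[] args) throws Exception {
            // Path component of the failing URL, parentheses percent-encoded
            String rawPath = "/tmp/A.M._%28album%29_8a09.html";

            // Used verbatim, this only matches a file literally named
            // "A.M._%28album%29_8a09.html" (which is how I named my dummy file)
            System.out.println("encoded lookup: " + new File(rawPath).exists());

            // Percent-decoding first matches a file named "A.M._(album)_8a09.html",
            // which is presumably what is actually on your disk
            String decodedPath = URLDecoder.decode(rawPath, "UTF-8");
            System.out.println("decoded lookup: " + new File(decodedPath).exists());
        }
    }

With a file named with literal parentheses on disk, the first lookup fails
the same way your fetch does and the second succeeds.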
Could you please describe the issue in JIRA?

Thanks

Julien

--
DigitalPebble Ltd
http://www.digitalpebble.com

On 18 May 2010 15:18, Michela Becchi <[email protected]> wrote:

> Hello,
>
> I am performing a local file system crawl.
> My problem is the following: no file whose name contains percent-encoded
> (hexadecimal) characters gets crawled.
>
> For example, I see the following error:
>
>     fetching file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html
>     org.apache.nutch.protocol.file.FileError: File Error: 404
>         at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:92)
>         at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:535)
>     fetch of file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html
>     failed with: org.apache.nutch.protocol.file.FileError: File Error: 404
>
> I am using nutch-1.0.
>
> Among other standard settings, I configured nutch-site.xml as follows:
>
>     <property>
>       <name>plugin.includes</name>
>       <value>protocol-file|protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>       <description>Regular expression naming plugin directory names to
>       include. Any plugin not matching this expression is excluded.
>       In any case you need at least include the nutch-extensionpoints plugin.
>       By default Nutch includes crawling just HTML and plain text via HTTP,
>       and basic indexing and search plugins. In order to use HTTPS please
>       enable protocol-httpclient, but be aware of possible intermittent
>       problems with the underlying commons-httpclient library.
>       </description>
>     </property>
>
>     <property>
>       <name>file.content.limit</name>
>       <value>-1</value>
>     </property>
>
> Moreover, crawl-urlfilter.txt looks like:
>
>     # skip http:, ftp:, & mailto: urls
>     -^(http|ftp|mailto):
>
>     # skip image and other suffixes we can't yet parse
>     -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
>     # skip URLs containing certain characters as probable queries, etc.
>     -[?*!@=]
>
>     # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
>     -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>
>     # accept hosts in MY.DOMAIN.NAME
>     #+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
>
>     # accept everything else
>     +.*
>
> ---
>
> Thanks,
>
> Michela
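PS: a quick check that the crawl-urlfilter.txt rules quoted above are not
what rejects this URL. This is a rough sketch, not Nutch's actual filter
code: it assumes the stock -[?*!@=] rule and that urlfilter-regex treats a
"skip" rule as matching when its pattern is found anywhere in the URL:

    import java.util.regex.Pattern;

    public class UrlFilterCheck {
        public static void main(String[] args) {
            String url = "file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/"
                       + "A.M._%28album%29_8a09.html";
            // The four "skip" rules from the crawl-urlfilter.txt quoted above
            String[] skipRules = {
                "^(http|ftp|mailto):",
                "\\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt"
                    + "|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$",
                "[?*!@=]",
                ".*(/[^/]+)/[^/]+\\1/[^/]+\\1/"
            };
            for (String rule : skipRules) {
                // A rule rejects the URL when its pattern occurs anywhere in it
                boolean rejected = Pattern.compile(rule).matcher(url).find();
                System.out.println((rejected ? "SKIP " : "PASS ") + rule);
            }
            // All four rules print PASS: the URL survives filtering, so the
            // 404 comes from protocol-file's lookup, not from the filters.
        }
    }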