Hello Vineet,

Try using regex-urlfilter.txt instead of crawl-urlfilter.txt.

Regards,
Arkadi

> -----Original Message-----
> From: Vineet Garg [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, April 02, 2008 10:34 PM
> To: [email protected]
> Subject: Nutch fetching skipped files
>
> Hi,
> I am using Nutch to crawl local file system. I am crawling by bin/nutch
> crawl urls -dir crawl -depth 5 -topN 500 > & crawl.log.
> But nutch is fetching files e.g. .css or .png files which i have set to
> be skipped in crawl-urlfilter.txt file and throwing error while parsing:
>
> fetching file:/hm/vineetg/SPD38/libraries/old_spb/.spw_hidden
> fetching file:/hm/vineetg/SPD38/share/doc/spwlibcomm/catalog.css
> fetching file:/hm/vineetg/SPD38/libraries/example_hiers/.spw_hidden
> fetching file:/hm/vineetg/SPD38/share/doc/spwlibcomm/spwlibcommPreface_2.html
> fetching file:/hm/vineetg/SPD38/share/doc/spweul/images/
> fetching file:/hm/vineetg/SPD38/share/doc/spwfds/wwhdata/
> fetching file:/hm/vineetg/SPD38/share/doc/spw2xml/DBMigrator_methodology_5.html
> fetching file:/hm/vineetg/SPD38/share/doc/spwtutorial_advanced/spwtutorial_advancedPreface_4.html
> fetching file:/hm/vineetg/SPD38/share/doc/spwcsc/title_1.html
> fetching file:/hm/vineetg/SPD38/share/doc/spwveis136/images/
> Error parsing: file:/hm/vineetg/SPD38/libraries/old_spb/.spw_hidden: failed(2,200): org.apache.nutch.parse.ParseException: parser not found for contentType= url=file:/hm/vineetg/SPD38/libraries/old_spb/.spw_hidden
> fetching file:/hm/vineetg/SPD38/share/doc/spwlibwlan/chap1_6.html
> Error parsing: file:/hm/vineetg/SPD38/libraries/example_hiers/.spw_hidden: failed(2,200): org.apache.nutch.parse.ParseException: parser not found for contentType= url=file:/hm/vineetg/SPD38/libraries/example_hiers/.spw_hidden
> fetching file:/hm/vineetg/SPD38/libraries/cdma_rtl/fir/
> Error parsing: file:/hm/vineetg/SPD38/share/doc/spwlibcomm/catalog.css: failed(2,200): org.apache.nutch.parse.ParseException: parser not found for contentType=text/css url=file:/hm/vineetg/SPD38/share/doc/spwlibcomm/catalog.css
>
> my crawl-urlfilter file is:
>
> # The url filter file used by the crawl command.
>
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'. The first matching pattern in the file
> # determines whether a URL is included or ignored. If no pattern
> # matches, the URL is ignored.
>
> # skip http:, ftp:, & mailto: urls
> #-^(http|ftp|mailto):
> +^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(css|gif|GIF|jpg|JPG|png|PNG|ico|ICO|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # skip URLs containing certain characters as probable queries, etc.
> [EMAIL PROTECTED]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/.+?)/.*?\1/.*?\1/
>
> # accept hosts in MY.DOMAIN.NAME
> #+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> +^file*:///hm/vineetg/SPD38/libraries/([a-zA-Z0-9]*\.)
> +^file*:///hm/vineetg/SPD38/share/doc/([a-zA-Z0-9]*\.)
> # skip everything else
> -.
>
> nutch-site.xml :
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
>
>   <property>
>     <name>http.agent.name</name>
>     <value>ESL</value>
>     <description></description>
>   </property>
>
>   <property>
>     <name>http.agent.description</name>
>     <value>MyDescription</value>
>     <description></description>
>   </property>
>
>   <property>
>     <name>http.agent.url</name>
>     <value>myurlcom</value>
>     <description></description>
>   </property>
>
>   <property>
>     <name>http.agent.email</name>
>     <value>[EMAIL PROTECTED]</value>
>     <description></description>
>   </property>
>
>   <property>
>     <name>plugin.includes</name>
>     <value>protocol-http|protocol-file|regex-urlfilter|parse-html|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>     <description></description>
>   </property>
>
>   <property>
>     <name>plugin.folders</name>
>     <value>/hm/vineetg/nutch-0.9/plugins</value>
>     <description></description>
>   </property>
>
>   <property>
>     <name>file.content.limit</name>
>     <value>-1</value>
>     <description>the length for downloaded content</description>
>   </property>
>
> </configuration>
>
> What could be the reason??
>
> Regards,
> Vineet
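
One more thing that may be worth checking, whichever filter file ends up being read: the comment in the posted file says the first matching pattern decides, and the `+^(file|ftp|mailto):` line sits above the suffix skip rule. Below is a rough Python sketch of that first-match behaviour. It is not Nutch's code: Python's `re` stands in for the Java regex engine, the suffix list is shortened, and `accepts`, `POSTED_RULES` and `REORDERED_RULES` are names made up for this illustration. The reordered list is only a hypothetical variant to compare against, not a confirmed fix.

```python
import re

# Rules in the order they appear in the posted crawl-urlfilter.txt.
# Assumption: the suffix list is shortened here for readability.
POSTED_RULES = [
    ("+", r"^(file|ftp|mailto):"),
    ("-", r"\.(css|gif|jpg|png|ico|zip|exe|jpeg|bmp)$"),
    ("-", r"."),
]

# Hypothetical reordering: the suffix skip rule is checked first.
REORDERED_RULES = [
    ("-", r"\.(css|gif|jpg|png|ico|zip|exe|jpeg|bmp)$"),
    ("+", r"^(file|ftp|mailto):"),
    ("-", r"."),
]


def accepts(rules, url):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no pattern matched: the URL is ignored


css_url = "file:/hm/vineetg/SPD38/share/doc/spwlibcomm/catalog.css"
html_url = "file:/hm/vineetg/SPD38/share/doc/spwcsc/title_1.html"

print(accepts(POSTED_RULES, css_url))      # posted order
print(accepts(REORDERED_RULES, css_url))   # suffix rule first
print(accepts(REORDERED_RULES, html_url))  # suffix rule first
```

In this sketch the posted order accepts catalog.css because the `file:` rule matches before the suffix rule is reached, while the reordered list rejects it and still accepts the .html file. If the real filter behaves the same way, moving the skip rules above the `+^(file|ftp|mailto):` line might be worth a try.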
