Find my reply inline.

On Wed, Apr 2, 2008 at 5:04 PM, Vineet Garg <[EMAIL PROTECTED]> wrote:
> Hi,
> I am using Nutch to crawl local file system. I am crawling by bin/nutch
> crawl urls -dir crawl -depth 5 -topN 500 > & crawl.log.
> But nutch is fetching files e.g. .css or .png files which i have set to be
> skipped in crawl-urlfilter.txt file and throwing error while parsing:
>
> fetching file:/hm/vineetg/SPD38/libraries/old_spb/.spw_hidden
> fetching file:/hm/vineetg/SPD38/share/doc/spwlibcomm/catalog.css
> fetching file:/hm/vineetg/SPD38/libraries/example_hiers/.spw_hidden
> fetching file:/hm/vineetg/SPD38/share/doc/spwlibcomm/spwlibcommPreface_2.html
> fetching file:/hm/vineetg/SPD38/share/doc/spweul/images/
> fetching file:/hm/vineetg/SPD38/share/doc/spwfds/wwhdata/
> fetching file:/hm/vineetg/SPD38/share/doc/spw2xml/DBMigrator_methodology_5.html
> fetching file:/hm/vineetg/SPD38/share/doc/spwtutorial_advanced/spwtutorial_advancedPreface_4.html
> fetching file:/hm/vineetg/SPD38/share/doc/spwcsc/title_1.html
> fetching file:/hm/vineetg/SPD38/share/doc/spwveis136/images/
> Error parsing: file:/hm/vineetg/SPD38/libraries/old_spb/.spw_hidden:
> failed(2,200): org.apache.nutch.parse.ParseException: parser not found for
> contentType= url=file:/hm/vineetg/SPD38/libraries/old_spb/.spw_hidden
> fetching file:/hm/vineetg/SPD38/share/doc/spwlibwlan/chap1_6.html
> Error parsing: file:/hm/vineetg/SPD38/libraries/example_hiers/.spw_hidden:
> failed(2,200): org.apache.nutch.parse.ParseException: parser not found for
> contentType= url=file:/hm/vineetg/SPD38/libraries/example_hiers/.spw_hidden
> fetching file:/hm/vineetg/SPD38/libraries/cdma_rtl/fir/
> Error parsing: file:/hm/vineetg/SPD38/share/doc/spwlibcomm/catalog.css:
> failed(2,200): org.apache.nutch.parse.ParseException: parser not found for
> contentType=text/css
> url=file:/hm/vineetg/SPD38/share/doc/spwlibcomm/catalog.css
>
>
> my crawl-urlfilter file is:
> # The url filter file used by the crawl command.
>
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'. The first matching pattern in the file
> # determines whether a URL is included or ignored. If no pattern
> # matches, the URL is ignored.
>
> # skip http:, ftp:, & mailto: urls
> #-^(http|ftp|mailto):
> +^(file|ftp|mailto):

You have allowed URLs beginning with "file:". Since this is the first
regular expression that matches the URLs being crawled, the rest of
crawl-urlfilter.txt is never consulted for them. If you read the comments
in this file, you'll find that it says, "The first matching pattern in the
file determines whether a URL is included or ignored." Moving the
suffix-skipping rule above the "+^(file|ftp|mailto):" line should let it
take effect.

Hope this helps.

Regards,
Susam Pal

>
>
> # skip image and other suffixes we can't yet parse
>
> -\.(css|gif|GIF|jpg|JPG|png|PNG|ico|ICO|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> What could be the reason??
>
> Regards,
> Vineet
>
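
To illustrate the first-match rule quoted from the filter file's comments, here is a minimal Python sketch. It is not Nutch code: the function name first_match_decision and the rule lists are made up for illustration, and it assumes each pattern is searched for anywhere in the URL. Only the two patterns and the two URLs are taken from the message above.

```python
import re

# Patterns copied from the crawl-urlfilter.txt quoted in the message.
ALLOW_SCHEMES = r"^(file|ftp|mailto):"
SKIP_SUFFIXES = (
    r"\.(css|gif|GIF|jpg|JPG|png|PNG|ico|ICO|sit|eps|wmf|zip|ppt|mpg|xls"
    r"|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$"
)


def first_match_decision(rules, url):
    """Hypothetical helper: apply (sign, pattern) rules in file order.

    The first pattern found in the URL decides the outcome:
    '+' accepts, '-' rejects. If nothing matches, the URL is rejected.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


# Rule order as posted: the '+' rule comes before the suffix rule.
original_order = [("+", ALLOW_SCHEMES), ("-", SKIP_SUFFIXES)]
# Suggested order: the suffix rule is checked first.
reordered = [("-", SKIP_SUFFIXES), ("+", ALLOW_SCHEMES)]

css_url = "file:/hm/vineetg/SPD38/share/doc/spwlibcomm/catalog.css"
html_url = "file:/hm/vineetg/SPD38/share/doc/spwcsc/title_1.html"

print(first_match_decision(original_order, css_url))  # accepted: '+' rule wins
print(first_match_decision(reordered, css_url))       # rejected by suffix rule
print(first_match_decision(reordered, html_url))      # still accepted
```

With the posted order, the .css URL is accepted because the "+" rule matches first; with the suffix rule moved up, it is rejected while the .html URL is still accepted. Note that the .spw_hidden files have none of the listed suffixes, so they would presumably need a rule of their own to be skipped.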
