My replies inline.

On Fri, Apr 4, 2008 at 12:47 PM, Vineet Garg <[EMAIL PROTECTED]> wrote:
> Hi
>
> Thanks for the response. Maybe I was not clear in expressing myself.
>
> I am crawling a parent directory in my 'home' on a Linux machine, therefore
> my URLs have to begin with file: and not http:. I have defined the file
> protocol and the crawl itself is okay. My question is: though I have modified
> crawl-urlfilter.txt to skip certain file types (extensions like .css,
> .pdf, .xml, .php and so on), why is the crawl still fetching those file
> types and throwing errors? How can I avoid this? It is unnecessarily
> fetching file types that I have specified to be skipped, which is simply
> a waste of time.

But since you have allowed 'file:' before disallowing '.css', your
second regex is ignored. Only the first regex that matches is taken
into account. If you want .css to be skipped, you should put the
-\.(css|gif|... line before the +^(file... line.

> Our requirement is to crawl and index two different directories
> residing in our product installation, therefore both my URLs begin with
> file:///.
>
> My second query is:
>
> Before I deploy Nutch to Tomcat, if I run a NutchBean command to test the
> crawl, it always gives 0 hits or a single hit and displays an XML file name.
> As mentioned earlier, I have modified the urlfilter.txt to skip the .xml

In the crawl-urlfilter.txt that you have shown us, I can't see a regex
for .xml.

> types, yet only an XML file is displayed. Any idea why? Of course, after
> deployment, when I perform a search I get the required number of hits. Where
> could I be going wrong?

This is strange. I have never encountered this. Can you show us the
directory structure of the 'crawl' directory and the logs generated
when you enter the command for the first time and get 0 hits?

Regards,
Susam Pal

> Susam Pal wrote:
> > Find my reply inline.
> >
> > On Wed, Apr 2, 2008 at 5:04 PM, Vineet Garg <[EMAIL PROTECTED]> wrote:
> > > Hi,
> > > I am using Nutch to crawl the local file system. I am crawling with
> > > bin/nutch crawl urls -dir crawl -depth 5 -topN 500 >& crawl.log.
> > > But Nutch is fetching files, e.g. .css or .png files, which I have set
> > > to be skipped in the crawl-urlfilter.txt file, and throwing errors
> > > while parsing:
> > >
> > > fetching file:/hm/vineetg/SPD38/libraries/old_spb/.spw_hidden
> > > fetching file:/hm/vineetg/SPD38/share/doc/spwlibcomm/catalog.css
> > > fetching file:/hm/vineetg/SPD38/libraries/example_hiers/.spw_hidden
> > > fetching file:/hm/vineetg/SPD38/share/doc/spwlibcomm/spwlibcommPreface_2.html
> > > fetching file:/hm/vineetg/SPD38/share/doc/spweul/images/
> > > fetching file:/hm/vineetg/SPD38/share/doc/spwfds/wwhdata/
> > > fetching file:/hm/vineetg/SPD38/share/doc/spw2xml/DBMigrator_methodology_5.html
> > > fetching file:/hm/vineetg/SPD38/share/doc/spwtutorial_advanced/spwtutorial_advancedPreface_4.html
> > > fetching file:/hm/vineetg/SPD38/share/doc/spwcsc/title_1.html
> > > fetching file:/hm/vineetg/SPD38/share/doc/spwveis136/images/
> > > Error parsing: file:/hm/vineetg/SPD38/libraries/old_spb/.spw_hidden: failed(2,200): org.apache.nutch.parse.ParseException: parser not found for contentType= url=file:/hm/vineetg/SPD38/libraries/old_spb/.spw_hidden
> > > fetching file:/hm/vineetg/SPD38/share/doc/spwlibwlan/chap1_6.html
> > > Error parsing: file:/hm/vineetg/SPD38/libraries/example_hiers/.spw_hidden: failed(2,200): org.apache.nutch.parse.ParseException: parser not found for contentType= url=file:/hm/vineetg/SPD38/libraries/example_hiers/.spw_hidden
> > > fetching file:/hm/vineetg/SPD38/libraries/cdma_rtl/fir/
> > > Error parsing: file:/hm/vineetg/SPD38/share/doc/spwlibcomm/catalog.css: failed(2,200): org.apache.nutch.parse.ParseException: parser not found for contentType=text/css url=file:/hm/vineetg/SPD38/share/doc/spwlibcomm/catalog.css
> > >
> > > My crawl-urlfilter file is:
> > >
> > > # The url filter file used by the crawl command.
> > >
> > > # Better for intranet crawling.
> > > # Be sure to change MY.DOMAIN.NAME to your domain name.
> > >
> > > # Each non-comment, non-blank line contains a regular expression
> > > # prefixed by '+' or '-'. The first matching pattern in the file
> > > # determines whether a URL is included or ignored. If no pattern
> > > # matches, the URL is ignored.
> > >
> > > # skip http:, ftp:, & mailto: urls
> > > #-^(http|ftp|mailto):
> > > +^(file|ftp|mailto):
> >
> > You have allowed URLs beginning with "file:". Since this is the first
> > regular expression that matches the URLs being crawled, the rest
> > of the crawl-urlfilter.txt is ignored. If you read the comments in
> > this file, you'll find that it says, "The first matching pattern in
> > the file determines whether a URL is included or ignored."
> >
> > Hope this helps.
> >
> > Regards,
> > Susam Pal
> >
> > > # skip image and other suffixes we can't yet parse
> > > -\.(css|gif|GIF|jpg|JPG|png|PNG|ico|ICO|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
> > >
> > > What could be the reason?
> > >
> > > Regards,
> > > Vineet
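
The first-match-wins rule discussed in this thread can be illustrated with a small Python sketch. This is a simplified model of the behaviour described in the filter file's own comments, not Nutch's actual implementation. The helper names (parse_rules, accepts) are hypothetical, the suffix list is abbreviated to three entries for readability, and patterns are assumed to match anywhere in the URL.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first matching rule; ignore the URL if none match."""
    for include, pattern in rules:
        if pattern.search(url):
            return include
    return False


# Order from the original question: the '+' rule comes first and wins
# for every file: URL, so the suffix rule is never reached.
ORIGINAL_ORDER = r"""
+^(file|ftp|mailto):
-\.(css|gif|png)$
"""

# Order suggested in the reply: the suffix rule comes first.
FIXED_ORDER = r"""
-\.(css|gif|png)$
+^(file|ftp|mailto):
"""

css_url = "file:/hm/vineetg/SPD38/share/doc/spwlibcomm/catalog.css"
html_url = "file:/hm/vineetg/SPD38/share/doc/spwcsc/title_1.html"
hidden_url = "file:/hm/vineetg/SPD38/libraries/old_spb/.spw_hidden"

print(accepts(parse_rules(ORIGINAL_ORDER), css_url))  # True: .css is fetched
print(accepts(parse_rules(FIXED_ORDER), css_url))     # False: .css is skipped
print(accepts(parse_rules(FIXED_ORDER), html_url))    # True: .html is still fetched
print(accepts(parse_rules(FIXED_ORDER), hidden_url))  # True: no rule excludes it
```

Under this model, reordering the rules skips the .css file, but the .spw_hidden files are still accepted because no pattern excludes them. A separate exclusion rule would presumably be needed for those.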
