Re: Nutch fetching skipped files

Susam Pal Fri, 04 Apr 2008 09:34:52 -0700

My replies inline.

On Fri, Apr 4, 2008 at 12:47 PM, Vineet Garg <[EMAIL PROTECTED]> wrote:
> Hi
>
>  Thanks for the response. Maybe I was not clear in expressing myself.
>
>  I am crawling a parent directory in my 'home' on a Linux machine, so my
>  urls have to begin with file: and not http:. I have defined the file
>  protocol and the crawl itself is okay. My question is: though I have
>  modified crawl-urlfilter.txt to skip certain file types (extensions like
>  .css, pdf, xml, php and so on), why is the crawl still looking for those
>  file types and throwing errors? How can I avoid this? It is unnecessarily
>  looking for file types that I have specified to be skipped, which is
>  simply a waste of time.


But since your '+^(file...' rule comes before the rule that disallows
'.css', the second regex is never consulted for your URLs. Every URL you
crawl starts with 'file:', and only the first regex that matches a URL is
taken into account. If you want .css to be skipped, you should put the
-\.(css|gif|... line before the +^(file... line.
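
To make the ordering effect concrete, here is a small Python sketch of
first-match-wins filtering. It is only an illustration of the rule stated
in the comments of crawl-urlfilter.txt, not Nutch's actual code. The
function names (parse_rules, accepts) are hypothetical, the suffix list is
shortened to three entries for readability, and I am assuming the regex
may match anywhere in the URL.

```python
import re

def parse_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # First character is the sign; the rest is the regular expression.
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules

def accepts(rules, url):
    """Return the verdict of the first rule whose regex is found in the URL."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False  # no pattern matched: the URL is ignored

# Order as in the posted file (suffix list abbreviated for this sketch).
ORIGINAL_ORDER = r"""
+^(file|ftp|mailto):
-\.(css|gif|png)$
"""

# Same two rules with the skip rule moved to the top.
FIXED_ORDER = r"""
-\.(css|gif|png)$
+^(file|ftp|mailto):
"""

css_url = "file:/hm/vineetg/SPD38/share/doc/spwlibcomm/catalog.css"
html_url = "file:/hm/vineetg/SPD38/share/doc/spwcsc/title_1.html"

print(accepts(parse_rules(ORIGINAL_ORDER), css_url))  # True: '+^(file' matches first
print(accepts(parse_rules(FIXED_ORDER), css_url))     # False: the skip rule matches first
print(accepts(parse_rules(FIXED_ORDER), html_url))    # True: only '+^(file' matches
```

With the original order the .css URL is accepted because the 'file:' rule
is reached first; with the skip rule on top it is rejected, while .html
files under file: are still accepted.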

>  Our requirement is to perform crawl and index two different directories
>  residing in our product installation, therefore both my urls begin with
>  file:///.
>
>  My second query is:
>
>  Before I deploy nutch to tomcat if I run a NutchBean command to test the
>  crawl it always gives 0 hits or a single hit and displays an xml file name.
>  As mentioned earlier I have modified the urlfilter.txt to skip the .xml

In the crawl-urlfilter.txt that you have shown us, I can't see a regex for .xml.

>  types still only an xml is displayed. Any idea why? Of course after
>  deployment when I perform a search I get the required number of hits. Where
>  could I be going wrong?

This is strange. I have never encountered this. Can you show us the
directory structure of the 'crawl' directory and the logs generated
when you run the command for the first time and get 0 hits?

Regards,
Susam Pal

>
>  Susam Pal wrote:
>
> > Find my reply inline.
> >
> > On Wed, Apr 2, 2008 at 5:04 PM, Vineet Garg <[EMAIL PROTECTED]> wrote:
> >
> >
> > > Hi,
> > >  I am using Nutch to crawl local file system. I am crawling by
> > > bin/nutch crawl urls -dir crawl -depth 5 -topN 500 > & crawl.log.
> > >  But nutch is fetching files e.g. .css or .png files which i have set to be
> > > skipped in crawl-urlfilter.txt file and throwing error while parsing:
> > >
> > >  fetching file:/hm/vineetg/SPD38/libraries/old_spb/.spw_hidden
> > >  fetching file:/hm/vineetg/SPD38/share/doc/spwlibcomm/catalog.css
> > >  fetching file:/hm/vineetg/SPD38/libraries/example_hiers/.spw_hidden
> > >  fetching file:/hm/vineetg/SPD38/share/doc/spwlibcomm/spwlibcommPreface_2.html
> > >  fetching file:/hm/vineetg/SPD38/share/doc/spweul/images/
> > >  fetching file:/hm/vineetg/SPD38/share/doc/spwfds/wwhdata/
> > >  fetching file:/hm/vineetg/SPD38/share/doc/spw2xml/DBMigrator_methodology_5.html
> > >  fetching file:/hm/vineetg/SPD38/share/doc/spwtutorial_advanced/spwtutorial_advancedPreface_4.html
> > >  fetching file:/hm/vineetg/SPD38/share/doc/spwcsc/title_1.html
> > >  fetching file:/hm/vineetg/SPD38/share/doc/spwveis136/images/
> > >  Error parsing: file:/hm/vineetg/SPD38/libraries/old_spb/.spw_hidden: failed(2,200): org.apache.nutch.parse.ParseException: parser not found for contentType= url=file:/hm/vineetg/SPD38/libraries/old_spb/.spw_hidden
> > >  fetching file:/hm/vineetg/SPD38/share/doc/spwlibwlan/chap1_6.html
> > >  Error parsing: file:/hm/vineetg/SPD38/libraries/example_hiers/.spw_hidden: failed(2,200): org.apache.nutch.parse.ParseException: parser not found for contentType= url=file:/hm/vineetg/SPD38/libraries/example_hiers/.spw_hidden
> > >  fetching file:/hm/vineetg/SPD38/libraries/cdma_rtl/fir/
> > >  Error parsing: file:/hm/vineetg/SPD38/share/doc/spwlibcomm/catalog.css: failed(2,200): org.apache.nutch.parse.ParseException: parser not found for contentType=text/css url=file:/hm/vineetg/SPD38/share/doc/spwlibcomm/catalog.css
> > >
> > >
> > >  my crawl-urlfilter file is:
> > >
> > >  # The url filter file used by the crawl command.
> > >
> > >  # Better for intranet crawling.
> > >  # Be sure to change MY.DOMAIN.NAME to your domain name.
> > >
> > >  # Each non-comment, non-blank line contains a regular expression
> > >  # prefixed by '+' or '-'.  The first matching pattern in the file
> > >  # determines whether a URL is included or ignored.  If no pattern
> > >  # matches, the URL is ignored.
> > >
> > >  # skip http:, ftp:, & mailto: urls
> > >  #-^(http|ftp|mailto):
> > >  +^(file|ftp|mailto):
> > >
> > >
> >
> > You have allowed URLs beginning with "file:". Since this is the first
> > regular expression that matches the URLs being crawled, the rest
> > of crawl-urlfilter.txt is ignored. If you read the comments in
> > this file, you'll find that it says, "The first matching pattern in
> > the file determines whether a URL is included or ignored."
> >
> > Hope this helps.
> >
> > Regards,
> > Susam Pal
> >
> >
> >
> > >
> > >  # skip image and other suffixes we can't yet parse
> > >  -\.(css|gif|GIF|jpg|JPG|png|PNG|ico|ICO|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
> > >
> > >  What could be the reason??
> > >
> > >  Regards,
> > >  Vineet
> > >
> > >
> > >
> >
> >
> >
>
>
