Re: Nutch and its Growing Capabilities

Kevin MacDonald Sun, 05 Oct 2008 18:30:34 -0700

The second version looks like it should work. I would look at
Fetcher.handleRedirect() and add extra log lines around both the normalizers
and the URL filters. One of those may be filtering out URLs that you expect
to be crawled. I don't use Nutch the same way you do, so I can't offer more
advice than that. Good luck.
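
If you want to check the filter rules without touching the Fetcher, you
could run a few suspect URLs through them offline. Below is a minimal Python
sketch, assuming the first-match-wins behaviour described in the comments of
your filter file and an unanchored regex search. The function names and the
sample URLs are made up for illustration, and Python's re module only
approximates the Java regex engine that Nutch uses, so treat the result as a
hint rather than proof.

```python
import re


def load_rules(text):
    """Parse filter text into (include, compiled_regex) pairs.

    Blank lines and lines starting with '#' are skipped.
    """
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with '+' or '-': {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def first_match(rules, url):
    """Return (accepted, pattern) for the first matching rule.

    If no rule matches, the URL is ignored: (False, None).
    """
    for include, regex in rules:
        if regex.search(url):
            return include, regex.pattern
    return False, None


# A few rules copied from the first filter file in this thread.
RULES = load_rules(r"""
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
+^http://([a-z0-9]*\.)*eMedicine.com
+^http://([a-z0-9]*\.)*nih.gov
# skip everything else
-.
""")

# Hypothetical test URLs, chosen only to exercise the rules above.
for url in [
    "http://www.nih.gov/health/",
    "http://www.emedicine.com/",
    "http://my-site.nih.gov/",
    "ftp://ftp.nih.gov/pub/",
]:
    accepted, pattern = first_match(RULES, url)
    print("ACCEPT" if accepted else "REJECT", url, "via", pattern)
```

With these rules the lowercase emedicine.com URL falls through to the final
"-." rule, because the eMedicine.com pattern is case-sensitive, and the
hyphenated host is rejected too, because [a-z0-9]* does not match a hyphen.
If the URLs you are missing show the same pattern, the filter file is the
place to look before the Fetcher.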


On Sun, Oct 5, 2008 at 12:29 PM, nutch_newbie <[EMAIL PROTECTED]> wrote:

>
> here it is:
> # The url filter file used by the crawl command.
>
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'.  The first matching pattern in the file
> # determines whether a URL is included or ignored.  If no pattern
> # matches, the URL is ignored.
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
>
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/.+?)/.*?\1/.*?\1/
>
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-z0-9]*\.)*healthline.com/
> +^http://([a-z0-9]*\.)*healthfind.com/
> +^http://([a-z0-9]*\.)*omnimedicalsearch.com/
> +^http://([a-z0-9]*\.)*nih.gov
> +^http://([a-z0-9]*\.)*cdc.gov/
> +^http://([a-z0-9]*\.)*cancer.gov
> +^http://([a-z0-9]*\.)*medpagetoday.com/
> +^http://([a-z0-9]*\.)*fda.gov
> +^http://([a-z0-9]*\.)*ovid.com
> +^http://([a-z0-9]*\.)*intute.ac.uk
> +^http://([a-z0-9]*\.)*guideline.gov
> +^http://([a-z0-9]*\.)*jwatch.org
> +^http://([a-z0-9]*\.)*clinicaltrials.gov
> +^http://([a-z0-9]*\.)*centerwatch.com
> +^http://([a-z0-9]*\.)*eMedicine.com
> +^http://([a-z0-9]*\.)*rxlist.com
> +^http://([a-z0-9]*\.)*oncolink.com
> +^http://([a-z0-9]*\.)*omnimedicalsearch.com
> +^http://([a-z0-9]*\.)*mwsearch.com/
> +^http://([a-z0-9]*\.)*hon.ch/MedHunt/
> +^http://([a-z0-9]*\.)*medicinenet.com
> +^http://([a-z0-9]*\.)*webmd.com/
> +^http://([a-z0-9]*\.)*medlineplus.gov/
> +^http://([a-z0-9]*\.)*emedisearch.com
> +^http://([a-z0-9]*\.)*diabetes-experts.com
> +^http://([a-z0-9]*\.)*obesity-experts.com
> +^http://([a-z0-9]*\.)*insomnia-treatment101.com
> +^http://([a-z0-9]*\.)*bursitis101.com
> +^http://([a-z0-9]*\.)*prostate-experts.com
> +^http://([a-z0-9]*\.)*cystic-fibrosis101.com
> +^http://([a-z0-9]*\.)*acid-reflux101.com
> +^http://([a-z0-9]*\.)*addiction-treatment101.com
> +^http://([a-z0-9]*\.)*medicalndx.com/
> +^http://([a-z0-9]*\.)*mwsearch.com
> +^http://([a-z0-9]*\.)*ncbi.nlm.nih.gov/pubmed
> +^http://([a-z0-9]*\.)*sumsearch.uthscsa.edu/
> +^http://([a-z0-9]*\.)*health.flexfinder.com
> +^http://([a-z0-9]*\.)*medic8.com
> +^http://([a-z0-9]*\.)*healthatoz.com
> +^http://([a-z0-9]*\.)*kmle.com
> +^http://([a-z0-9]*\.)*medworld.stanford.edu/medbot/
> +^http://([a-z0-9]*\.)*lib.uiowa.edu/hardin/md/
> +^http://([a-z0-9]*\.)*HealthAtoZ.com/
> +^http://([a-z0-9]*\.)*healthfinder.gov
> +^http://([a-z0-9]*\.)*unmc.edu/library/education/internet/medsearch.
> +^http://([a-z0-9]*\.)*mdlinx.com
> +^http://([a-z0-9]*\.)*
> unmc.edu/library/education/internet/medsearch.html#medical
> +^http://([a-z0-9]*\.)*hon.ch
> +^http://([a-z0-9]*\.)*medbioworld.com
> +^http://([a-z0-9]*\.)*medlineplus.gov
> +^http://([a-z0-9]*\.)*medscape.com
> +^http://([a-z0-9]*\.)*scirus.com
> +^http://([a-z0-9]*\.)*metacrawler.com
> +^http://([a-z0-9]*\.)*vivisimo.com/
> +^http://([a-z0-9]*\.)*livegrandrounds.com
> +^http://([a-z0-9]*\.)*nlm.nih.gov/
> +^http://([a-z0-9]*\.)*nih.gov/
> +^http://([a-z0-9]*\.)*os.dhhs.gov/
> +^http://([a-z0-9]*\.)*pubmedcentral.nih.gov/
> +^http://([a-z0-9]*\.)*emedicine.com/EMERG/
> +^http://([a-z0-9]*\.)*emedmag.com/
> +^http://([a-z0-9]*\.)*aep.org/
> +^http://([a-z0-9]*\.)*aaem.org/
> +^http://([a-z0-9]*\.)*abem.org/public/
> +^http://([a-z0-9]*\.)*ncemi.org/
> +^http://([a-z0-9]*\.)*embbs.com
> +^http://([a-z0-9]*\.)*emedhome.com
> +^http://([a-z0-9]*\.)*biomedcentral.com/bmcemergmed/
> +^http://([a-z0-9]*\.)*emj.bmj.com/
> +^http://([a-z0-9]*\.)*emedicine.com/emerg/index.shtml
> # skip everything else
> -.
>
> and here is another version that i tried:
> # The url filter file used by the crawl command.
>
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'.  The first matching pattern in the file
> # determines whether a URL is included or ignored.  If no pattern
> # matches, the URL is ignored.
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
>
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> #-.*(/.+?)/.*?\1/.*?\1/
>
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-z0-9]*\.)*\S*
>
> # skip everything else
> -.
>
>
>
> --
> View this message in context:
> http://www.nabble.com/Nutch-and-its-Growing-Capabilities-tp19597372p19828279.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>
