The second version looks like it should work. I would look at Fetcher.handleRedirect() and put extra log lines around both the normalizers and the urlfilters. It's possible that one of them is filtering out URLs that you expect to be crawled. I don't use Nutch the same way you do, so I can't offer more advice than that. Good luck.
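If it helps to narrow down which rule is responsible, the first-match behaviour described in the filter file's comments can be reproduced outside Nutch. The following is only a minimal sketch in Python, not Nutch code: the names `parse_rules` and `first_match` are made up for illustration, only a handful of rules from the first version are included, and it assumes the patterns are case-sensitive and searched anywhere in the URL (which the unanchored `-.` catch-all suggests). Python's `re` and Java's regex engine agree on these particular patterns, but that is not guaranteed in general.

```python
import re


def parse_rules(text):
    """Return (accept, compiled_regex, source_line) for each rule line.

    Blank lines and '#' comments are skipped, as the filter file describes.
    """
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with '+' or '-': {line!r}")
        rules.append((sign == "+", re.compile(pattern), line))
    return rules


def first_match(rules, url):
    """Return (accepted, rule_text) for the first rule that matches url.

    If no rule matches, the URL is ignored: (False, None).
    """
    for accept, regex, source in rules:
        if regex.search(url):  # assumption: unanchored search, case-sensitive
            return accept, source
    return False, None


# A small subset of the rules from the first version of the file.
RULES = r"""
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$
-.*(/.+?)/.*?\1/.*?\1/
+^http://([a-z0-9]*\.)*nih.gov
+^http://([a-z0-9]*\.)*eMedicine.com
# skip everything else
-.
"""

rules = parse_rules(RULES)

if __name__ == "__main__":
    # Hypothetical test URLs; substitute the ones missing from your crawl.
    for url in [
        "http://www.nih.gov/health/",
        "http://www.nih.gov/logo.gif",
        "http://www.nih.gov/a/b/a/b/a/b/",
        "http://www.emedicine.com/emerg/",
    ]:
        accepted, rule = first_match(rules, url)
        print("ACCEPT" if accepted else "REJECT", url, "by", rule)
```

Under those assumptions, a lowercase `emedicine.com` URL falls through to `-.` because the rule is written as `eMedicine.com`, so mixed-case host patterns in the list may be worth checking. Feeding the sketch the URLs you expected to see should show whether the filter file is the culprit or whether the problem lies elsewhere, such as in the normalizers.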
On Sun, Oct 5, 2008 at 12:29 PM, nutch_newbie <[EMAIL PROTECTED]> wrote:

>
> here it is:
> # The url filter file used by the crawl command.
>
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'. The first matching pattern in the file
> # determines whether a URL is included or ignored. If no pattern
> # matches, the URL is ignored.
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
>
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$
>
> # skip URLs containing certain characters as probable queries, etc.
> [EMAIL PROTECTED]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/.+?)/.*?\1/.*?\1/
>
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-z0-9]*\.)*healthline.com/
> +^http://([a-z0-9]*\.)*healthfind.com/
> +^http://([a-z0-9]*\.)*omnimedicalsearch.com/
> +^http://([a-z0-9]*\.)*nih.gov
> +^http://([a-z0-9]*\.)*cdc.gov/
> +^http://([a-z0-9]*\.)*cancer.gov
> +^http://([a-z0-9]*\.)*medpagetoday.com/
> +^http://([a-z0-9]*\.)*fda.gov
> +^http://([a-z0-9]*\.)*ovid.com
> +^http://([a-z0-9]*\.)*intute.ac.uk
> +^http://([a-z0-9]*\.)*guideline.gov
> +^http://([a-z0-9]*\.)*jwatch.org
> +^http://([a-z0-9]*\.)*clinicaltrials.gov
> +^http://([a-z0-9]*\.)*centerwatch.com
> +^http://([a-z0-9]*\.)*eMedicine.com
> +^http://([a-z0-9]*\.)*rxlist.com
> +^http://([a-z0-9]*\.)*oncolink.com
> +^http://([a-z0-9]*\.)*omnimedicalsearch.com
> +^http://([a-z0-9]*\.)*mwsearch.com/
> +^http://([a-z0-9]*\.)*hon.ch/MedHunt/
> +^http://([a-z0-9]*\.)*medicinenet.com
> +^http://([a-z0-9]*\.)*webmd.com/
> +^http://([a-z0-9]*\.)*medlineplus.gov/
> +^http://([a-z0-9]*\.)*emedisearch.com
> +^http://([a-z0-9]*\.)*diabetes-experts.com
> +^http://([a-z0-9]*\.)*obesity-experts.com
> +^http://([a-z0-9]*\.)*insomnia-treatment101.com
> +^http://([a-z0-9]*\.)*bursitis101.com
> +^http://([a-z0-9]*\.)*prostate-experts.com
> +^http://([a-z0-9]*\.)*cystic-fibrosis101.com
> +^http://([a-z0-9]*\.)*acid-reflux101.com
> +^http://([a-z0-9]*\.)*addiction-treatment101.com
> +^http://([a-z0-9]*\.)*medicalndx.com/
> +^http://([a-z0-9]*\.)*mwsearch.com
> +^http://([a-z0-9]*\.)*ncbi.nlm.nih.gov/pubmed
> +^http://([a-z0-9]*\.)*sumsearch.uthscsa.edu/
> +^http://([a-z0-9]*\.)*health.flexfinder.com
> +^http://([a-z0-9]*\.)*medic8.com
> +^http://([a-z0-9]*\.)*healthatoz.com
> +^http://([a-z0-9]*\.)*kmle.com
> +^http://([a-z0-9]*\.)*medworld.stanford.edu/medbot/
> +^http://([a-z0-9]*\.)*lib.uiowa.edu/hardin/md/
> +^http://([a-z0-9]*\.)*HealthAtoZ.com/
> +^http://([a-z0-9]*\.)*healthfinder.gov
> +^http://([a-z0-9]*\.)*unmc.edu/library/education/internet/medsearch.
> +^http://([a-z0-9]*\.)*mdlinx.com
> +^http://([a-z0-9]*\.)*unmc.edu/library/education/internet/medsearch.html#medical
> +^http://([a-z0-9]*\.)*hon.ch
> +^http://([a-z0-9]*\.)*medbioworld.com
> +^http://([a-z0-9]*\.)*medlineplus.gov
> +^http://([a-z0-9]*\.)*medscape.com
> +^http://([a-z0-9]*\.)*scirus.com
> +^http://([a-z0-9]*\.)*metacrawler.com
> +^http://([a-z0-9]*\.)*vivisimo.com/
> +^http://([a-z0-9]*\.)*livegrandrounds.com
> +^http://([a-z0-9]*\.)*nlm.nih.gov/
> +^http://([a-z0-9]*\.)*nih.gov/
> +^http://([a-z0-9]*\.)*os.dhhs.gov/
> +^http://([a-z0-9]*\.)*pubmedcentral.nih.gov/
> +^http://([a-z0-9]*\.)*emedicine.com/EMERG/
> +^http://([a-z0-9]*\.)*emedmag.com/
> +^http://([a-z0-9]*\.)*aep.org/
> +^http://([a-z0-9]*\.)*aaem.org/
> +^http://([a-z0-9]*\.)*abem.org/public/
> +^http://([a-z0-9]*\.)*ncemi.org/
> +^http://([a-z0-9]*\.)*embbs.com
> +^http://([a-z0-9]*\.)*emedhome.com
> +^http://([a-z0-9]*\.)*biomedcentral.com/bmcemergmed/
> +^http://([a-z0-9]*\.)*emj.bmj.com/
> +^http://([a-z0-9]*\.)*emedicine.com/emerg/index.shtml
> # skip everything else
> -.
>
> and here is another version that i tried:
> # The url filter file used by the crawl command.
>
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'. The first matching pattern in the file
> # determines whether a URL is included or ignored. If no pattern
> # matches, the URL is ignored.
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
>
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$
>
> # skip URLs containing certain characters as probable queries, etc.
> [EMAIL PROTECTED]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> #-.*(/.+?)/.*?\1/.*?\1/
>
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-z0-9]*\.)*\S*
>
> # skip everything else
> -.
>
> --
> View this message in context:
> http://www.nabble.com/Nutch-and-its-Growing-Capabilities-tp19597372p19828279.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
