Here it is:

# The url filter file used by the crawl command.
# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*healthline.com/
+^http://([a-z0-9]*\.)*healthfind.com/
+^http://([a-z0-9]*\.)*omnimedicalsearch.com/
+^http://([a-z0-9]*\.)*nih.gov
+^http://([a-z0-9]*\.)*cdc.gov/
+^http://([a-z0-9]*\.)*cancer.gov
+^http://([a-z0-9]*\.)*medpagetoday.com/
+^http://([a-z0-9]*\.)*fda.gov
+^http://([a-z0-9]*\.)*ovid.com
+^http://([a-z0-9]*\.)*intute.ac.uk
+^http://([a-z0-9]*\.)*guideline.gov
+^http://([a-z0-9]*\.)*jwatch.org
+^http://([a-z0-9]*\.)*clinicaltrials.gov
+^http://([a-z0-9]*\.)*centerwatch.com
+^http://([a-z0-9]*\.)*eMedicine.com
+^http://([a-z0-9]*\.)*rxlist.com
+^http://([a-z0-9]*\.)*oncolink.com
+^http://([a-z0-9]*\.)*omnimedicalsearch.com
+^http://([a-z0-9]*\.)*mwsearch.com/
+^http://([a-z0-9]*\.)*hon.ch/MedHunt/
+^http://([a-z0-9]*\.)*medicinenet.com
+^http://([a-z0-9]*\.)*webmd.com/
+^http://([a-z0-9]*\.)*medlineplus.gov/
+^http://([a-z0-9]*\.)*emedisearch.com
+^http://([a-z0-9]*\.)*diabetes-experts.com
+^http://([a-z0-9]*\.)*obesity-experts.com
+^http://([a-z0-9]*\.)*insomnia-treatment101.com
+^http://([a-z0-9]*\.)*bursitis101.com
+^http://([a-z0-9]*\.)*prostate-experts.com
+^http://([a-z0-9]*\.)*cystic-fibrosis101.com
+^http://([a-z0-9]*\.)*acid-reflux101.com
+^http://([a-z0-9]*\.)*addiction-treatment101.com
+^http://([a-z0-9]*\.)*medicalndx.com/
+^http://([a-z0-9]*\.)*mwsearch.com
+^http://([a-z0-9]*\.)*ncbi.nlm.nih.gov/pubmed
+^http://([a-z0-9]*\.)*sumsearch.uthscsa.edu/
+^http://([a-z0-9]*\.)*health.flexfinder.com
+^http://([a-z0-9]*\.)*medic8.com
+^http://([a-z0-9]*\.)*healthatoz.com
+^http://([a-z0-9]*\.)*kmle.com
+^http://([a-z0-9]*\.)*medworld.stanford.edu/medbot/
+^http://([a-z0-9]*\.)*lib.uiowa.edu/hardin/md/
+^http://([a-z0-9]*\.)*HealthAtoZ.com/
+^http://([a-z0-9]*\.)*healthfinder.gov
+^http://([a-z0-9]*\.)*unmc.edu/library/education/internet/medsearch.
+^http://([a-z0-9]*\.)*mdlinx.com
+^http://([a-z0-9]*\.)*unmc.edu/library/education/internet/medsearch.html#medical
+^http://([a-z0-9]*\.)*hon.ch
+^http://([a-z0-9]*\.)*medbioworld.com
+^http://([a-z0-9]*\.)*medlineplus.gov
+^http://([a-z0-9]*\.)*medscape.com
+^http://([a-z0-9]*\.)*scirus.com
+^http://([a-z0-9]*\.)*metacrawler.com
+^http://([a-z0-9]*\.)*vivisimo.com/
+^http://([a-z0-9]*\.)*livegrandrounds.com
+^http://([a-z0-9]*\.)*nlm.nih.gov/
+^http://([a-z0-9]*\.)*nih.gov/
+^http://([a-z0-9]*\.)*os.dhhs.gov/
+^http://([a-z0-9]*\.)*pubmedcentral.nih.gov/
+^http://([a-z0-9]*\.)*emedicine.com/EMERG/
+^http://([a-z0-9]*\.)*emedmag.com/
+^http://([a-z0-9]*\.)*aep.org/
+^http://([a-z0-9]*\.)*aaem.org/
+^http://([a-z0-9]*\.)*abem.org/public/
+^http://([a-z0-9]*\.)*ncemi.org/
+^http://([a-z0-9]*\.)*embbs.com
+^http://([a-z0-9]*\.)*emedhome.com
+^http://([a-z0-9]*\.)*biomedcentral.com/bmcemergmed/
+^http://([a-z0-9]*\.)*emj.bmj.com/
+^http://([a-z0-9]*\.)*emedicine.com/emerg/index.shtml

# skip everything else
-.
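In case it helps anyone reading along, here is a minimal sketch of the first-match semantics the header comments describe: rules are tried top to bottom, the first pattern that matches decides, and a URL that matches nothing is ignored. This is only an illustration under those stated semantics, not Nutch's actual RegexURLFilter code; the class and method names are made up.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Minimal sketch of the filter semantics: rules are tried in order,
// the first matching pattern decides, and an unmatched URL is ignored.
public class FirstMatchUrlFilter {
    // Maps each compiled pattern to true ('+', include) or false ('-', ignore),
    // in insertion order.
    private final Map<Pattern, Boolean> rules = new LinkedHashMap<>();

    // Add one rule line, e.g. "+^http://([a-z0-9]*\.)*cdc.gov/".
    public void addRule(String line) {
        line = line.trim();
        if (line.isEmpty() || line.startsWith("#")) return; // blank or comment
        boolean include = line.charAt(0) == '+';
        rules.put(Pattern.compile(line.substring(1)), include);
    }

    public boolean accept(String url) {
        for (Map.Entry<Pattern, Boolean> rule : rules.entrySet()) {
            // find() allows a match anywhere in the URL, so the ^ and $
            // anchors in the rule file matter.
            if (rule.getKey().matcher(url).find()) {
                return rule.getValue(); // first match wins
            }
        }
        return false; // no pattern matched: URL is ignored
    }
}

With the first file loaded this way, accept() returns true only for URLs on the whitelisted health hosts; everything else is caught by an earlier '-' rule or falls through to the final "-." catch-all.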
And here is another version that I tried:

# The url filter file used by the crawl command.
# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
#-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*\S*

# skip everything else
-.
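One thing worth noting about this second version: since \S* matches any run of non-whitespace characters, the single '+' rule accepts essentially every http:// URL that survives the earlier '-' rules, so the crawl is no longer restricted to particular hosts and the final "-." mostly just rejects other schemes. A quick self-contained check of the accept pattern on its own (example.com here is just a made-up test host):

import java.util.regex.Pattern;

// Quick check that the catch-all accept rule from the second file
// matches arbitrary http URLs, not just the intended hosts.
public class CatchAllCheck {
    public static void main(String[] args) {
        Pattern acceptAll = Pattern.compile("^http://([a-z0-9]*\\.)*\\S*");
        String[] urls = {
            "http://www.nih.gov/",          // a host from the first file
            "http://example.com/anything",  // an unrelated host
        };
        for (String url : urls) {
            // find() mirrors unanchored rule matching; both lines print true
            System.out.println(url + " -> " + acceptAll.matcher(url).find());
        }
    }
}

So if the goal is still to stay inside the listed domains, the first version's explicit host list is the safer of the two.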
