Here it is:
# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*healthline.com/
+^http://([a-z0-9]*\.)*healthfind.com/
+^http://([a-z0-9]*\.)*omnimedicalsearch.com/
+^http://([a-z0-9]*\.)*nih.gov
+^http://([a-z0-9]*\.)*cdc.gov/
+^http://([a-z0-9]*\.)*cancer.gov
+^http://([a-z0-9]*\.)*medpagetoday.com/
+^http://([a-z0-9]*\.)*fda.gov
+^http://([a-z0-9]*\.)*ovid.com
+^http://([a-z0-9]*\.)*intute.ac.uk
+^http://([a-z0-9]*\.)*guideline.gov
+^http://([a-z0-9]*\.)*jwatch.org
+^http://([a-z0-9]*\.)*clinicaltrials.gov
+^http://([a-z0-9]*\.)*centerwatch.com
+^http://([a-z0-9]*\.)*emedicine.com
+^http://([a-z0-9]*\.)*rxlist.com
+^http://([a-z0-9]*\.)*oncolink.com
+^http://([a-z0-9]*\.)*omnimedicalsearch.com
+^http://([a-z0-9]*\.)*mwsearch.com/
+^http://([a-z0-9]*\.)*hon.ch/MedHunt/
+^http://([a-z0-9]*\.)*medicinenet.com
+^http://([a-z0-9]*\.)*webmd.com/
+^http://([a-z0-9]*\.)*medlineplus.gov/
+^http://([a-z0-9]*\.)*emedisearch.com
+^http://([a-z0-9]*\.)*diabetes-experts.com
+^http://([a-z0-9]*\.)*obesity-experts.com
+^http://([a-z0-9]*\.)*insomnia-treatment101.com
+^http://([a-z0-9]*\.)*bursitis101.com
+^http://([a-z0-9]*\.)*prostate-experts.com
+^http://([a-z0-9]*\.)*cystic-fibrosis101.com
+^http://([a-z0-9]*\.)*acid-reflux101.com
+^http://([a-z0-9]*\.)*addiction-treatment101.com
+^http://([a-z0-9]*\.)*medicalndx.com/
+^http://([a-z0-9]*\.)*mwsearch.com
+^http://([a-z0-9]*\.)*ncbi.nlm.nih.gov/pubmed
+^http://([a-z0-9]*\.)*sumsearch.uthscsa.edu/
+^http://([a-z0-9]*\.)*health.flexfinder.com
+^http://([a-z0-9]*\.)*medic8.com
+^http://([a-z0-9]*\.)*healthatoz.com
+^http://([a-z0-9]*\.)*kmle.com
+^http://([a-z0-9]*\.)*medworld.stanford.edu/medbot/
+^http://([a-z0-9]*\.)*lib.uiowa.edu/hardin/md/
+^http://([a-z0-9]*\.)*healthatoz.com/
+^http://([a-z0-9]*\.)*healthfinder.gov
+^http://([a-z0-9]*\.)*unmc.edu/library/education/internet/medsearch.
+^http://([a-z0-9]*\.)*mdlinx.com
+^http://([a-z0-9]*\.)*unmc.edu/library/education/internet/medsearch.html#medical
+^http://([a-z0-9]*\.)*hon.ch
+^http://([a-z0-9]*\.)*medbioworld.com
+^http://([a-z0-9]*\.)*medlineplus.gov
+^http://([a-z0-9]*\.)*medscape.com
+^http://([a-z0-9]*\.)*scirus.com
+^http://([a-z0-9]*\.)*metacrawler.com
+^http://([a-z0-9]*\.)*vivisimo.com/
+^http://([a-z0-9]*\.)*livegrandrounds.com
+^http://([a-z0-9]*\.)*nlm.nih.gov/
+^http://([a-z0-9]*\.)*nih.gov/
+^http://([a-z0-9]*\.)*os.dhhs.gov/
+^http://([a-z0-9]*\.)*pubmedcentral.nih.gov/
+^http://([a-z0-9]*\.)*emedicine.com/EMERG/
+^http://([a-z0-9]*\.)*emedmag.com/
+^http://([a-z0-9]*\.)*aep.org/
+^http://([a-z0-9]*\.)*aaem.org/
+^http://([a-z0-9]*\.)*abem.org/public/
+^http://([a-z0-9]*\.)*ncemi.org/
+^http://([a-z0-9]*\.)*embbs.com
+^http://([a-z0-9]*\.)*emedhome.com
+^http://([a-z0-9]*\.)*biomedcentral.com/bmcemergmed/
+^http://([a-z0-9]*\.)*emj.bmj.com/
+^http://([a-z0-9]*\.)*emedicine.com/emerg/index.shtml
# skip everything else
-.
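The first-match semantics described in the header comments can be sketched in Python. This is not Nutch's own RegexURLFilter, just an illustration of the evaluation order with an abridged rule set taken from the file above:

```python
import re

# Abridged version of the rules above: ("+" accept / "-" reject, pattern).
# Evaluation stops at the first pattern that matches anywhere in the URL;
# if nothing matches, the URL is ignored.
RULES = [
    ("-", re.compile(r"^(file|ftp|mailto):")),
    ("-", re.compile(r"\.(gif|jpg|png|css|zip|exe)$")),           # abridged suffix list
    ("+", re.compile(r"^http://([a-z0-9]*\.)*healthline\.com/")),
    ("-", re.compile(r".")),                                      # skip everything else
]

def accept(url):
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no pattern matched: ignored

print(accept("http://www.healthline.com/health/flu"))  # → True
print(accept("ftp://example.com/file.txt"))            # → False
print(accept("http://example.com/"))                   # → False
```

Because the catch-all `-.` sits last, ordering matters: any `+` line placed after it would never be reached.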

And here is another version that I tried:
# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
#-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*\S*

# skip everything else
-.
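The commented-out loop-breaker in this version uses a backreference to reject URLs in which a slash-delimited segment repeats three or more times. Python's `re` engine handles the same backreference syntax as Java's, so its effect can be checked with a quick sketch (again, not Nutch code):

```python
import re

# Same pattern as the "-" rule above, minus the leading "-":
# group 1 captures "/segment", and \1 requires it to recur twice more.
loop = re.compile(r".*(/.+?)/.*?\1/.*?\1/")

print(bool(loop.search("http://host/a/b/a/b/a/b/")))  # → True  ("/a/" occurs 3 times)
print(bool(loop.search("http://host/a/b/c/d/")))      # → False (no repeated segment)
```

With this rule commented out and the catch-all `+^http://([a-z0-9]*\.)*\S*` in place, the crawler has no defense against such loop URLs, which may be why this version crawled much more than intended.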
 


-- 
View this message in context: 
http://www.nabble.com/Nutch-and-its-Growing-Capabilities-tp19597372p19828279.html
Sent from the Nutch - User mailing list archive at Nabble.com.
