A good regex filter will work, at the cost of slowing the crawl
down a bit. Perhaps you could block search.* as well as /redir*.

I'm assuming you have allowed URLs containing "?". Perhaps you
should disable that until your filter is good.
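
A minimal sketch of that idea in Python, just to show the shape of the rules. The three patterns and the is_allowed helper are assumptions made up for illustration; this is not Nutch's own filter syntax, and the real rules would go in whatever URL filter configuration your crawl uses.

```python
import re

# Hypothetical deny rules, illustrative only. A URL is rejected as soon
# as any one of them matches.
DENY_PATTERNS = [
    re.compile(r"^https?://search\.[^/]+/", re.IGNORECASE),  # search.* hosts
    re.compile(r"^https?://[^/]+/redir", re.IGNORECASE),     # /redir* paths
    re.compile(r"\?"),                                       # any URL containing "?"
]


def is_allowed(url: str) -> bool:
    """Return True when no deny pattern matches the URL."""
    return not any(pattern.search(url) for pattern in DENY_PATTERNS)
```

The last pattern is the blunt "no query strings" rule; once the more specific rules are trusted it could be dropped again.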

I like to use the Firefox Adblock filter list. It cleans out
advertising and prevents clickthroughs, so nobody ends up paying
for things a spider might do :)
--- cut here --
/(\Wadv|banner|promo)s?(\.(?!wunder)\w+\.\w{2,3}(\.\w{2,2})?/|\W\w*\d+x\d+\.)(?!banners)/
/(absolute|ad|aff(/|iliate.*)|assets/|live|net|partypoker.*|professor|sales|serve|user|video|view|werbe)_?banner/
/(amazon\.\w+.*|barnesandnoble\.com/p.*)(.*www\.amazon(?!tag=)|&search=|amb%5F(gw|skin)|amzban|banner|cm\?t|marketing(/2|.+html)|promo|stripe?s?\W|tcg.*\.[gj])/
/(be|context|impresiones)web\.com/
/(bf|flyc|unic)ast\.com/
/(bravenetmedia|mg|openad)network\.com/
/(casaleme|rightme|travi|vibrantme)dia/
/(dtm|reactiv|regiede)pub\.com/
/(jupiter|mercury)\.bravenet/
/.(ad|ncs)reporting\.com/
//(adt|dclk)\./
//ccas(\.clearchannel|_media/)/
/[/&](affiliates?|revenue)((.*\d+x)?\d+|\.pl|\.swf|fuel|/(banner|script)s?/)/
/[/.]overture(/|.*?.*=|\w*\.js|\.com)/
/[^\w=+]promo(\w*\.js|banner|box)(?!(\.js)?\?)(\W|_|$)/
/[EMAIL
PROTECTED](\w*\d+x\d)?\d*(show)?(\w{3,}%20|alligator|avs|barter|blog|box|central|context|crystal|d?html|exchange|external|forum|front|fuse|gen|get|house|hover|http|i?frame|inline|instant|live|main|mspace|net|partner|php|primary|provider|redir\W.*\W|rotated?|secure|side|smart|sponsor|story|text|view|web)?_?ads?(v?(bot|brite|broker|bureau|butler|cent(er|ric)|click|client|content|coun(cil|t(er)?)|creative|cycle|data(id)?|engage|entry|er(tis\w+|t(pro)?|ve?r?)|farm|force|form|frame(generator)?|gen|gif|groupid|head|ima?ge?|index|info|js|juggler|layer|legend|link|log|man(ager)?|max|mentor(serve)?|meta\.com|net|optimi[sz]er|peeps|pic|po(ol|pup|sition)|proof|q\.nextag|re(dire?c?t?|mote|volver)|rom\.net|rotator|sale|script|sdk|sfac|size|so(lution|nar|urce)|space|srv|stat.*\.asp|sys|(tag)?track|trix|type|view|vt|x\.nu|zone))?s?\d*(status)?\d*(?!\.org)[\W_](?!\w+\.(ac\.|edu)|astra|aware|adurl=|block|login|nl/|sears/|.*(&sbc|\.(wmv|rm)))/
/[^a-z\d=+](get|web)?_?spons?(or(ed|s))?_?(links?)?(pots?)?(\W|_|$)(?!.*sigalert)/
/[^a-z\d=+]\d*((cible|com|context|double|euros4|fast|fine|pay-by-|smart|specific|value)_?clicks?|clicks?(2net|adhere|ban\.php|bank\.net|over|sor\.com|tag|thrutraffic|trade|xchange))(\W|$)(?!but)/
/[^a-z]banners?[/._-]?(.*(\d+x\d+\.swf|\.f?pl|_hits\.asp\?|redir|siteid=)|\.(cgi|js|php)|ad|affiliate|central|click|connect|count|current|exchange|file|grocery|id|man(age(ment|r)|ia)|newsletter|/_?promo|/rotat|/?script|serve|skyscraper|space\.|swap|tausch|trust)/
/\.(adquest|site-id|geldrace)\.nl/
/\.emediate.\w{2,3}/
/\W(absolutebm|aff_manager|annon(s(er)?|coer)|anzeigenklick|bannerit|centrport|clickad|clk_thru|contextuallinks?|falkag|klipmart|mainos(include)?|mediaturf|nyadmcncserve|offerfusion|partnermanager|paypopup|redirect.*banner|sitecatalyst|tacoda|tns-gallup|weborama|werbung|(hit|spin|google/)box(?!\.org))[\W_]/
/\W(adcase|(affiliate|popdown|view4)cash|allsponsor|deluxelink|gonamic|ivwbox|mediavantage|pay4klick|popexchange|ptadsrv|superclix|tfag|webmaster24|zanox-affiliate)\.de/
/\Wad(id=(?!$)|v(\W.*track|(/[^/]+|\d+)\.[gjs])|/house|renaline(\.cz|sk\.sk))/
/\Wimg(is|ehost)\.com/
/\Woverlay.js/
/\d+x\d+.*scraper/
/about.com/\d/(?!.*\.js)/
/banman(\.asp|pro)/
/bs\d{3,}\.gmx/
/direct(ivepub|orym|track)\.com/
/imdb.com.*\.swf/
/instant(attention|buzz)\.com/
/intelli(-direct\.com|srv\.(js|net)|txt)/
/inter(click|polls)\.com/
/link(buddies|exchange|share|synergy)/
/market(ing(/images/\d|/?promo)|banker\.com)/
/media((next|plazza)\.com|onenetwork\.net)/
/oasis(i.{0,3}\.php|\.zmh)/
/partner(\.eniro\.|2profit\.com)/
/popu(larix\.com|nder\W|pad\W|pkp)/
/qks(rv|z)\.net/
/regnow\.com.*promos/
/search(cactus|feed)\.com/
/shopping.msn.com/.*ptnrId=/
/sonnerie.*get.top/
/traffic(mp|system)\.com/
/yimg\.com(.*/adv/|/a[^u])(?!vision)/
0instant.com
1100i.com
125x125.com
265.com
2o7.net
action.ientry.net
adserveredirect
afcyhf.com
affistats.com
aftrack.asp
anrdoezrs.net
artbanners/task,clk
atdmt.com
atwola.com
audiencematch.net
avolutia.com
awaps.net
awltovhc.com
awrz.net
baventures.com
bbmedia.cz
belnk.com
bidvertiser.com
bluestreak.com
bncnt.com
bns1.net
bridgetrack.com
bs.yandex.ru
budsinc.com
cashregie.com
cc-dt.com
checkm8.com
chitika.net
cjt1.net
commission-junction.com
connextra.com
cpaffiliates.net
custom-click.com
cxtlive.com
dbbsrv.com
dgm2.com
did-it.com
dope.dk
ekmas.com
eshopoffer.aspx
espotting.com
etology.com
eyewonder.com
factortg.com
filetarget.com
filitrac.com
findology.com
floppybank.com
forrestersurveys.com
ftjcfx.com
geocities.com/js_source/
getban.php
getfound.com
hb.lycos.com
idregie.com
ifactz.com
impact.as
imrworldwide.com
indiads.com
industrybrains.com
inetinteractive.com
insightfirst.com
java.yahoo.com/a
kanoodle.com
kelkoo.fr
keymedia.hu
kontera.com
lapi.ebay.
lduhtrp.net
leadhound.com
localxml.com
log.go.com
lycos.com/catman/
maxserving.com
mercuras.com
metaffiliation.com
midaddle.com
mms3.com
myreferer.com
mytemplatestorage.com
myway.com/getSponsLinks
netavenir.com
netshelter.net
northmay.com
nvidium.com
nytimes.com/marketing
oclus.com
omguk.com
pro-market.net
promotionad
publicidad.js
questionmarket.com
rad.msn.com
realmedia.com
redcolobus.com
redsheriff.com
regnow.com
ru4.com
serving-sys.com
shareasale.com
showyoursite.com
si-net.se
smarttargetting.co
spotsystems.info
sublimemedia.net
subscriptionrocket.com
suitesmart.com
targetpoint.com
tipsurf.com
toplaboom.com
tqlkg.com
tradedoubler.com
urltrak.com
utarget.co.uk
webex.ru
yceml.net
zedo.com
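
A rough sketch of how a list in this shape could be applied to candidate URLs, assuming the Adblock convention that a line wrapped in slashes is a regular expression and any other line is a plain substring. The function names and the four sample lines are picked for illustration. The patterns were written for a JavaScript regex engine, so this sketch simply skips any that Python's re module refuses to compile, and it does not handle Adblock wildcards.

```python
import re


def compile_rules(lines):
    """Turn Adblock-style filter lines into compiled, case-insensitive regexes."""
    rules = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if len(line) > 2 and line.startswith("/") and line.endswith("/"):
            pattern = line[1:-1]       # /.../ means a regular expression
        else:
            pattern = re.escape(line)  # anything else is a literal substring
        try:
            rules.append(re.compile(pattern, re.IGNORECASE))
        except re.error:
            # Assumption: a pattern Python cannot compile is skipped, not fatal.
            continue
    return rules


def is_blocked(url, rules):
    """Return True when any rule matches anywhere in the URL."""
    return any(rule.search(url) for rule in rules)


# A few lines copied from the list above.
SAMPLE = [
    r"/(bf|flyc|unic)ast\.com/",
    r"/qks(rv|z)\.net/",
    "zedo.com",
    "geocities.com/js_source/",
]
rules = compile_rules(SAMPLE)
```

Compiling the list once and reusing the rules for every URL keeps the per-URL cost down to the matching itself.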
--- "Insurance Squared Inc."
<[EMAIL PROTECTED]> wrote:
> A bit more info and maybe another concern.
>
> Here's an example of a url that got crawled:
>
> http://search.aol.ca/redir?urn=http://www.tachyonlabs.com/games.html&url=http://www.tachyonlabs.com/games.html&requestId=fc3678a2dd20b2da&clickedItemRank=1&source=aoldirectory&searchType=MS&query=Games
>
> Not bad on the surface, however as I mentioned, this seems to be coming
> from a dynamic search - and there's a whole lot of them. Should
> we/could we be doing something to stop this?
>
> Secondly, that page is actually a redirect. It's crawling and indexing
> the redirected page. That'd be fine, except we've got some regular
> expressions in the filter that would prevent this redirected site from
> being indexed. However since the original url does pass (and the
> redirected doesn't) we end up with sites that are getting past the regex
> in the filter. Any general thoughts on how we might start to tackle this?
>
> Thanks.
>
>
> Insurance Squared Inc. wrote:
>
> > We're running a crawl using nutch and the last crawl seemed to be
> > taking a long time. Looking at the output, it seems it's gone into
> > AOL's search and is actually crawling search results (it's also
> > crawling some cgi-bin search results page on another site). This sure
> > seems like it could go on forever.
> >
> > Admittedly we haven't looked at this very deeply yet (I'm not sure why
> > it's got so many search pages on AOL to crawl), but this strikes me
> > that it's likely a common occurrence if it's acting that way. Is
> > there something we should be doing to prevent this situation?
> >
> > Thanks.
> >
>
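
On the redirect question quoted above, one possible approach is to run the filter over every URL in the redirect chain, and also over any target URL embedded in the query string, so a page is rejected if any hop fails. The sketch below assumes the crawler can hand you the chain as a list of URLs. The BLOCKED rule, the example hosts, and the "url"/"urn" parameter names (borrowed from the AOL example) are all hypothetical stand-ins, not anything Nutch provides.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Hypothetical stand-in for the real exclusion regexes in the filter.
BLOCKED = [re.compile(r"^https?://blocked\.example/", re.IGNORECASE)]


def is_allowed(url):
    """Return True when no exclusion pattern matches the URL."""
    return not any(pattern.search(url) for pattern in BLOCKED)


def embedded_targets(url, param_names=("url", "urn")):
    """Return redirect targets carried in the query string, if any."""
    query = parse_qs(urlsplit(url).query)
    return [value for name in param_names for value in query.get(name, [])]


def chain_is_allowed(chain):
    """chain lists the URLs visited in order, original first, final page last."""
    for url in chain:
        if not is_allowed(url):
            return False
        if any(not is_allowed(target) for target in embedded_targets(url)):
            return False
    return True
```

Checking the embedded target means a redirect to an excluded site can be dropped before it is ever fetched.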
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general