Hi, Håvard.

OK, thanks a lot! I'll apply this filter now.
One more thing: if I disallow the 'com' zone and my URL file contains some .com domains, will they be indexed or not?
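To check a filter before running a crawl, the rule file can be approximated outside Nutch. The following is only a rough sketch, assuming the behaviour described in the comments of regex-urlfilter.txt: rules are tried in order, the first pattern found anywhere in the URL decides, and a URL that matches nothing is ignored. The names `parse_rules`, `accepts` and `RULES_GE` are hypothetical helpers, and the ".ge" rule is an adaptation of the filter quoted below, not a tested Nutch configuration. Python's `re` stands in for Java's regex engine here, which should be equivalent for patterns this simple.

```python
import re

# Hypothetical rule file: the filter quoted in this thread, narrowed to ".ge".
# "[^/]*" keeps the match inside the host part of the URL.
RULES_GE = r"""
+^http://[^/]*\.ge/
-.*
"""


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (allow, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First matching rule wins; a URL that matches no rule is rejected."""
    for allow, pattern in rules:
        if pattern.search(url):  # unanchored search, so '^' matters
            return allow
    return False


if __name__ == "__main__":
    rules = parse_rules(RULES_GE)
    for url in (
        "http://www.example.ge/",
        "http://example.ge/index.html",
        "http://www.example.com/",
        "http://www.example.com/page.ge/",
    ):
        print(url, "->", "keep" if accepts(rules, url) else "drop")
```

In this sketch the two example.ge URLs are kept and the two example.com URLs are dropped. A `.*` in the host part, as in the `www\..*\.ge/` candidates below, can also match a `.ge/` that only appears later in the path, which is why `[^/]*` is used instead. Whether Nutch applies the filter to the injected URL file as well is not something this sketch can show.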
> Like this
>
> +http://[^/]*\.(com|org|net|biz|mil|us|info|cc)/
> -.*
>
> see:
> http://www.mail-archive.com/[email protected]/msg00479.html
>
> Dima Mazmanov wrote:
>> I'm not adding URLs to the urlfilter files.
>> Besides, I still don't understand how to allow only one zone in
>> urlfilter.
>> Let's say I want to index only the ".ge" zone.
>> Which one of the following filters is correct?
>>
>> +^http://([a-z0-9]*\.)*([a-z0-9]*\.).ge/
>> +^http://([a-z0-9\-\.]*\.)*.ge/
>> +^http://([a-z0-9\-\.])*.ge/
>> +^http://www\..*\.ge/
>> +^http://www\..*\.*\.ge/
>>
>> By the way, if the site you are indexing is dynamic, you may just
>> disallow indexing of www.bbc.co.uk and index only the second one.
>>
>>> So what filter settings do you use?
>>> Like this: +^http://([a-z0-9]*\.)*bbc.co.uk/
>>> Then you will get bbc.co.uk and www.bbc.co.uk <http://www.bbc.co.uk/>,
>>> and since this site is dynamic, the content might be different.
>>> Have the same problem myself :-(
>>>
>>> -----------------------------------
>>> Well, my script already contains this command....
>>>
>>> Run bin/nutch dedup segments dedup.tmp
>>>
>>> Dima Mazmanov wrote:
>>>
>>> Hi all!! I'm running nutch-0.7.1.
>>>
>>> Here is the result of my search.
>>>
>>> ArGo Software Design Homepage [html] - 30.2 k - ... Look of our
>>> Web Site Our web site has new look and ... link on the ...
>>> http://www.argosoft.org/RootPages/Default.aspx (Cached)
>>>
>>> ArGo Software Design Homepage [html] - 30.2 k - ... Look of our
>>> Web Site Our web site has new look and ... link on the ...
>>> http://www.argosoft.com/rootpages/Default.aspx (Cached)
>>>
>>> ArGo Software Design Homepage [html] - 30.2 k - ... Look of our
>>> Web Site Our web site has new look and ... link on the ...
>>> http://www.argosoft.com/RootPages/Default.aspx (Cached)
>>>
>>> ArGo Software Design Homepage [html] - 30.2 k - ... Look of our
>>> Web Site Our web site has new look and ... link on the ...
>>> http://www.argosoft.org/rootpages/Default.aspx (Cached)
>>>
>>> As you can see, one result is shown multiple times.
>>> Why so? What is the difference between these links? I don't
>>> see any.
>>> So, how can I avoid this problem?
>>> Thanks, Regards, Dima
>>>
>>> __________ NOD32 1.1497 (20060419) Information __________
>>>
>>> This message was checked by NOD32 antivirus system.
>>> http://www.eset.com

--
Regards,
Dima mailto:[EMAIL PROTECTED]

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
