Hi, Håvard.

OK, thanks a lot! I'll apply this filter now.
One more thing:
if I disallow the 'com' zone and my URL file contains some com domains,
would they be indexed or not?
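
For reference, a minimal sketch (in Python, not Nutch's own Java) of how an ordered list of +/- rules can be evaluated, assuming the behaviour described in the comments of regex-urlfilter.txt: the first matching pattern decides, and a URL that matches no pattern is ignored. The rule set and the helper names `parse_rules` and `accepts` are hypothetical. The sketch only models the filter itself; whether URLs from the seed file pass through the filter in nutch-0.7.1 is not shown here.

```python
import re

def parse_rules(text):
    """Parse '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """First matching rule decides; a URL that matches no rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False

# Hypothetical filter: reject the com zone, accept everything else.
RULES = parse_rules(r"""
-^http://[^/]*\.com/
+.
""")

for url in ("http://www.argosoft.com/RootPages/Default.aspx",
            "http://www.argosoft.org/RootPages/Default.aspx"):
    print(url, "->", "accepted" if accepts(RULES, url) else "rejected")
```

Under these assumptions, any URL that reaches the filter and matches the '-' rule first is dropped, wherever it came from.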



> Like this

> +http://[^/]*\.(com|org|net|biz|mil|us|info|cc)/
> -.*

> see:
> http://www.mail-archive.com/[email protected]/msg00479.html
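
The quoted rule and the candidate patterns from this thread can be tried outside Nutch with a short Python sketch. It assumes search-style (unanchored) matching and that Python's `re` treats these particular patterns the same way Nutch's regex engine does. The "ge-adapted" entry is the quoted multi-zone rule reduced to a single zone, which is an adaptation for illustration. The test URLs are hypothetical placeholders.

```python
import re

# Patterns copied from this thread, without the leading "+".
# "ge-adapted" is the quoted multi-zone rule cut down to one zone (an assumption).
PATTERNS = {
    "ge-adapted":  r"http://[^/]*\.ge/",
    "candidate 1": r"^http://([a-z0-9]*\.)*([a-z0-9]*\.).ge/",
    "candidate 2": r"^http://([a-z0-9\-\.]*\.)*.ge/",
    "candidate 3": r"^http://([a-z0-9\-\.])*.ge/",
    "candidate 4": r"^http://www\..*\.ge/",
    "candidate 5": r"^http://www\..*\.*\.ge/",
}

# Hypothetical test URLs; example.ge and example.com are placeholders.
URLS = [
    "http://www.example.ge/",
    "http://example.ge/",
    "http://www.example.com/",
    "http://www.example.com/mirror/site.ge/",  # ".ge/" appears only in the path
]

def matches(pattern, url):
    """True when the pattern is found anywhere in the URL (search semantics)."""
    return re.search(pattern, url) is not None

# One row per pattern, one yes/no column per URL in the order listed above.
for name, pattern in PATTERNS.items():
    row = ["yes" if matches(pattern, url) else "no " for url in URLS]
    print(f"{name:12}", *row)
```

Two details are worth watching in the output: `[^/]*` keeps the match inside the host part of the URL, and an unescaped `.` before `ge` matches any character rather than a literal dot.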

> Dima Mazmanov wrote:
>> I'm not adding urls into urlfilter files.
>> Besides, I still don't understand how to allow only one zone in 
>> urlfilter.
>> Let's say I want to index only ".ge" zone.
>> Which one of the following filters is correct?
>>
>> +^http://([a-z0-9]*\.)*([a-z0-9]*\.).ge/
>> +^http://([a-z0-9\-\.]*\.)*.ge/
>> +^http://([a-z0-9\-\.])*.ge/
>> +^http://www\..*\.ge/
>> +^http://www\..*\.*\.ge/
>>
>> By the way, if the site you are indexing is dynamic, you can simply
>> exclude www.bbc.co.uk from indexing and index only the second one.
>>
>>
>>> So what filter settings do you use?
>>> Like this +^http://([a-z0-9]*\.)*bbc.co.uk/
>>> Then you will get both bbc.co.uk and www.bbc.co.uk, and
>>> since this site is dynamic, the content might be different.
>>> Have the same problem myself :-(
>>>
>>>
>>>
>>>
>>> -----------------------------------
>>> Well my script already contains this command....
>>>
>>>
>>>
>>>
>>>    Run bin/nutch dedup segments dedup.tmp
>>>
>>>
>>>    Dima Mazmanov wrote:
>>>
>>>        Hi all!! I'm running on nutch-0.7.1.
>>>
>>>        Here is result of my search.
>>>
>>>
>>>        ArGo Software Design Homepage [html] - 30.2 k - ... Look of our
>>>        Web Site Our web site has new look and ... link on the ...
>>>        http://www.argosoft.org/RootPages/Default.aspx (Cached)
>>>
>>>        ArGo Software Design Homepage [html] - 30.2 k - ... Look of our
>>>        Web Site Our web site has new look and ... link on the ...
>>>        http://www.argosoft.com/rootpages/Default.aspx (Cached)
>>>
>>>        ArGo Software Design Homepage [html] - 30.2 k - ... Look of our
>>>        Web Site Our web site has new look and ... link on the ...
>>>        http://www.argosoft.com/RootPages/Default.aspx (Cached)
>>>
>>>        ArGo Software Design Homepage [html] - 30.2 k - ... Look of our
>>>        Web Site Our web site has new look and ... link on the ...
>>>        http://www.argosoft.org/rootpages/Default.aspx (Cached)
>>>
>>>        As you can see, one result is shown multiple times.
>>>        Why is that? What is the difference between these links? I don't
>>>        see any.
>>>        So, how can I avoid this problem?
>>>        Thanks, Regards, Dima
>>>
>>>
>>>
>>>
>>>
>>>
>>> __________ NOD32 1.1497 (20060419) Information __________
>>>
>>> This message was checked by NOD32 antivirus system.
>>> http://www.eset.com
>>>
>>>
>>
>>







-- 
Regards,
 Dima                          mailto:[EMAIL PROTECTED]



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
