How about conf/crawl-urlfilter.txt?

Marcin
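
To make the suggestion concrete, here is a minimal sketch of what such a filter file might look like. It assumes the stock regex URL filter format, where each rule is a `+` (accept) or `-` (reject) followed by a regular expression and the first matching rule decides. The pattern is copied unchanged from the original message; nothing here has been tested against the actual site.

```
# conf/crawl-urlfilter.txt (sketch, untested)
# Each rule: '+' accepts, '-' rejects; the first matching rule wins.

# Accept only URLs matching the pattern from the original message.
# Note: in Java regex syntax, \( and \) match literal parentheses in the URL.
# If they were meant as grouping, the backslashes would have to be dropped.
+^http://([a-z0-9]*\.)example.com/([a-zA-Z]*)-\([a-z0-9]*\)-.*-\([0-9]*-[A-Za-z0-9]*\)\.html$

# Reject everything else.
-.
```

One caveat with this approach: the filter is applied to every URL the crawler sees, including the seed and any intermediate pages. If the start page (for example http://www.example.com/) does not itself match the pattern, it would be rejected too and the crawl would likely never reach the matching pages. In that case a broader accept rule for the site may be needed, with the narrower pattern applied at a later stage.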
On 5/4/07, simon_ece <[EMAIL PROTECTED]> wrote:
Hi all,

I am new to Nutch. I would like to crawl a particular site and get only the results that match the following pattern. I don't want to list any other URLs from the crawled site.

Site to be crawled, e.g.: www.example.com

^http://([a-z0-9]*\.)example.com/([a-zA-Z]*)-\([a-z0-9]*\)-.*-\([0-9]*-[A-Za-z0-9]*\)\.html$

I can crawl the site and get all the matching URLs, but I don't know how to filter out the other URLs and keep only these particular ones. Kindly post your suggestions.

Thanks & Regards,
Simon

--
View this message in context: http://www.nabble.com/Nutch---Filtering-%28REGEX%29-tf3690583.html#a10318059
