Justin,

Normally, you can include the hyphen at the start (or end) of the [], so that it is treated as a literal hyphen and not a range marker. However, I have rewritten the regex a little so that it conforms to a few rules:
- There should be at least one subdomain level.
- Each subdomain level should be at least one letter or digit, followed by zero or more of (hyphen followed by at least one letter or digit), followed by a period/full stop.

Therefore, the hyphen is now outside the [] anyway (it must be surrounded by letters or digits, so it cannot be within the []):

+^http://([a-z0-9]+(-[a-z0-9]+)*\.)+co\.uk/

Hope that does the trick (haven't actually tested it though...).

Best regards,
Alan
_________________________
Alan Tanaman
iDNA Solutions

-----Original Message-----
From: Justin Hartman [mailto:[EMAIL PROTECTED]
Sent: 28 December 2006 22:21
To: [email protected]
Subject: Re: DmozParser Question

Thanks Sean, I appreciate this help.

In the interim what I did was create a regular expression in my regex-urlfilter.txt as follows:

+^http://([a-z0-9]*\.)([a-z0-9]*\.)co.uk/

and then duplicated this for other domains such as .org, .net etc.

In my nutch-site.xml file I added this:

<property>
  <name>urlfilter.regex.file</name>
  <value>regex-urlfilter.txt</value>
</property>

Because I didn't know how to parse the Dmoz file, I figured this would be a work-around that only crawls the sites for the TLDs I want indexed, and the above works. I'm pleased you sent your mail because now I can exclude the other domains before the crawl, which is exactly what I wanted to do.

The only potential problem I see with my regular expression in regex-urlfilter.txt is that domains such as www.my-domain-name.co.uk will not be included, because the regex doesn't allow hyphens. Any ideas how I can refine the regex?

Regards
Justin

On 12/28/06, Sean Dean <[EMAIL PROTECTED]> wrote:
> This isn't exactly what you're requesting, but it will get the job done in about the same time, possibly even less.
>
> Let's use grep on that file:
>
> grep '\.co\.uk/' urls > co-uk-urls
>
> The "\" tells grep to match a literal "." in the search; normally "." is a wildcard that matches any character.
> The forward slash at the end is more useful with other TLDs. For example, using ".ca" without it you would get domains like www.caexample.net, because they still match. The ">" redirects the output into our new file, "co-uk-urls", which is ready to be injected into the Nutch DB.
>
> Lazy man's solution right here. Enjoy!
>
> ----- Original Message ----
> From: Justin Hartman <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Thursday, December 28, 2006 5:08:30 AM
> Subject: DmozParser Question
>
>
> Hi All
>
> I'm a newbie to Nutch and as such have a few questions. For now I'll
> limit my questions simply because I want to try and see if I can get
> my issues resolved myself but there is a question about the DmozParser
> which I would like to ask.
>
> Does anyone know if it is possible to filter the Dmoz file to only
> include certain TLDs, such as .co.uk only, in the dmoz/url file?
>
> I noticed that DmozParser supports both boolean and pattern; however,
> I'm not really sure how to implement it.
>
> Any help appreciated.
> --
> Regards
> Justin Hartman
> PGP Key ID: 102CC123
>

--
Regards
Justin Hartman
PGP Key ID: 102CC123

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
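
Since the rewritten regex in this thread was posted untested, a small Python sketch like the one below could be used to try both filter patterns outside of Nutch. It rests on a few assumptions: Python's re module is used as a stand-in for the Java regex engine (the two should agree for patterns this simple), the leading "+" is dropped because it is regex-urlfilter.txt syntax for "include" rather than part of the regex, and the dots in co.uk are escaped in both patterns. The names ORIGINAL, REVISED and accepted, and the sample URLs, are made up for illustration.

```python
import re

# Both patterns from the thread, without the leading "+" used by
# regex-urlfilter.txt. Assumption: dots in "co.uk" are escaped here.
ORIGINAL = re.compile(r"^http://([a-z0-9]*\.)([a-z0-9]*\.)co\.uk/")
REVISED = re.compile(r"^http://([a-z0-9]+(-[a-z0-9]+)*\.)+co\.uk/")


def accepted(pattern, url):
    """Return True if the URL would pass the given include pattern."""
    return pattern.search(url) is not None


# Hypothetical sample URLs, not taken from the Dmoz file.
SAMPLE_URLS = [
    "http://www.example.co.uk/",
    "http://www.my-domain-name.co.uk/",
    "http://example.co.uk/",
    "http://-bad.co.uk/",
    "http://www.example.com/",
]

if __name__ == "__main__":
    for url in SAMPLE_URLS:
        print(url, accepted(ORIGINAL, url), accepted(REVISED, url))
```

In this sketch the original pattern rejects the hyphenated host, and also any host with fewer or more than two labels before co.uk, while the revised pattern accepts hyphens only between letters or digits.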

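
The point about the trailing slash in the grep approach can also be sketched in Python. This is only an approximation of running grep '\.co\.uk/' urls > co-uk-urls: the function name filter_urls and the sample URLs are hypothetical, and the sketch works on an in-memory list rather than on the real urls file.

```python
import re


def filter_urls(urls, pattern):
    """Keep the URLs that contain a match for the pattern, like grep does per line."""
    regex = re.compile(pattern)
    return [url for url in urls if regex.search(url)]


# Hypothetical sample URLs for illustration only.
urls = [
    "http://www.example.ca/",
    "http://www.caexample.net/",
    "http://www.example.co.uk/page.html",
]

# With the trailing slash, ".ca" must be the end of the host name.
with_slash = filter_urls(urls, r"\.ca/")

# Without it, any host containing ".ca" matches, including www.caexample.net.
without_slash = filter_urls(urls, r"\.ca")

if __name__ == "__main__":
    print(with_slash)
    print(without_slash)
```

One caveat with this kind of filter is that a URL written without any slash after the host name would not match the slash-terminated pattern.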