Re: [Nutch-general] DmozParser Question

Alan Tanaman Thu, 28 Dec 2006 15:03:57 -0800

Justin,

Normally, you can put the hyphen at the start (or end) of the [] character
class, so that it is treated as a literal hyphen and not as a range marker.
However, I have rewritten the regex a little so that it conforms to these
rules:


There should be at least one subdomain level.  Each subdomain level should
be at least one letter or digit, followed by zero or more groups of (a hyphen
followed by at least one letter or digit), followed by a period/full stop.

Therefore, the hyphen is now outside the [] anyway (it must be surrounded by
letters or digits, so it cannot be within the []):
+^http://([a-z0-9]+(-[a-z0-9]+)*\.)+co\.uk/

Hope that does the trick (haven't actually tested it though...).
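
Since the pattern was posted untested, here is a minimal sketch for checking it. It uses Python's re module as a stand-in for the Java regex engine that Nutch uses; for the simple constructs here the two should behave the same, but that is an assumption. The leading "+" in regex-urlfilter.txt is Nutch's include marker and is not part of the pattern. The dots in co.uk are escaped in this sketch so that they match only a literal period. The accepts helper and the sample URLs are made up for illustration.

```python
import re

# Pattern from the message, minus Nutch's leading "+" include marker.
# The dots in "co\.uk" are escaped so they match a literal period only.
PATTERN = re.compile(r"^http://([a-z0-9]+(-[a-z0-9]+)*\.)+co\.uk/")


def accepts(url: str) -> bool:
    """Hypothetical helper: True if the URL would pass the include rule."""
    return PATTERN.search(url) is not None


# Hypothetical sample URLs, not taken from the Dmoz file.
samples = [
    "http://www.my-domain-name.co.uk/",  # hyphens inside a label
    "http://example.co.uk/index.html",   # a single label before co.uk
    "http://-bad.co.uk/",                # label starts with a hyphen
    "http://bad-.co.uk/",                # label ends with a hyphen
    "http://www.example.com/",           # different TLD
]

for url in samples:
    print(url, accepts(url))
```

Under these assumptions, the first two URLs should be accepted and the last three rejected.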

Best regards,
Alan
_________________________
Alan Tanaman
iDNA Solutions

-----Original Message-----
From: Justin Hartman [mailto:[EMAIL PROTECTED]]
Sent: 28 December 2006 22:21
To: [email protected]
Subject: Re: DmozParser Question

Thanks Sean, I appreciate this help. In the interim what I did was
create a regular expression in my regex-urlfilter.txt as follows:

+^http://([a-z0-9]*\.)([a-z0-9]*\.)co.uk/

and then duplicated this for other domains such as .org, .net etc.

In my nutch-site.xml file I added this:

<property>
  <name>urlfilter.regex.file</name>
  <value>regex-urlfilter.txt</value>
</property>

Because I didn't know how to parse the Dmoz file, I figured this would
be a workaround: crawl only the sites for the TLDs I want indexed. The
above works. I'm pleased you sent your mail, because now I can exclude
the other domains before the crawl, which is exactly what I wanted to
do.

The only potential problem I see with my regular expression in
regex-urlfilter.txt is that domains such as www.my-domain-name.co.uk
will not be included because the regex doesn't include hyphens in the
expression.
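
The limitation described above can be reproduced with a short sketch. It uses Python's re module as a stand-in for the Java regex engine that Nutch uses, which is an assumption, and it drops the leading "+" because that is Nutch's include marker and not part of the pattern. The original_accepts helper and the sample URLs are hypothetical. Besides the hyphen issue, the sketch suggests two side effects of the pattern as written: it expects exactly two labels before co.uk, and the unescaped dots in co.uk match any character.

```python
import re

# Pattern as posted, minus Nutch's leading "+" include marker.
ORIGINAL = re.compile(r"^http://([a-z0-9]*\.)([a-z0-9]*\.)co.uk/")


def original_accepts(url: str) -> bool:
    """Hypothetical helper: True if the URL would pass the include rule."""
    return ORIGINAL.search(url) is not None


# Hypothetical sample URLs.
checks = [
    "http://www.example.co.uk/",         # two plain labels
    "http://www.my-domain-name.co.uk/",  # hyphens are not in [a-z0-9]
    "http://example.co.uk/",             # only one label before co.uk
    "http://www.example.coxuk/",         # unescaped "." matches the "x"
]

for url in checks:
    print(url, original_accepts(url))
```

Under these assumptions, only the first and last URLs are accepted; the last one is a false positive caused by the unescaped dot.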

Any ideas how I can refine the regex better?

Regards
Justin

On 12/28/06, Sean Dean <[EMAIL PROTECTED]> wrote:
> This isn't exactly what you're requesting, but it will get the job done in
> about the same time, possibly even less.
>
> Let's use grep on that file:
>
> grep '\.co\.uk/' urls > co-uk-urls
>
> The "\" escapes the "." so that it matches a literal period; unescaped, "."
> is a wildcard that matches any character. The forward slash at the end is
> more useful with other TLDs. For example, using ".ca" without it, you would
> get domains like www.caexample.net, because they still match. The ">"
> redirects the output into our new file, "co-uk-urls", which is ready to be
> injected into the Nutch DB.
>
> Lazy man's solution right here. Enjoy!
>
> ----- Original Message ----
> From: Justin Hartman <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Thursday, December 28, 2006 5:08:30 AM
> Subject: DmozParser Question
>
>
> Hi All
>
> I'm a newbie to Nutch and as such have a few questions. For now I'll
> limit my questions simply because I want to try and see if I can get
> my issues resolved myself but there is a question about the DmozParser
> which I would like to ask.
>
> Does anyone know if it is possible to filter the Dmoz file so that the
> dmoz/url file includes only certain TLDs, such as .co.uk?
>
> I noticed that DmozParser supports both boolean and pattern; however,
> I'm not really sure how to implement it.
>
> Any help appreciated.
> --
> Regards
> Justin Hartman
> PGP Key ID: 102CC123
>


-- 
Regards
Justin Hartman
PGP Key ID: 102CC123

