Re: crawl problems (a bug/patch)

Earl Cahill Thu, 20 Oct 2005 13:29:42 -0700

Hi Sébastien,

Yahoo! just hosed my message, glad I had it elsewhere.


> As you probably saw in the OutlinkExtractor class, the links are
> extracted with a Regexp.

Ahh, didn't see it before, but I now see URL_PATTERN. 


I know it's minor, but if you later apply
Perl5Compiler.CASE_INSENSITIVE_MASK, the pattern doesn't
need to spell out a-zA-Z; a-z alone is enough.
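
To illustrate the case-insensitivity point, here is a minimal sketch.
It uses java.util.regex rather than the ORO Perl5Compiler so that it
stands alone, and the SCHEME pattern is a made-up stand-in, not the
real URL_PATTERN from OutlinkExtractor.

```java
import java.util.regex.Pattern;

class CaseInsensitiveDemo {
    // Hypothetical scheme pattern (an assumption for illustration only).
    // With the case-insensitive flag set, [a-z] also matches A-Z.
    static final Pattern SCHEME =
        Pattern.compile("[a-z]+://", Pattern.CASE_INSENSITIVE);

    // True if the string begins with something like "http://" in any case.
    static boolean startsWithScheme(String s) {
        return SCHEME.matcher(s).lookingAt();
    }

    public static void main(String[] args) {
        System.out.println(startsWithScheme("HTTP://example.com/"));
        System.out.println(startsWithScheme("no scheme here"));
    }
}
```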

For me, there are two groups of issues.

1.  URL_PATTERN issues.  URL_PATTERN matches things
that aren't really links, and doesn't match things
that are links.  This is what you talk about below,
and what your JIRA issue covers.

It appears that URL_PATTERN aims to match anything
that looks like a link on the page, whether in a tag
or not.  I wonder about finding the tags first, and then
looking for links inside the tags.  It seems a little much to
crawl urls that are plain text and not in a tag, since
they aren't really links.  A simple tag regex in perl
looks like this: /<[^>]+?>/.
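
A rough Java sketch of the "find the tags first" idea, using the same
tag regex with java.util.regex.  The class and method names are made
up for illustration.  Note that a '>' inside a quoted attribute value
would cut the tag short, so this is a starting point, not a parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagFinder {
    // Same regex as the perl version: '<', one or more non-'>' chars, '>'.
    static final Pattern TAG = Pattern.compile("<[^>]+?>");

    // Returns every tag found in the input, in document order.
    static List<String> findTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }
}
```

With this approach a plain-text url outside any tag is never seen by
the link-extraction step, because only the tag text is handed on.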

Extracting key/value pairs is a little harder, but
given $key, the second capture group of this regex

m/(?<!\w|\.)\Q$key\E\s*=(["']?)(|.*?[^\\])\1(\s|>)/is

holds the value.  This regex has gone through some
rather rigorous testing/use.

Getting $key interpolated into the regex seems a
little harder in java, but it seems we would mostly be
looking for href, and src, and maybe not even src.

The regex for href looks like this

m/(?<!\w|\.)href\s*=(["']?)(|.*?[^\\])\1(\s|>)/is
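
For what it's worth, here is how the same regex might be written in
Java.  This is a sketch under assumptions: it uses java.util.regex
(not ORO), Pattern.quote stands in for \Q...\E, and the /is modifiers
become CASE_INSENSITIVE | DOTALL.  The class and method names are
hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AttributeValue {
    // Builds the pattern for a given attribute name, e.g. "href" or "src".
    static Pattern forKey(String key) {
        return Pattern.compile(
            "(?<!\\w|\\.)" + Pattern.quote(key)
                + "\\s*=([\"']?)(|.*?[^\\\\])\\1(\\s|>)",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    }

    // Returns the attribute value (capture group 2), or null if absent.
    static String valueOf(String key, String tag) {
        Matcher m = forKey(key).matcher(tag);
        return m.find() ? m.group(2) : null;
    }
}
```

Something like AttributeValue.valueOf("href", tag) could then be run
against each tag, so $key interpolation is just string concatenation.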

If you commit your TestPattern.java stuff, I would be
happy to add cases and play with the regexes until all
the cases work.

2.  If a link is encountered that throws an exception
anywhere in this call

outlinks.add(new Outlink(url, anchor))

then no more links will get extracted on the page. 
That is what my JIRA issue
(http://issues.apache.org/jira/browse/NUTCH-120)
covers.
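
One way to keep a single bad link from stopping extraction is to catch
the exception per link.  The sketch below is only an illustration: the
nested Outlink class is a hypothetical stand-in (the real Nutch
constructor and the exception it throws may differ), and this is not
necessarily what the patch on the JIRA issue does.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

class SafeOutlinks {
    // Hypothetical stand-in for Nutch's Outlink, assumed to reject bad urls.
    static final class Outlink {
        final String url;
        final String anchor;

        Outlink(String url, String anchor) throws MalformedURLException {
            new URL(url); // throws MalformedURLException on unparseable urls
            this.url = url;
            this.anchor = anchor;
        }
    }

    // Adds every url that can be turned into an Outlink; bad ones are skipped.
    static List<Outlink> collect(List<String> urls) {
        List<Outlink> outlinks = new ArrayList<>();
        for (String url : urls) {
            try {
                outlinks.add(new Outlink(url, ""));
            } catch (MalformedURLException e) {
                // Skip this link only; keep extracting the rest of the page.
            }
        }
        return outlinks;
    }
}
```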

Thanks for looking into this.

Earl


                
