Hi Sébastien,

Yahoo! just hosed my message; glad I had it elsewhere.

> As you probably saw in the OutlinkExtractor class, the links are
> extracted with a Regexp.

Ahh, I didn't see it before, but I now see URL_PATTERN. I know it's minor, but if you later apply Perl5Compiler.CASE_INSENSITIVE_MASK, you don't need to spell out a-zA-Z.

For me, there are two groups of issues.

1. URL_PATTERN issues. URL_PATTERN matches things that aren't really links, and doesn't match things that are links. This is what you talk about below, and what your JIRA issue covers.

It appears that URL_PATTERN aims to match anything that looks like a link on the page, whether it is in a tag or not. I wonder about finding the tags first, and then looking for links inside the tags. It seems a little much to crawl URLs that are plain text and not in a tag, since they aren't really links.

A simple tag regex in Perl looks like this:

    /<[^>]+?>/

Extracting key/value pairs is a little harder, but given $key, the second capture group of this regex returns the value:

    m/(?<!\w|\.)\Q$key\E\s*=(["']?)(|.*?[^\\])\1(\s|>)/is

This regex has gone through some rather rigorous testing and use. Getting $key interpolated into the regex seems a little harder in Java, but it seems we would mostly be looking for href and src, and maybe not even src. The regex for href looks like this:

    m/(?<!\w|\.)href\s*=(["']?)(|.*?[^\\])\1(\s|>)/is

If you commit your TestPattern.java stuff, I would be happy to add cases and play with the regexes until all the cases work.

2. If a link is encountered that throws an exception anywhere in this call

    outlinks.add(new Outlink(url, anchor))

then no more links get extracted from the page. That is what my JIRA issue (http://issues.apache.org/jira/browse/NUTCH-120) covers.

Thanks for looking into this.

Earl
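
P.S. To make the two points concrete, here is a rough sketch in Java. It is only an illustration under a few assumptions, not what OutlinkExtractor does today: it uses java.util.regex instead of the Perl5Compiler classes, the names HrefSketch, attrPattern and extractLinks are made up, and new URI(url) is just a stand-in for whatever in the Outlink call might throw. Pattern.quote plays the role of \Q$key\E, and the try/catch sits around each link so that one bad link is skipped rather than ending extraction for the whole page.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; class and method names are not from Nutch.
class HrefSketch {
    // Java form of the Perl tag regex /<[^>]+?>/
    private static final Pattern TAG = Pattern.compile("<[^>]+?>");

    // Java form of m/(?<!\w|\.)\Q$key\E\s*=(["']?)(|.*?[^\\])\1(\s|>)/is
    // Pattern.quote(key) stands in for \Q$key\E; group 2 holds the value.
    static Pattern attrPattern(String key) {
        return Pattern.compile(
            "(?<!\\w|\\.)" + Pattern.quote(key) + "\\s*=([\"']?)(|.*?[^\\\\])\\1(\\s|>)",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    }

    private static final Pattern HREF = attrPattern("href");

    // Find each tag, then look for an href inside that tag only.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher tag = TAG.matcher(html);
        while (tag.find()) {
            Matcher href = HREF.matcher(tag.group());
            if (!href.find()) {
                continue; // tag without an href attribute
            }
            String url = href.group(2);
            try {
                // Stand-in for the step that may throw on a bad link.
                new URI(url);
                links.add(url);
            } catch (URISyntaxException e) {
                // Skip this link only; keep processing the rest of the page.
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://example.com/a\">A</a> "
                    + "<a href=\"http://bad host/x\">B</a> "
                    + "<A HREF=foo.html>C</A>";
        System.out.println(extractLinks(html));
    }
}
```

In this sketch the middle link is dropped because the space makes it an invalid URI, while the links before and after it are still collected. Plain-text URLs outside of tags are never considered, which is the behaviour I was wondering about in point 1.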
