Hi Earl,
--- Earl Cahill <[EMAIL PROTECTED]> wrote:
> I know it's minor, but if you later apply
> Perl5Compiler.CASE_INSENSITIVE_MASK, you don't need to
> do a-zA-Z.
Indeed.
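To make the point concrete, here is a minimal sketch. It uses
java.util.regex and its Pattern.CASE_INSENSITIVE flag as a stand-in for
ORO's Perl5Compiler.CASE_INSENSITIVE_MASK, and the scheme pattern is a
simplified, hypothetical one, not the real URL_PATTERN:

```java
import java.util.regex.Pattern;

public class CaseInsensitiveSketch {
    // Hypothetical, simplified patterns; NOT the actual URL_PATTERN.
    // Explicit character class covering both cases.
    static final Pattern EXPLICIT = Pattern.compile("[a-zA-Z]+://\\S+");
    // Lower-case class only, with no flag: misses upper-case schemes.
    static final Pattern LOWER_ONLY = Pattern.compile("[a-z]+://\\S+");
    // Lower-case class plus the case-insensitive flag.
    static final Pattern FLAGGED =
        Pattern.compile("[a-z]+://\\S+", Pattern.CASE_INSENSITIVE);

    static boolean matchesExplicit(String s) {
        return EXPLICIT.matcher(s).matches();
    }

    static boolean matchesLowerOnly(String s) {
        return LOWER_ONLY.matcher(s).matches();
    }

    static boolean matchesFlagged(String s) {
        return FLAGGED.matcher(s).matches();
    }

    public static void main(String[] args) {
        String url = "HTTP://Example.COM/Index.html";
        System.out.println(matchesExplicit(url));
        System.out.println(matchesLowerOnly(url));
        System.out.println(matchesFlagged(url));
    }
}
```

With the flag set, [a-z] and [a-zA-Z] accept the same input, so the
shorter class is enough.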
>
> For me, there are two groups of issues.
>
> 1. URL_PATTERN issues. URL_PATTERN matches things
> that aren't really links, and doesn't match things
> that are links. This is what you talk about below,
> and what your JIRA issue covers.
>
> It appears that URL_PATTERN aims to match anything
> that looks like a link on the page, whether in a tag
> or not. Wondering about finding tags, and then
> looking for links in the tags. Seems a little much to
> crawl urls that are plain text and not in a tag, since
> they aren't really links. A simple tag regex in perl
> looks like this /<[^>]+?>/.
Indeed, but you're thinking HTML there. The parsed text can come from
somewhere else: it could be text parsed from a Word document or a PDF
file (you can find references to getOutlinks in the relevant plugins).
Investigating further, I've found that for parse-html the links are
extracted differently: they are returned by
DOMContentUtils.getOutlinks, which is based on Neko. That makes me
wonder how you end up extracting links with OutlinkExtractor instead...
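
To illustrate the difference between the two approaches, here is a
rough sketch. It uses java.util.regex, not ORO or Neko, and the class
name, method names and patterns are all hypothetical; it is not how
DOMContentUtils or OutlinkExtractor are implemented. linksInTags
follows the tag-first idea (find tags with <[^>]+?>, then look for
href/src values inside them), while linksInText scans any text, which
is all you can do for input that has no tags at all:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkSketch {
    // The simple tag regex from the quoted message.
    static final Pattern TAG = Pattern.compile("<[^>]+?>");
    // Assumed attribute pattern: quoted href or src values only.
    static final Pattern ATTR = Pattern.compile(
        "(?:href|src)\\s*=\\s*[\"']([^\"']+)[\"']",
        Pattern.CASE_INSENSITIVE);
    // Assumed, deliberately naive plain-text URL pattern.
    static final Pattern PLAIN = Pattern.compile(
        "https?://[^\\s<>\"']+", Pattern.CASE_INSENSITIVE);

    // Tag-first: only URLs that appear as attribute values in a tag.
    static List<String> linksInTags(String html) {
        List<String> links = new ArrayList<String>();
        Matcher tags = TAG.matcher(html);
        while (tags.find()) {
            Matcher attrs = ATTR.matcher(tags.group());
            while (attrs.find()) {
                links.add(attrs.group(1));
            }
        }
        return links;
    }

    // Text scan: anything that looks like a URL, tag or no tag.
    static List<String> linksInText(String text) {
        List<String> links = new ArrayList<String>();
        Matcher m = PLAIN.matcher(text);
        while (m.find()) {
            links.add(m.group());
        }
        return links;
    }

    public static void main(String[] args) {
        String html =
            "<a href=\"http://a.example/x\">see http://b.example/y</a>";
        System.out.println(linksInTags(html));
        System.out.println(linksInText(html));
        System.out.println(
            linksInText("See http://b.example/doc.pdf for details"));
    }
}
```

The tag-first version finds nothing in text extracted from a Word or
PDF document, which is why a text scan is still needed there.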
Regards,
Sébastien.