Hi Earl,
--- Earl Cahill <[EMAIL PROTECTED]> wrote:

> I know it's minor, but if you later apply
> Perl5Compiler.CASE_INSENSITIVE_MASK, you don't need to
> do a-zA-Z.

Indeed.

> For me, there are two groups of issues.
>
> 1. URL_PATTERN issues. URL_PATTERN matches things
> that aren't really links, and doesn't match things
> that are links. This is what you talk about below,
> and what your JIRA issue covers.
>
> It appears that URL_PATTERN aims to match anything
> that looks like a link on the page, whether in a tag
> or not. Wondering about finding tags, and then
> looking for links in the tags. Seems a little much to
> crawl urls that are plain text and not in a tag, since
> they aren't really links. A simple tag regex in perl
> looks like this /<[^>]+?>/.

Indeed, but you're thinking HTML there. The parsed text can come from somewhere else: it could be text parsed from a Word document or a PDF file (you can find references to getOutlinks in the relevant plugins).

Investigating further, I've found that for parse-html the links are extracted differently: they are returned by DOMContentUtils.getOutlinks, which is based on Neko. That makes me wonder how you end up extracting links with OutlinkExtractor instead...

Regards,
Sébastien.
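
P.S. To make the difference concrete, here is a rough sketch of the two approaches discussed above: looking for links only inside tags versus matching anything URL-like in the text. It uses plain java.util.regex rather than the ORO Perl5Compiler, and the class name, method names and the URL regex are made up for illustration; this is not Nutch's actual URL_PATTERN or OutlinkExtractor code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; names and patterns here are assumptions, not Nutch code.
class TagLinkSketch {
    // Java equivalent of the Perl tag regex /<[^>]+?>/ quoted above.
    private static final Pattern TAG = Pattern.compile("<[^>]+?>");

    // Hypothetical, deliberately simple URL pattern (NOT Nutch's URL_PATTERN).
    // With the case-insensitive flag, a-z is enough; no need for a-zA-Z.
    private static final Pattern URL =
        Pattern.compile("[a-z][a-z0-9+.-]*://[^\\s\"'<>]+", Pattern.CASE_INSENSITIVE);

    /** Every URL-looking string anywhere in the text, inside a tag or not. */
    static List<String> linksAnywhere(String text) {
        return findAll(URL, text);
    }

    /** Only URL-looking strings that occur inside a <...> tag. */
    static List<String> linksInTags(String text) {
        List<String> links = new ArrayList<>();
        Matcher tags = TAG.matcher(text);
        while (tags.find()) {
            links.addAll(findAll(URL, tags.group()));
        }
        return links;
    }

    // Collects all matches of the pattern in the given string, in order.
    private static List<String> findAll(Pattern p, String s) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(s);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://example.com/a\">A</a> see http://example.com/b";
        String plain = "Extracted from a PDF: http://example.com/c";
        System.out.println(linksInTags(html));
        System.out.println(linksAnywhere(html));
        System.out.println(linksInTags(plain));
        System.out.println(linksAnywhere(plain));
    }
}
```

On text that did not come from HTML (the PDF-style string above), the tag-based method has no tags to look in and finds nothing, which is why a tag-only approach would not suit the non-HTML parsers.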