Hi Earl,

--- Earl Cahill <[EMAIL PROTECTED]> wrote:

> I know it's minor, but if you later apply
> Perl5Compiler.CASE_INSENSITIVE_MASK, you don't need to
> do a-zA-Z.
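
Agreed. A minimal sketch of that point, using java.util.regex rather than ORO's Perl5Compiler so that it runs on a stock JDK (Pattern.CASE_INSENSITIVE stands in for CASE_INSENSITIVE_MASK here). The scheme pattern is a made-up illustration, not the actual URL_PATTERN:

```java
import java.util.regex.Pattern;

class CaseInsensitiveDemo {
    // Hypothetical pattern for illustration only; not Nutch's URL_PATTERN.
    // With the case-insensitive flag, [a-z] also matches A-Z.
    static final Pattern LOWER_WITH_FLAG =
        Pattern.compile("[a-z]+://\\S+", Pattern.CASE_INSENSITIVE);

    // Equivalent pattern without the flag has to spell out both ranges.
    static final Pattern BOTH_RANGES =
        Pattern.compile("[a-zA-Z]+://\\S+");

    // True when the whole string matches the pattern.
    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String url = "HTTP://Example.COM/Page";
        System.out.println(matches(LOWER_WITH_FLAG, url));
        System.out.println(matches(BOTH_RANGES, url));
    }
}
```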


> For me, there are two groups of issues.
> 1.  URL_PATTERN issues.  URL_PATTERN matches things
> that aren't really links, and doesn't match things
> that are links.  This is what you talk about below,
> and what your JIRA issue covers.
> It appears that URL_PATTERN aims to match anything
> that looks like a link on the page, whether in a tag
> or not.    Wondering about finding tags, and then
> looking for links in the tags.  Seems a little much to
> crawl urls that are plain text and not in a tag, since
> they aren't really links.  A simple tag regex in perl
> looks like this /<[^>]+?>/.

Indeed, but you're thinking HTML there. The parsed text can come from
somewhere else: it could be text parsed from a Word document or a PDF
file (you can find references to getOutlinks in the relevant plugins).
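
To illustrate, here is a rough sketch of what happens when a tag-first approach meets text that never had tags. It uses java.util.regex, your tag regex translated to Java, and a deliberately simplified URL pattern that I made up for the example (it is not the real URL_PATTERN, and the class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlainTextLinks {
    // The tag regex /<[^>]+?>/ from the quoted message, in Java syntax.
    static final Pattern TAG = Pattern.compile("<[^>]+?>");

    // Assumption: a simplified URL pattern for illustration only.
    static final Pattern URL =
        Pattern.compile("https?://[^\\s<>\"']+", Pattern.CASE_INSENSITIVE);

    // Collect every match of the pattern in the text, in order.
    static List<String> findAll(Pattern p, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Text as it might come out of a PDF or Word parser: no markup at all.
        String parsed = "See the manual at http://example.org/manual.pdf for details.";
        System.out.println("tags: " + findAll(TAG, parsed));
        System.out.println("urls: " + findAll(URL, parsed));
    }
}
```

With no tags in the input, a tag-first extractor has nothing to look inside, so the only way to get outlinks from such text is to match the URLs directly.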

Investigating further, I've found that parse-html extracts links
differently: they are returned by DOMContentUtils.getOutlinks, which is
based on Neko. That makes me wonder how you end up extracting links
with OutlinkExtractor instead...


