Re: crawl problems (a bug/patch)

Sébastien LE CALLONNEC Thu, 20 Oct 2005 14:00:40 -0700

Hi Earl,

--- Earl Cahill <[EMAIL PROTECTED]> wrote:

> I know it's minor, but if you later apply
> Perl5Compiler.CASE_INSENSITIVE_MASK, you don't need to
> do a-zA-Z.

Indeed.
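
A minimal sketch of that point, for illustration only. It uses
java.util.regex and Pattern.CASE_INSENSITIVE as a stand-in for ORO's
Perl5Compiler.CASE_INSENSITIVE_MASK, so that it runs without extra jars.
The two patterns and the class name are hypothetical; they are not the
actual URL_PATTERN.

```java
import java.util.regex.Pattern;

class CaseMaskSketch {
    // Hypothetical pattern that spells out both cases by hand.
    static final Pattern EXPLICIT =
        Pattern.compile("[a-zA-Z]+://[a-zA-Z0-9.-]+");

    // Same pattern with lowercase ranges only, plus the case-insensitive flag
    // (the java.util.regex counterpart of a case-insensitive mask).
    static final Pattern MASKED =
        Pattern.compile("[a-z]+://[a-z0-9.-]+", Pattern.CASE_INSENSITIVE);

    // True when both patterns agree on whether the whole string matches.
    static boolean sameResult(String s) {
        return EXPLICIT.matcher(s).matches() == MASKED.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"HTTP://Lucene.Apache.ORG", "http://example.com", "not a url"};
        for (String s : samples) {
            System.out.println(s + " -> masked=" + MASKED.matcher(s).matches()
                + ", agrees=" + sameResult(s));
        }
    }
}
```

With the flag set, the a-z range already covers upper case, so the
explicit A-Z is redundant.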

> 
> For me, there are two groups of issues.
> 
> 1.  URL_PATTERN issues.  URL_PATTERN matches things
> that aren't really links, and doesn't match things
> that are links.  This is what you talk about below,
> and what your JIRA issue covers.
> 
> It appears that URL_PATTERN aims to match anything
> that looks like a link on the page, whether in a tag
> or not.    Wondering about finding tags, and then
> looking for links in the tags.  Seems a little much to
> crawl urls that are plain text and not in a tag, since
> they aren't really links.  A simple tag regex in perl
> looks like this /<[^>]+?>/.

Indeed, but you're thinking HTML there.  The parsed text can come from
somewhere else:  it could be text parsed from a Word document or a PDF
file (you can find references to getOutlinks in the relevant plugins).
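
To make the difference concrete, here is a rough sketch (not Nutch
code) comparing a tag-first extraction with a plain scan of the text.
It uses java.util.regex; the tag regex is the one quoted above, while
the URL regex, the class name and the method names are assumptions made
up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagVsTextSketch {
    // The simple tag regex from the quoted message.
    static final Pattern TAG = Pattern.compile("<[^>]+?>");

    // Hypothetical, deliberately naive URL regex (not the real URL_PATTERN).
    static final Pattern URL =
        Pattern.compile("[a-z]+://[^\\s\"'<>]+", Pattern.CASE_INSENSITIVE);

    // Collects every URL-looking substring anywhere in the text.
    static List<String> urlsIn(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // Finds tags first, then looks for URLs only inside each tag.
    static List<String> urlsInTags(String text) {
        List<String> out = new ArrayList<>();
        Matcher tag = TAG.matcher(text);
        while (tag.find()) {
            out.addAll(urlsIn(tag.group()));
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://lucene.apache.org/nutch/\">Nutch</a>";
        String pdfText = "See http://example.com/report.pdf for details";
        System.out.println(urlsInTags(html));    // URL inside the tag
        System.out.println(urlsInTags(pdfText)); // no tags, so nothing found
        System.out.println(urlsIn(pdfText));     // plain scan still finds it
    }
}
```

On text that has no tags at all (as from a Word or PDF parser), the
tag-first approach finds nothing, whereas scanning the plain text still
picks up the URL.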

Investigating further, I've found that for parse-html the links are
extracted differently: they are returned by
DOMContentUtils.getOutlinks, which is based on Neko.  That makes me
wonder how you end up extracting links with OutlinkExtractor instead...

Regards,
Sébastien.
