Re: parse-js as a HtmlParseFilter

Andrzej Bialecki Sat, 30 Dec 2006 02:04:49 -0800

Michael Stack wrote:

The javascript parser will often add the discovered URL as its anchor text (see the linkdb dump below for examples). These URLs-as-anchor-text are tokenized when indexing and then, because anchors by default get a hefty boost at query time, the URL found by the parse-js plugin can show high in search results.
Is adding the URL as anchor intentional? To me it looks like anchor text pollution (or, if not, to be consistent, any time there is empty anchor text, we should just add the inlink URL).

When I initially wrote this plugin I thought that providing at least some anchor text is better than no text at all - but now I don't think so anymore, exactly because of this reason ... So I agree, we should just put an empty String if there's no anchor.
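
A minimal sketch of that change, for illustration only. The class and method names (AnchorText, anchorFor) are hypothetical and are not taken from the parse-js source; the idea is simply that a missing anchor, or an anchor that merely repeats the URL, is stored as an empty String.

```java
public class AnchorText {

    // Hypothetical helper, not part of the actual plugin:
    // decides what anchor text to store for a link found in JavaScript.
    static String anchorFor(String url, String anchor) {
        if (anchor == null) {
            return "";                 // no anchor available: store empty text
        }
        String trimmed = anchor.trim();
        if (trimmed.equals(url)) {
            return "";                 // the URL repeated as its own anchor adds nothing
        }
        return trimmed;
    }

    public static void main(String[] args) {
        System.out.println("[" + anchorFor("http://example.com/a.js", null) + "]");
        System.out.println("[" + anchorFor("http://example.com/a.js", " Download ") + "]");
    }
}
```

With this, the indexer never sees a tokenized URL as anchor text, so the query-time anchor boost cannot promote such links.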

Also, while on the parse-js, I see a lot of ERROR-level logging complaining of malformed URLs (see below for an example); the matches from a regex over javascript content are being passed to java.net.URL for it to figure out whether the javascript substring is a likely URL or not. Should these messages be logged at INFO level instead (or the regex tightened up so that the string passed is more likely to actually be a URL)?
I'll make patches dependent on feedback.

Any improvements here to extract more likely URLs are welcome. I'd rather see them as additional hardcoded rules than an expanded regex, if it makes sense - complex regexes often misbehave on long texts, either hanging or slowing down and consuming 100% CPU.
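
To make the "hardcoded rules" idea concrete, here is a rough sketch, not the plugin's actual code. The class name JsUrlCandidate, the method names, the specific rules and the 2048-character limit are all assumptions chosen for illustration. Each rule is a cheap linear check applied to a regex match before it is handed to java.net.URL, and a malformed URL is treated as an expected outcome (returned as null) rather than an error.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class JsUrlCandidate {

    // Hypothetical pre-filter: cheap, linear-time rules instead of a bigger regex.
    static boolean looksLikeUrl(String s) {
        if (s == null || s.isEmpty() || s.length() > 2048) {
            return false;              // assumed length limit, purely illustrative
        }
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            // Whitespace, quotes and angle brackets suggest code or markup, not a URL.
            if (Character.isWhitespace(c) || c == '"' || c == '\'' || c == '<' || c == '>') {
                return false;
            }
        }
        return s.startsWith("http://") || s.startsWith("https://")
            || s.startsWith("/") || s.startsWith("./") || s.startsWith("../");
    }

    // Resolves a candidate against the page URL; returns null if it is rejected.
    static String resolve(String base, String candidate) {
        if (!looksLikeUrl(candidate)) {
            return null;
        }
        try {
            return new URL(new URL(base), candidate).toString();
        } catch (MalformedURLException e) {
            // Expected for many JavaScript substrings; a caller could log this
            // at a low level (e.g. DEBUG or INFO) instead of ERROR.
            return null;
        }
    }

    public static void main(String[] args) {
        String base = "http://example.com/dir/page.html";
        System.out.println(resolve(base, "/a/b.html"));
        System.out.println(resolve(base, "var x = 1"));
    }
}
```

Because every rule is a single pass over the string, the cost stays proportional to the input length, which avoids the backtracking behaviour that complex regexes can show on long texts.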


--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
