Michael Stack wrote:
> The javascript parser will often add the discovered URL as its anchor 
> text (See below linkdb dump for examples).  These urls-as-anchor text 
> are tokenized when indexing and then, because anchors by default get a 
> hefty boost at query time, the URL-found-by-the-parse-js-plugin can 
> show high in search results.
>
> Is adding the URL as anchor intentional?  To me it looks like anchor 
> text pollution (or, if not, to be consistent, anytime there is empty 
> anchor text, we should just add inlink URL).

When I initially wrote this plugin, I thought that providing at least 
some anchor text was better than no text at all. I no longer think so, 
for exactly this reason. So I agree: we should just use an empty String 
when there's no anchor.
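
To make the proposed behaviour concrete, here is a minimal sketch of the fallback. It is not the plugin's actual code: the class name AnchorTextSketch and the helper anchorOrEmpty are hypothetical, and the only point is that a missing or blank anchor becomes an empty String instead of the discovered URL.

```java
// Hypothetical sketch, not the real parse-js plugin code.
class AnchorTextSketch {

    /**
     * Returns the anchor text to store for a discovered link.
     * A missing or blank anchor yields "" rather than the link's URL,
     * so the URL is never tokenized and boosted as anchor text.
     */
    static String anchorOrEmpty(String anchorText) {
        if (anchorText == null) {
            return "";
        }
        return anchorText.trim();
    }

    public static void main(String[] args) {
        // A link found in JavaScript usually has no anchor text at all.
        System.out.println("[" + anchorOrEmpty(null) + "]");
        System.out.println("[" + anchorOrEmpty("  Next page ") + "]");
    }
}
```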


>
> Also, while on the parse-js, I see a lot of ERROR-level logging 
> complaining of malformed URLs (See below for example);  the matches 
> from a regex over javascript content are being passed to java.net.URL 
> for it to figure whether the javascript substring is a likely URL or 
> not.  Should these messages be logged at INFO level instead (or the 
> regex tightened up so the string passed is more likely to be a URL)?
>
> I'll make patches dependent on feedback.

Any improvements here that extract more likely URLs are welcome. Where 
it makes sense, I'd rather see them as additional hardcoded rules than 
as an expanded regex: complex regexes often misbehave on long texts, 
either hanging or slowing down and consuming 100% CPU.
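
As an illustration of what "hardcoded rules" could look like, here is a rough sketch of a cheap pre-filter applied to each regex match before it reaches java.net.URL. Everything in it is an assumption made for the example: the class name JsUrlFilterSketch, the length cap, the rejected characters, and the restriction to absolute http/https/ftp URLs (relative URLs would need a base URL, which is left out here).

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Locale;

// Hypothetical sketch, not the real parse-js plugin code.
class JsUrlFilterSketch {

    // Assumed upper bound; longer matches are treated as script noise.
    private static final int MAX_LEN = 2048;

    /** Cheap linear-time checks that need no regex. */
    static boolean looksLikeUrl(String s) {
        if (s == null || s.isEmpty() || s.length() > MAX_LEN) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            // Whitespace, quotes and angle brackets suggest a code fragment.
            if (Character.isWhitespace(c) || c == '"' || c == '\''
                    || c == '<' || c == '>') {
                return false;
            }
        }
        String lower = s.toLowerCase(Locale.ROOT);
        return lower.startsWith("http://")
                || lower.startsWith("https://")
                || lower.startsWith("ftp://");
    }

    /** Returns a parsed URL, or null when the candidate is rejected. */
    static URL toUrlOrNull(String candidate) {
        if (!looksLikeUrl(candidate)) {
            return null; // rejected by the rules; nothing to log as an error
        }
        try {
            return new URL(candidate);
        } catch (MalformedURLException e) {
            return null; // still malformed; a caller could log this at a low level
        }
    }

    public static void main(String[] args) {
        System.out.println(toUrlOrNull("http://example.com/a.html"));
        System.out.println(toUrlOrNull("var x = 1;"));
    }
}
```

Because each rule is a single pass over the string, the cost stays proportional to the length of the match, which avoids the backtracking behaviour that makes complex regexes risky on long texts.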

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
