The JavaScript parser (parse-js) will often add the discovered URL as its anchor text (see the linkdb dump below for examples). These URLs-as-anchor-text are tokenized when indexing, and because anchors get a hefty boost at query time by default, a URL found by the parse-js plugin can show up high in search results.
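To make the tokenizing point concrete, here is a rough sketch of how one of the anchors from the dump breaks into searchable terms. It assumes a plain lowercase, split-on-non-alphanumerics tokenizer, which only approximates whatever analyzer is actually configured; the class and method names are made up for illustration and are not Nutch code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical illustration only; not the analyzer Nutch actually uses. */
final class UrlAnchorTokens {

  private UrlAnchorTokens() {}

  /** Lowercases the anchor and splits it on every run of non-alphanumeric characters. */
  static List<String> tokenize(String anchor) {
    List<String> tokens = new ArrayList<>();
    for (String t : anchor.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
      if (!t.isEmpty()) { // split() can yield an empty leading token
        tokens.add(t);
      }
    }
    return tokens;
  }

  public static void main(String[] args) {
    // An anchor taken from the linkdb dump in this message.
    System.out.println(
        tokenize("http://premier.ticketek.com.au/shows/show.aspx?sh=TSSWANS05"));
  }
}
```

Under that assumption, every host and path fragment of the anchor becomes a boosted term for the target page, so a query on any of those fragments can match it.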
Is adding the URL as anchor text intentional? To me it looks like anchor text pollution (or, if it is not, then to be consistent we should add the inlink URL any time the anchor text is empty).

Also, while on parse-js: I see a lot of ERROR-level logging complaining of malformed URLs (see below for an example). The matches from a regex over JavaScript content are being passed to java.net.URL for it to figure out whether the JavaScript substring is a likely URL or not. Should these messages be logged at INFO level instead, or should the regex be tightened up so that the string passed is more likely to actually be a URL?

I'll make patches dependent on feedback.

Thanks all,
St.Ack

Here is an example of links as anchors from a linkdb dump:

c=test,u=https://www2.westpac.com.au/emarket/check_merch.cfm?id=900030
Inlinks:
 fromUrl: http://premier.ticketek.com.au/content/buyers/buyers_step1.aspx anchor: http://premier.ticketek.com.au/content/buyers/buyers_step1.aspx
 fromUrl: http://premier.ticketek.com.au/content/outlets/agencies_qld.aspx anchor: http://premier.ticketek.com.au/content/outlets/agencies_qld.aspx
 fromUrl: http://premier.ticketek.com.au/shows/show.aspx?sh=TSSWANS05 anchor: http://premier.ticketek.com.au/shows/show.aspx?sh=TSSWANS05
 fromUrl: http://premier.ticketek.com.au/content/outlets/agencies_vic.aspx anchor: http://premier.ticketek.com.au/content/outlets/agencies_vic.aspx
 fromUrl: http://premier.ticketek.com.au/Venues/VenueDetails.aspx?v=NMO&s=6547 anchor: http://premier.ticketek.com.au/Venues/VenueDetails.aspx?v=NMO&s=6547
 fromUrl: http://premier.ticketek.com.au/content/buyers/buyers_step5.aspx anchor: http://premier.ticketek.com.au/content/buyers/buyers_step5.aspx
 fromUrl: http://premier.ticketek.com.au/content/outlets/agencies_nsw.aspx anchor: http://premier.ticketek.com.au/content/outlets/agencies_nsw.aspx

Here are the exceptions I see:

06/12/29 14:17:14 ERROR js.JSParseFilter: getJSLinks
java.net.MalformedURLException: unknown protocol: javascript
        at java.net.URL.<init>(URL.java:574)
        at java.net.URL.<init>(URL.java:464)
        at org.apache.nutch.parse.js.JSParseFilter.getJSLinks(JSParseFilter.java:215)
        at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:109)
        at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:139)
        at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:139)
        at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:139)
        at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:139)
        at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:139)
        at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:139)
        at org.apache.nutch.parse.js.JSParseFilter.filter(JSParseFilter.java:73)
        at org.apache.nutch.parse.HtmlParseFilters.filter(HtmlParseFilters.java:62)
        at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:220)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:84)
        at org.archive.access.nutch.ImportArcs.processRecord(ImportArcs.java:579)
        at org.archive.access.nutch.ImportArcs$IndexingThread.run(ImportArcs.java:375)
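As a starting point for discussion of the second question, here is a minimal sketch of screening regex matches before they reach java.net.URL, so that strings such as "javascript:..." are dropped quietly instead of surfacing as ERROR-level stack traces. This is not the existing JSParseFilter code: the class name JsLinkScreen, the method toUrl, and the choice to accept only http and https schemes are all assumptions made for illustration.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Nutch: pre-screens candidate link strings. */
final class JsLinkScreen {

  // A leading URI scheme such as "http:" or "javascript:" (scheme grammar per RFC 3986).
  private static final Pattern SCHEME = Pattern.compile("^([A-Za-z][A-Za-z0-9+.-]*):");

  private JsLinkScreen() {}

  /**
   * Resolves a candidate against the page's base URL.
   * Returns empty when the candidate is blank, carries a scheme other than
   * http/https (an assumed whitelist), or cannot be parsed.
   */
  static Optional<URL> toUrl(String base, String candidate) {
    if (base == null || candidate == null) {
      return Optional.empty();
    }
    String spec = candidate.trim();
    if (spec.isEmpty()) {
      return Optional.empty();
    }
    Matcher m = SCHEME.matcher(spec);
    if (m.find()) {
      String scheme = m.group(1).toLowerCase(Locale.ROOT);
      if (!scheme.equals("http") && !scheme.equals("https")) {
        return Optional.empty(); // e.g. "javascript:void(0)" never reaches java.net.URL
      }
    }
    try {
      return Optional.of(new URL(new URL(base), spec));
    } catch (MalformedURLException e) {
      // Expected for regex false positives; a caller could log this at INFO or DEBUG.
      return Optional.empty();
    }
  }
}
```

With a screen like this in front of the URL constructor, the "unknown protocol: javascript" case above would be filtered out before any exception is raised, and whatever still fails to parse could be logged at a lower level.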
