[Nutch-general] parse-js as a HtmlParseFilter

Michael Stack Fri, 29 Dec 2006 17:04:54 -0800

The javascript parser will often add the discovered URL as its anchor 
text (see the linkdb dump below for examples).  These URLs-as-anchor-text 
are tokenized when indexing and then, because anchors by default get a 
hefty boost at query time, a URL found by the parse-js plugin can show 
up high in search results.


Is adding the URL as anchor intentional?  To me it looks like anchor 
text pollution (or, if not, then to be consistent, any time there is 
empty anchor text we should just add the inlink URL).
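
To make the alternative concrete, here is a minimal sketch of dropping 
an anchor that only repeats a URL, so that it is stored empty rather 
than tokenized and boosted.  The class and method names are hypothetical 
and are not taken from parse-js or any other Nutch code; where such a 
check would actually live in the plugin is left open.

```java
// Hypothetical helper for illustration only; not part of Nutch.
final class AnchorText {
    private AnchorText() {}

    /**
     * Returns the anchor text to store for a link, or "" when the
     * anchor is missing or merely repeats the source or target URL.
     */
    static String clean(String anchor, String fromUrl, String toUrl) {
        if (anchor == null) {
            return "";
        }
        String trimmed = anchor.trim();
        // An anchor identical to either URL carries no descriptive text.
        if (trimmed.equals(fromUrl) || trimmed.equals(toUrl)) {
            return "";
        }
        return trimmed;
    }
}
```

With something like this, a link whose anchor equals its fromUrl would 
end up with an empty anchor, while ordinary anchor text passes through 
unchanged apart from trimming.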

Also, while on parse-js, I see a lot of ERROR-level logging 
complaining of malformed URLs (see below for an example);  the matches 
from a regex over javascript content are passed to java.net.URL so it 
can decide whether the javascript substring is a likely URL or not.  
Should these messages be logged at INFO level instead (or the regex 
tightened up so that the string passed is more likely to be a URL)?
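
As a rough sketch of both options, the helper below screens out 
candidates with a non-http(s) scheme (such as javascript:) before they 
reach java.net.URL, and logs anything that still fails at INFO rather 
than ERROR.  JsLinkResolver, its methods, and the use of 
java.util.logging are assumptions made to keep the example 
self-contained; this is not the existing getJSLinks code.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Locale;
import java.util.Optional;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of parse-js.
final class JsLinkResolver {
    private static final Logger LOG =
        Logger.getLogger(JsLinkResolver.class.getName());

    // Matches a leading URI scheme such as "http:" or "javascript:".
    private static final Pattern SCHEME =
        Pattern.compile("^([A-Za-z][A-Za-z0-9+.-]*):");

    private JsLinkResolver() {}

    /** True when the candidate names a scheme other than http or https. */
    static boolean hasUnsupportedScheme(String candidate) {
        Matcher m = SCHEME.matcher(candidate);
        if (!m.find()) {
            return false; // no scheme: treat as a relative link
        }
        String scheme = m.group(1).toLowerCase(Locale.ROOT);
        return !(scheme.equals("http") || scheme.equals("https"));
    }

    /** Resolves a regex match against the page URL, or returns empty. */
    static Optional<URL> resolve(String base, String candidate) {
        String s = candidate == null ? "" : candidate.trim();
        if (s.isEmpty() || hasUnsupportedScheme(s)) {
            LOG.log(Level.INFO, "Skipping non-URL javascript match: {0}", s);
            return Optional.empty();
        }
        try {
            return Optional.of(new URL(new URL(base), s));
        } catch (MalformedURLException e) {
            // Expected for many regex matches, so not worth ERROR level.
            LOG.log(Level.INFO, "Malformed URL in javascript: {0}", s);
            return Optional.empty();
        }
    }
}
```

The scheme check is deliberately cheap; it does not replace tightening 
the regex itself, it only keeps the obvious non-URLs from reaching the 
java.net.URL constructor and throwing.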

I'll make patches dependent on feedback.
Thanks all,
St.Ack


Here is an example of links as anchors from a linkdb dump:

c=test,u=https://www2.westpac.com.au/emarket/check_merch.cfm?id=900030  Inlinks:
 fromUrl: http://premier.ticketek.com.au/content/buyers/buyers_step1.aspx anchor: http://premier.ticketek.com.au/content/buyers/buyers_step1.aspx
 fromUrl: http://premier.ticketek.com.au/content/outlets/agencies_qld.aspx anchor: http://premier.ticketek.com.au/content/outlets/agencies_qld.aspx
 fromUrl: http://premier.ticketek.com.au/shows/show.aspx?sh=TSSWANS05 anchor: http://premier.ticketek.com.au/shows/show.aspx?sh=TSSWANS05
 fromUrl: http://premier.ticketek.com.au/content/outlets/agencies_vic.aspx anchor: http://premier.ticketek.com.au/content/outlets/agencies_vic.aspx
 fromUrl: http://premier.ticketek.com.au/Venues/VenueDetails.aspx?v=NMO&s=6547 anchor: http://premier.ticketek.com.au/Venues/VenueDetails.aspx?v=NMO&s=6547
 fromUrl: http://premier.ticketek.com.au/content/buyers/buyers_step5.aspx anchor: http://premier.ticketek.com.au/content/buyers/buyers_step5.aspx
 fromUrl: http://premier.ticketek.com.au/content/outlets/agencies_nsw.aspx anchor: http://premier.ticketek.com.au/content/outlets/agencies_nsw.aspx


Here are the exceptions I see:

06/12/29 14:17:14 ERROR js.JSParseFilter: getJSLinks
java.net.MalformedURLException: unknown protocol: javascript
        at java.net.URL.<init>(URL.java:574)
        at java.net.URL.<init>(URL.java:464)
        at org.apache.nutch.parse.js.JSParseFilter.getJSLinks(JSParseFilter.java:215)
        at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:109)
        at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:139)
        at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:139)
        at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:139)
        at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:139)
        at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:139)
        at org.apache.nutch.parse.js.JSParseFilter.walk(JSParseFilter.java:139)
        at org.apache.nutch.parse.js.JSParseFilter.filter(JSParseFilter.java:73)
        at org.apache.nutch.parse.HtmlParseFilters.filter(HtmlParseFilters.java:62)
        at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:220)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:84)
        at org.archive.access.nutch.ImportArcs.processRecord(ImportArcs.java:579)
        at org.archive.access.nutch.ImportArcs$IndexingThread.run(ImportArcs.java:375)


