I'm still tracking down a solution, but my problems appear to be parsing-related.
My page has this tag:

<div class="content" id="content" style="display:none;">

The div starts out hidden, and then JavaScript brings in a template and displays the div. I think this is totally legitimate, and Yahoo!, Google and MSN all seem to agree.

For whatever reason, Nutch extracts display:none as a URL. I am still digging, but haven't figured that part out. display:none gets passed to org.apache.nutch.net.BasicUrlNormalizer, where this line

URL url = new URL(urlString);

appears to throw a MalformedURLException, likely because display:none isn't much of a URL. After this, no other links on the page get processed. The try is around extracting links for the whole page, so as soon as an exception is thrown, link extraction stops. This seems a little harsh, especially since Nutch seems perhaps a little naive here.

I propose wrapping each call to outlinks.add(new Outlink(url, anchor)) in its own try/catch. Then if there is a problem with any single URL, parsing continues. The patch below does just that. Many more links on my page get processed, but Nutch still doesn't find <a href=/sitemap.html>browse</a> and I am not sure why.

This little patch seems like a pretty huge deal, and I really can't believe that no one else has discovered it. One "bad" link and the rest of the page gets thrown away? If nothing else, doesn't anyone else use styles? It seems like any page with any div whose style attribute isn't a real link would have the same result. Maybe the thinking was that if a page has a bad link, that is reason enough to skip ahead. I could buy that a whole lot more if the parsing were more mature.

I just looked, and the mapreduce branch has the exact same code, so the patch should work for both.

So, three open questions:

1. Why doesn't my link (<a href=/sitemap.html>browse</a>) get parsed?
2. Why does my style get followed?
3. Where do I look for a list of all the failed links?
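To illustrate the failure mode described above outside of Nutch, here is a minimal standalone sketch. It is not Nutch code: the class OutlinkSketch, its method names, and the example.com URLs are hypothetical, and the only real API it relies on is java.net.URL, whose constructor rejects display:none because "display" is not a known protocol. It compares one try around the whole loop with one try per candidate.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Nutch.
class OutlinkSketch {

    // Validate a candidate by constructing a java.net.URL, as the post
    // says BasicUrlNormalizer does. (new URL(String) is deprecated on
    // recent JDKs but still behaves the same way.)
    @SuppressWarnings("deprecation")
    static String validate(String candidate) throws MalformedURLException {
        return new URL(candidate).toString();
    }

    // One try around the whole loop: the first bad candidate ends extraction.
    static List<String> extractAllOrNothing(List<String> candidates) {
        List<String> outlinks = new ArrayList<>();
        try {
            for (String candidate : candidates) {
                outlinks.add(validate(candidate));
            }
        } catch (MalformedURLException ex) {
            // Everything after the bad candidate is silently dropped.
        }
        return outlinks;
    }

    // One try per candidate: a bad candidate is skipped and the loop continues.
    static List<String> extractPerLink(List<String> candidates) {
        List<String> outlinks = new ArrayList<>();
        for (String candidate : candidates) {
            try {
                outlinks.add(validate(candidate));
            } catch (MalformedURLException ex) {
                // Skip just this candidate.
            }
        }
        return outlinks;
    }

    public static void main(String[] args) {
        List<String> candidates = List.of(
            "http://example.com/a.html",
            "display:none",
            "http://example.com/b.html");
        // prints [http://example.com/a.html]
        System.out.println(extractAllOrNothing(candidates));
        // prints [http://example.com/a.html, http://example.com/b.html]
        System.out.println(extractPerLink(candidates));
    }
}
```

The second method has the same shape as the proposed patch: the link after display:none survives because the exception is caught inside the loop rather than around it.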
Thanks,
Earl

Index: src/java/org/apache/nutch/parse/OutlinkExtractor.java
===================================================================
--- src/java/org/apache/nutch/parse/OutlinkExtractor.java	(revision 326762)
+++ src/java/org/apache/nutch/parse/OutlinkExtractor.java	(working copy)
@@ -97,7 +97,11 @@
       while (matcher.contains(input, pattern)) {
         result = matcher.getMatch();
         url = result.group(0);
-        outlinks.add(new Outlink(url, anchor));
+        try {
+          outlinks.add(new Outlink(url, anchor));
+        } catch (Exception ex) {
+          LOG.throwing(OutlinkExtractor.class.getName(), "getOutlinks", ex);
+        }
       }
     } catch (Exception ex) {
       // if it is a malformed URL we just throw it away and continue with