Re: crawl problems (a bug/patch)

Earl Cahill Thu, 20 Oct 2005 11:07:15 -0700

Still tracking down a solution, but my problems appear
to be parsing-based.


My page has this tag

<div class="content" id="content"
style="display:none;">

The div starts out hidden, and then JavaScript
brings in a template and displays the div.  I think
this is totally legitimate, and Yahoo!, Google and MSN
all seem to agree.

For whatever reason, nutch extracts display:none as a
URL.  I am still digging, but haven't figured that part
out.

display:none gets passed to
org.apache.nutch.net.BasicUrlNormalizer, where this
line

URL url = new URL(urlString);

appears to throw a MalformedURLException, likely
because display:none isn't much of a URL.
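As a quick sanity check outside of nutch, a standalone sketch like the following shows how java.net.URL treats that string. The class and method names (DisplayNoneUrlCheck, parses) are made up for illustration and are not part of nutch; the assumption is only that the normalizer ends up calling new URL(urlString) as quoted above.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class DisplayNoneUrlCheck {

    /**
     * Hypothetical helper: returns true if java.net.URL accepts the string,
     * false if the constructor throws MalformedURLException.
     */
    static boolean parses(String urlString) {
        try {
            new URL(urlString);
            return true;
        } catch (MalformedURLException e) {
            // "display" is read as a protocol name, and no handler exists for it
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("display:none parses? " + parses("display:none"));
        System.out.println("http://example.com/sitemap.html parses? "
                + parses("http://example.com/sitemap.html"));
    }
}
```

The text before the colon is taken as the protocol, so the string fails for the same reason any unknown scheme would.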

After this, no other links get processed on the page. 


The try block wraps link extraction for the whole page,
so as soon as one exception is thrown, link extraction
stops.  This seems a little harsh, especially since
nutch seems perhaps a little naive here.  I propose
wrapping each call to outlinks.add(new Outlink(url,
anchor)) in its own try/catch.  Then if there is a
problem with any single url, parsing continues.  The
patch below does such a thing.
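To illustrate the idea in isolation, here is a minimal standalone sketch of the per-link try/catch. It is not nutch code: java.net.URL stands in for nutch's Outlink, and the names PerLinkTryDemo and collectValid are hypothetical.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PerLinkTryDemo {

    /**
     * Hypothetical stand-in for the extraction loop: each candidate is
     * converted inside its own try block, so one bad string is skipped
     * and the remaining candidates are still processed.
     */
    static List<URL> collectValid(List<String> candidates) {
        List<URL> outlinks = new ArrayList<>();
        for (String candidate : candidates) {
            try {
                outlinks.add(new URL(candidate));
            } catch (MalformedURLException ex) {
                // report the bad link and keep going with the next one
                System.err.println("skipping bad link: " + candidate);
            }
        }
        return outlinks;
    }

    public static void main(String[] args) {
        List<String> found = Arrays.asList(
                "http://example.com/a",
                "display:none",
                "http://example.com/b");
        System.out.println(collectValid(found));
    }
}
```

With the try around the whole loop instead, the second entry would end processing and the third link would never be seen.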

Many more links on my page get processed, but nutch
still doesn't find

<a href=/sitemap.html>browse</a>

and I am not sure why.

This little patch seems like a pretty huge deal, and I
really can't believe that no one else has discovered
it.  One "bad" link and the rest of the page gets
thrown away?  If nothing else, doesn't anyone else use
styles?  It seems like any page with any div, with any
style attribute that isn't a real link would have the
same result.

Maybe the thinking was that if a page has a bad link,
that is reason enough to skip ahead.  I could buy
that a whole lot more if the parsing were more mature.

I just looked and the mapreduce branch has the exact
same code, so the patch should work for both.

So, three open questions

1.  Why doesn't my link (<a
href=/sitemap.html>browse</a>) get parsed?
2.  Why does my style get followed?
3.  Where do I look for a list of all the failed
links?

Thanks,
Earl

Index: src/java/org/apache/nutch/parse/OutlinkExtractor.java
===================================================================
--- src/java/org/apache/nutch/parse/OutlinkExtractor.java   (revision 326762)
+++ src/java/org/apache/nutch/parse/OutlinkExtractor.java   (working copy)
@@ -97,7 +97,11 @@
       while (matcher.contains(input, pattern)) {
         result = matcher.getMatch();
         url = result.group(0);
-        outlinks.add(new Outlink(url, anchor));
+        try {
+          outlinks.add(new Outlink(url, anchor));
+        } catch (Exception ex) {
+          LOG.throwing(OutlinkExtractor.class.getName(), "getOutlinks", ex);
+        }
       }
     } catch (Exception ex) {
       // if it is a malformed URL we just throw it away and continue with



                