Re: crawl problems (a bug/patch)

Earl Cahill Thu, 20 Oct 2005 11:09:42 -0700

Should I submit this through JIRA?

Earl


--- Earl Cahill <[EMAIL PROTECTED]> wrote:

> Still tracking down a solution, but my problems appear
> to be parsing-based.
> 
> My page has this tag
> 
> <div class="content" id="content"
> style="display:none;">
> 
> The div starts out hidden, and then JavaScript brings in
> a template and displays the div.  I think this is totally
> legitimate, and Yahoo!, Google and MSN all seem to agree.
> 
> For whatever reason, nutch extracts display:none as a
> url.  I am still digging, but haven't figured that part
> out.
> 
> display:none gets passed to
> org.apache.nutch.net.BasicUrlNormalizer, where this
> line
> 
> URL url = new URL(urlString);
> 
> appears to throw a MalformedURLException, likely
> because display:none isn't much of a URL.
> 
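The exception can be reproduced outside Nutch. The sketch below is a standalone check, not Nutch code; the class name UrlCheck and the helper isParsableUrl are made up for illustration. It relies only on java.net.URL rejecting a string whose scheme ("display" here) has no registered protocol handler.

```java
import java.net.MalformedURLException;
import java.net.URL;

class UrlCheck {
    // Hypothetical helper: true if java.net.URL accepts the string
    // as an absolute URL, false if it throws MalformedURLException.
    static boolean isParsableUrl(String s) {
        try {
            new URL(s);   // parses only, no network access
            return true;
        } catch (MalformedURLException e) {
            return false; // e.g. "unknown protocol" for display:none
        }
    }

    public static void main(String[] args) {
        System.out.println(isParsableUrl("display:none"));
        System.out.println(isParsableUrl("http://example.com/sitemap.html"));
    }
}
```

On a stock JVM this prints false for display:none and true for the http URL.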
> After this, no other links get processed on the page.
> 
> The try is around extracting links for the whole page,
> and as soon as an exception is thrown, the link
> extraction stops.  This seems a little harsh,
> especially since nutch seems perhaps a little naive
> here.  I propose wrapping each call to
> outlinks.add(new Outlink(url, anchor)) in its own try.
> Then if there is a problem with any single url, parsing
> continues.  The patch below does such a thing.
> 
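To illustrate the difference between the two try placements, here is a minimal standalone sketch. It is not the Nutch code: PerUrlTry, wholeLoopTry and perCandidateTry are hypothetical names, and java.net.URL stands in for the URL check that happens when an Outlink is constructed.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PerUrlTry {
    // One try around the whole loop: the first bad candidate
    // ends extraction, so later good candidates are lost.
    static List<String> wholeLoopTry(List<String> candidates) {
        List<String> out = new ArrayList<>();
        try {
            for (String c : candidates) {
                new URL(c); // throws on a malformed candidate
                out.add(c);
            }
        } catch (MalformedURLException e) {
            // loop already abandoned at this point
        }
        return out;
    }

    // One try per candidate: a bad candidate is skipped
    // and the loop carries on with the next one.
    static List<String> perCandidateTry(List<String> candidates) {
        List<String> out = new ArrayList<>();
        for (String c : candidates) {
            try {
                new URL(c);
                out.add(c);
            } catch (MalformedURLException e) {
                // skip just this candidate
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> page = Arrays.asList(
            "http://example.com/a", "display:none", "http://example.com/b");
        System.out.println(wholeLoopTry(page));    // [http://example.com/a]
        System.out.println(perCandidateTry(page)); // [http://example.com/a, http://example.com/b]
    }
}
```

With the bad candidate in the middle, the first variant keeps one link and the second keeps two, which is the behaviour change the patch is after.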
> Many more links on my page get processed, but nutch
> still doesn't find
> 
> <a href=/sitemap.html>browse</a>
> 
> and I am not sure why.
> 
> This little patch seems like a pretty huge deal, and I
> really can't believe that no one else has discovered
> it.  One "bad" link and the rest of the page gets
> thrown away?  If nothing else, doesn't anyone else use
> styles?  It seems like any page with any div whose style
> attribute isn't a real link would have the same result.
> 
> Maybe the thinking was that if a page has a bad link,
> that is reason enough to skip ahead.  I could buy that
> a whole lot more if the parsing were more mature.
> 
> I just looked and the mapreduce branch has the exact
> same code, so, the patch should work for both.
> 
> So, three open questions
> 
> 1.  Why doesn't my link (<a href=/sitemap.html>browse</a>)
>     get parsed?
> 2.  Why does my style get followed?
> 3.  Where do I look for a list of all the failed links?
> 
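On questions 1 and 2, one guess that has not been verified against the Nutch source: if the extractor finds links by scanning text for anything shaped like scheme:rest, then display:none qualifies (it looks like a URL with scheme "display") while /sitemap.html does not (it has no scheme at all). The sketch below uses a deliberately simplified, hypothetical pattern, not the one in OutlinkExtractor, just to show the effect; SchemeScan and findCandidates are made-up names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SchemeScan {
    // Hypothetical, simplified "scheme:rest" pattern.
    // This is an assumption for illustration, NOT Nutch's pattern.
    static final Pattern CANDIDATE = Pattern.compile(
        "[A-Za-z][A-Za-z0-9+.-]*:[A-Za-z0-9/][A-Za-z0-9/._~%?&=#-]*");

    // Returns every substring of the text that looks like scheme:rest.
    static List<String> findCandidates(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = CANDIDATE.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String div = "<div class=\"content\" id=\"content\" style=\"display:none;\">";
        String link = "<a href=/sitemap.html>browse</a>";
        System.out.println(findCandidates(div));  // [display:none]
        System.out.println(findCandidates(link)); // []
    }
}
```

If something like this is what happens, a scheme-based scan would explain both symptoms at once, and relative links would need to be picked up by an HTML-aware parser that resolves href values against the page URL.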
> Thanks,
> Earl
> 
> Index: src/java/org/apache/nutch/parse/OutlinkExtractor.java
> ===================================================================
> --- src/java/org/apache/nutch/parse/OutlinkExtractor.java	(revision 326762)
> +++ src/java/org/apache/nutch/parse/OutlinkExtractor.java	(working copy)
> @@ -97,7 +97,11 @@
>        while (matcher.contains(input, pattern)) {
>          result = matcher.getMatch();
>          url = result.group(0);
> -        outlinks.add(new Outlink(url, anchor));
> +        try {
> +          outlinks.add(new Outlink(url, anchor));
> +        } catch (Exception ex) {
> +          LOG.throwing(OutlinkExtractor.class.getName(), "getOutlinks", ex);
> +        }
>        }
>      } catch (Exception ex) {
>        // if it is a malformed URL we just throw it away and continue with



        
                