Should I submit this through JIRA?
Earl
--- Earl Cahill <[EMAIL PROTECTED]> wrote:
> Still tracking down a solution, but my problems appear to be
> parsing based.
>
> My page has this tag
>
> <div class="content" id="content" style="display:none;">
>
> The div starts out hidden, and then javascript brings in a
> template and displays the div. I think this is totally
> legitimate, and yahoo!, google and msn all seem to agree.
>
> For whatever reason, nutch extracts display:none as a url. I am
> still digging, but haven't figured that part out.
>
> display:none gets passed to
> org.apache.nutch.net.BasicUrlNormalizer, where this line
>
> URL url = new URL(urlString);
>
> appears to throw a MalformedURLException, likely because
> display:none isn't much of a URL.
>
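To make the failure above concrete, here is a minimal standalone sketch. It does not touch nutch at all; the class and method names are made up for illustration. It only shows that java.net.URL rejects display:none (there is no handler for a "display" protocol) while accepting an ordinary http URL. The example.com address is a placeholder.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class DisplayNoneUrlCheck {
    // Hypothetical helper: true if java.net.URL accepts the string,
    // false if the constructor throws MalformedURLException.
    static boolean parsesAsUrl(String candidate) {
        try {
            new URL(candidate);
            return true;
        } catch (MalformedURLException e) {
            System.out.println("rejected '" + candidate + "': " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(parsesAsUrl("display:none"));
        System.out.println(parsesAsUrl("http://example.com/sitemap.html"));
    }
}
```

Note that newer JDKs mark the URL(String) constructor as deprecated; it is used here only because it is the call quoted above.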
> After this, no other links get processed on the page.
>
> The try is around extracting links for the whole page, and as
> soon as an exception is thrown, the link extraction stops. This
> seems a little harsh, especially since nutch seems perhaps a
> little naive here. I propose wrapping each call to
> outlinks.add(new Outlink(url, anchor)) in its own try. Then if
> there is a problem with any single url, parsing continues. The
> patch below does such a thing.
>
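The idea of one try per link rather than one try around the whole loop can be sketched outside nutch like this. Everything here is an assumption for illustration: the Result class, the collect method and the example.com URLs are invented, and java.net.URL stands in for Outlink. It also keeps the rejected strings in a list, which is one possible way to get at the failed links.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PerLinkTryCatch {
    // Hypothetical stand-in for the extractor's output:
    // the links that parsed, plus the strings that did not.
    static final class Result {
        final List<URL> outlinks = new ArrayList<>();
        final List<String> rejected = new ArrayList<>();
    }

    static Result collect(List<String> candidates) {
        Result r = new Result();
        for (String candidate : candidates) {
            // The try sits inside the loop, so a bad candidate only
            // costs us that one entry, not the rest of the page.
            try {
                r.outlinks.add(new URL(candidate));
            } catch (MalformedURLException e) {
                r.rejected.add(candidate);
            }
        }
        return r;
    }

    public static void main(String[] args) {
        Result r = collect(Arrays.asList(
            "http://example.com/a", "display:none", "http://example.com/b"));
        System.out.println("kept: " + r.outlinks);
        System.out.println("rejected: " + r.rejected);
    }
}
```

With the try outside the loop, the second entry would end the loop and the third link would never be seen.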
> With the patch, many more links on my page get processed, but
> nutch still doesn't find
>
> <a href=/sitemap.html>browse</a>
>
> and I am not sure why.
>
> This little patch seems like a pretty huge deal, and I really
> can't believe that no one else has discovered it. One "bad" link
> and the rest of the page gets thrown away? If nothing else,
> doesn't anyone else use styles? It seems like any page with any
> div with a style attribute that isn't a real link would have the
> same result.
>
> Maybe the thinking was that if a page has a bad link, that is
> reason enough to skip ahead. I could buy that a whole lot more
> if the parsing were more mature.
>
> I just looked and the mapreduce branch has the exact same code,
> so the patch should work for both.
>
> So, three open questions:
>
> 1. Why doesn't my link (<a href=/sitemap.html>browse</a>) get parsed?
> 2. Why does my style get followed?
> 3. Where do I look for a list of all the failed links?
>
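One possible explanation for questions 1 and 2, offered as a guess rather than a diagnosis: if the links are being pulled out of the raw text with a "scheme:rest" style regular expression, then display:none looks like a URL and a relative href with no scheme does not. The pattern below is hypothetical and is not copied from nutch; the class and method names are invented. It only shows that such a pattern would behave the way the page above does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SchemePatternDemo {
    // HYPOTHETICAL pattern, not taken from nutch: anything shaped like
    // "scheme:rest", with no check that the scheme is a real protocol.
    static final Pattern SCHEME_LIKE =
        Pattern.compile("[A-Za-z][A-Za-z0-9+.-]*:[A-Za-z0-9/][A-Za-z0-9/._~%-]*");

    // Returns every substring of the text that the pattern matches.
    static List<String> findCandidates(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = SCHEME_LIKE.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String page = "<div class=\"content\" id=\"content\" style=\"display:none;\">"
            + "<a href=/sitemap.html>browse</a>";
        // Prints [display:none]; /sitemap.html has no scheme, so it is not matched.
        System.out.println(findCandidates(page));
    }
}
```

If that guess is right, the fix for the relative link would have to come from an HTML-aware parser that reads href attributes and resolves them against the page's base URL, not from the text pattern.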
> Thanks,
> Earl
>
> Index: src/java/org/apache/nutch/parse/OutlinkExtractor.java
> ===================================================================
> --- src/java/org/apache/nutch/parse/OutlinkExtractor.java (revision 326762)
> +++ src/java/org/apache/nutch/parse/OutlinkExtractor.java (working copy)
> @@ -97,7 +97,11 @@
>        while (matcher.contains(input, pattern)) {
>          result = matcher.getMatch();
>          url = result.group(0);
> -        outlinks.add(new Outlink(url, anchor));
> +        try {
> +          outlinks.add(new Outlink(url, anchor));
> +        } catch (Exception ex) {
> +          LOG.throwing(OutlinkExtractor.class.getName(), "getOutlinks", ex);
> +        }
>        }
>      } catch (Exception ex) {
>        // if it is a malformed URL we just throw it away and continue with