Re: crawl problems (a bug/patch)
Hi Sébastien,

Yahoo! just hosed my message; glad I had it elsewhere.

> As you probably saw in the OutlinkExtractor class, the links are extracted with a regexp.

Ahh, I didn't see it before, but I now see URL_PATTERN. I know it's minor, but if you later apply Perl5Compiler.CASE_INSENSITIVE_MASK, you don't need to spell out a-zA-Z.

For me, there are two groups of issues.

1. URL_PATTERN issues. URL_PATTERN matches things that aren't really links, and doesn't match things that are links. This is what you talk about below, and what your JIRA issue covers. It appears that URL_PATTERN aims to match anything that looks like a link on the page, whether in a tag or not. I'm wondering about finding tags first, and then looking for links inside the tags. It seems a little much to crawl URLs that are plain text and not in a tag, since they aren't really links. A simple tag regex in Perl looks like this:

  /<[^>]+?>/

Extracting key/value pairs is a little harder, but given $key, the second capture from this regex returns the value:

  m/(?<!\w|\.)\Q$key\E\s*=(["']?)(|.*?[^\\])\1(\s|>)/is

This regex has gone through some rather rigorous testing/use. Getting $key interpolated into the regex seems a little harder in Java, but it seems we would mostly be looking for href and src, and maybe not even src. The regex for href looks like this:

  m/(?<!\w|\.)href\s*=(["']?)(|.*?[^\\])\1(\s|>)/is

If you commit your TestPattern.java stuff, I would be happy to add cases and play with the regexes until all the cases work.

2. If a link is encountered that throws an exception anywhere in the call outlinks.add(new Outlink(url, anchor)), then no more links get extracted from the page. That is what my JIRA issue (http://issues.apache.org/jira/browse/NUTCH-120) covers.

Thanks for looking into this.

Earl
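The tag-first approach Earl describes, plus the per-link exception handling from NUTCH-120, can be sketched in Java. This is a hypothetical illustration, not Nutch's actual OutlinkExtractor (which uses Jakarta ORO rather than java.util.regex); the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the tag-then-attribute idea: find whole tags first,
// then pull the href value out of each tag, so plain-text URLs that are not
// in a tag are ignored.
public class TagLinkSketch {

    // Matches an entire tag, e.g. <a href="...">
    private static final Pattern TAG = Pattern.compile("<[^>]+?>");

    // Java translation of the Perl href regex: the optional quote is captured
    // and back-referenced, so "..."-quoted, '...'-quoted, and unquoted values
    // all work. CASE_INSENSITIVE replaces spelling out a-zA-Z.
    private static final Pattern HREF = Pattern.compile(
            "(?<![\\w.])href\\s*=([\"']?)(|.*?[^\\\\])\\1(\\s|>)",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher tags = TAG.matcher(html);
        while (tags.find()) {
            Matcher href = HREF.matcher(tags.group());
            if (href.find()) {
                try {
                    // In Nutch this is where new Outlink(url, anchor) would be
                    // built; catching per link means one malformed URL does not
                    // abort the rest of the page (the NUTCH-120 problem).
                    links.add(href.group(2));
                } catch (RuntimeException e) {
                    // skip the bad link, keep extracting
                }
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String page = "<a href=\"http://example.com/a\">one</a> "
                + "plain text http://example.com/ignored "
                + "<img src='x.png'> <A HREF=http://example.com/b>two</A>";
        System.out.println(extractLinks(page)); // [http://example.com/a, http://example.com/b]
    }
}
```

Note that the plain-text URL never reaches the href regex at all, since it is not inside a tag; that is the whole point of scanning for tags first.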
Re: crawl problems (a bug/patch)
By investigating further, I've found that for parse-html the links are extracted differently: the links are returned by DOMContentUtils.getOutlinks, based upon Neko, which makes me wonder how your links come to be extracted with OutlinkExtractor instead...

Earl, which Nutch version do you use? If your links are extracted by the OutlinkExtractor, it seems that it is not the HtmlParser that parses your document, but the TextParser (the default one) instead. There was a bug concerning content-types with parameters, such as text/html; charset=iso-8859-1. Moreover, your site returns such a content-type, so the ParserFactory didn't correctly find the right parser (HtmlParser) but used the default one. This issue is fixed in trunk and mapred.

For further details, see the thread on the nutch-dev mailing list:
http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00791.html
or the NUTCH-88 issue:
http://issues.apache.org/jira/browse/NUTCH-88

Regards,
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
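The content-type bug Jérôme describes boils down to the parameterized header value not matching the bare MIME type the plugin lookup expects. A minimal sketch of the idea, assuming a simple parameter-stripping normalization (this is illustrative, not the actual Nutch ParserFactory code; the class and method names are made up):

```java
// Illustrative sketch of the NUTCH-88 problem: "text/html; charset=iso-8859-1"
// does not string-match "text/html", so a naive parser lookup falls back to
// the default TextParser. Trimming the parameter part before the lookup
// makes the match succeed.
public class ContentTypeNormalizer {

    // Reduce "text/html; charset=iso-8859-1" to the bare MIME type.
    public static String baseType(String contentType) {
        if (contentType == null) {
            return "";
        }
        int semi = contentType.indexOf(';');
        String type = (semi >= 0) ? contentType.substring(0, semi) : contentType;
        return type.trim().toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(baseType("text/html; charset=iso-8859-1")); // text/html
    }
}
```

With the bare type in hand, the lookup for the HtmlParser plugin keyed on "text/html" would match regardless of what charset parameter the server appends.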
Re: crawl problems (a bug/patch)
Jérôme,

> which Nutch version do you use?

I kind of gave up on mapred for a while, so I am using trunk.

> There was a bug concerning content-types with parameters, such as text/html; charset=iso-8859-1.

Yeah, when I telnet in to GET / from shopthar.com, I get Content-Type: text/html; charset=iso-8859-1.

> This issue is fixed in trunk and mapred.

Hmm, well, I was seeing something earlier in trunk. That said, something happened and I now seem to get a partial crawl started. How very strange. I did catch a few updates today, but the commits sure didn't seem related.

Now I crawl for a while, and then it just stops. I still get new segments starting, but no new HTTP hits to the server. So it looks like I have something new to track down. But yeah, when it is going, it can hammer pretty good.

Earl