Re: crawl problems (a bug/patch)

2005-10-20 Thread Earl Cahill
Hi Sébastien,

Yahoo! just hosed my message, glad I had it elsewhere.

 As you probably saw in the OutlinkExtractor class,
 the links are
 extracted with a Regexp.  

Ahh, didn't see it before, but I now see URL_PATTERN. 


I know it's minor, but if you later apply
Perl5Compiler.CASE_INSENSITIVE_MASK, you don't need to
do a-zA-Z.
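
For instance, something like this (just a quick
sketch with the Jakarta ORO Perl5Compiler, not the
actual OutlinkExtractor code):

import org.apache.oro.text.regex.*;

Perl5Compiler compiler = new Perl5Compiler();
// with the mask set, [a-z] also matches A-Z
// (compile() throws MalformedPatternException)
Pattern p = compiler.compile("[a-z]+",
    Perl5Compiler.CASE_INSENSITIVE_MASK);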

For me, there are two groups of issues.

1.  URL_PATTERN issues.  URL_PATTERN matches things
that aren't really links, and doesn't match things
that are links.  This is what you talk about below,
and what your JIRA issue covers.

It appears that URL_PATTERN aims to match anything
that looks like a link on the page, whether it's in a
tag or not.  I'm wondering about finding tags first,
and then looking for links in the tags.  It seems a
little much to crawl urls that are plain text and not
in a tag, since they aren't really links.  A simple
tag regex in perl looks like this: /<[^>]+?>/.
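
In java, that first pass might look something like
this (a rough sketch with java.util.regex, where html
is just a String holding the page):

import java.util.regex.*;

Pattern TAG = Pattern.compile("<[^>]+?>");
Matcher tags = TAG.matcher(html);
while (tags.find()) {
    String tag = tags.group();  // e.g. <a href="...">
    // second pass: pull href/src out of each tag
}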

Extracting key/value pairs is a little harder, but
given $key, the second capture group from this regex

m/(?<!\w|\.)\Q$key\E\s*=(["']?)(|.*?[^\\])\1(\s|>)/is

returns the value.  This regex has gone through some
rather rigorous testing/use.

Getting $key interpolated into the regex seems a
little harder in java, but we would mostly be
looking for href and src, and maybe not even src.
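
On java 5, Pattern.quote gets you the \Q$key\E effect
(on 1.4 you could embed \Q...\E in the pattern string
yourself).  A hedged sketch:

import java.util.regex.*;

// key is the attribute to pull out, e.g. "href" or "src"
static Pattern keyValuePattern(String key) {
    return Pattern.compile(
        "(?<![\\w.])" + Pattern.quote(key)
        + "\\s*=([\"']?)(|.*?[^\\\\])\\1(\\s|>)",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
}

Given a match, group(2) is the value, same as in the
perl version.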

The regex for href looks like this

m/(?<!\w|\.)href\s*=(["']?)(|.*?[^\\])\1(\s|>)/is

If you commit your TestPattern.java stuff, I would be
happy to add cases and play with the regexes until all
the cases work.

2.  If a link is encountered that throws an exception
anywhere in this call

outlinks.add(new Outlink(url, anchor))

then no more links will get extracted on the page. 
That is what my JIRA issue
(http://issues.apache.org/jira/browse/NUTCH-120)
covers.
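
The obvious fix is to catch per link instead of
letting one bad url kill the whole loop, something
like this, where urls holds the candidate links from
the pattern pass (just a sketch, the real patch is on
the issue):

for (int i = 0; i < urls.length; i++) {
    try {
        outlinks.add(new Outlink(urls[i], anchor));
    } catch (Exception e) {
        // skip the bad link, keep extracting the rest
        // (could narrow this to MalformedURLException
        // if that's all the constructor throws)
    }
}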

Thanks for looking into this.

Earl





Re: crawl problems (a bug/patch)

2005-10-20 Thread Jérôme Charron
 By investigating further, I've found that for parse-html, the links are
 extracted differently: the links are returned by
 DOMContentUtils.getOutlinks based upon Neko, which therefore makes me
 wonder how you get to extract links with OutlinkExtractor instead...

Earl,

which Nutch version do you use?
If your links are extracted by the OutlinkExtractor, it seems that it is not
the HtmlParser that is used to parse your document, but the TextParser
instead (the default one).
There was a bug concerning content-types with parameters such as
text/html; charset=iso-8859-1.
Moreover, your site returns such a content-type, so the ParserFactory
doesn't find the right parser (HtmlParser), but uses the default
one.
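In other words, the fix boils down to stripping the
parameters before the parser lookup, roughly (a
sketch, not the actual NUTCH-88 patch):

// "text/html; charset=iso-8859-1" -> "text/html"
static String cleanContentType(String contentType) {
    int semi = contentType.indexOf(';');
    String mime = (semi < 0)
        ? contentType : contentType.substring(0, semi);
    return mime.trim().toLowerCase();
}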
This issue is fixed in trunk and mapred.
For further details,
see the thread on the nutch-dev mailing list:
http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00791.html
or the NUTCH-88 issue : http://issues.apache.org/jira/browse/NUTCH-88

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: crawl problems (a bug/patch)

2005-10-20 Thread Earl Cahill
Jérôme,

 which Nutch version do you use?

Kind of gave up on mapred for a while, so I am using
trunk.

 There were a bug concerning the content-types with
 parameters such as
 text/html; charset=iso-8859-1.

Yeah, when I telnet in and GET / from shopthar.com, I get

Content-Type: text/html; charset=iso-8859-1

 This issue is fixed in trunk and mapred.

Hmm, well, I was seeing something earlier in trunk. 
That said, something happened and I now seem to get a
partial crawl started.  How very strange.  I did catch
a few updates today, but the commits sure didn't seem
related.

Now I crawl for a while, and then it just stops.  I
still get new segments starting, but no new http hits
to the server.  So looks like I have something new to
track down.  But yeah, when it is going, it can hammer
pretty good.

Earl


