On 6/27/07, Kai_testing Middleton <[EMAIL PROTECTED]> wrote:
> Wow, setting db.max.outlinks.per.page immediately fixed my problem. It looks
> like I totally misdiagnosed things.
>
> May I pose two questions:
> 1) how did you view all the outlinks?

bin/nutch plugin parse-html org.apache.nutch.parse.html.HtmlParser <local_file>

> 2) how severe is NUTCH-119 - does it occur on a lot of sites?

AFAIK, HtmlParser doesn't extract URLs with regexps. Nutch uses a
regexp to extract outlinks from files that have no markup information
(such as plain text). See OutlinkExtractor.java.
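
To illustrate what regexp-based extraction from plain text looks like in general, here is a minimal sketch. It is not the code or the pattern from OutlinkExtractor.java; the class name PlainTextLinkSketch, the method extractUrls, and the pattern itself are all made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: NOT the pattern used in Nutch's OutlinkExtractor.java.
final class PlainTextLinkSketch {

    // Hypothetical pattern: an http(s) scheme followed by a run of characters
    // that are not whitespace, angle brackets, or quotes.
    private static final Pattern URL_PATTERN =
            Pattern.compile("https?://[^\\s<>\"']+");

    // Returns every absolute URL found in the text, in order of appearance.
    static List<String> extractUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = URL_PATTERN.matcher(text);
        while (matcher.find()) {
            urls.add(matcher.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String text = "See http://example.com/a and https://example.org/b?x=1 for details";
        for (String url : extractUrls(text)) {
            System.out.println(url);
        }
    }
}
```

A pattern like this one only finds URLs that carry a scheme, since plain text has no markup to mark anything else as a link.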
----- Original Message ----
From: Doğacan Güney <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Tuesday, June 26, 2007 10:56:32 PM
Subject: Re: NUTCH-119 :: how hard to fix
On 6/27/07, Kai_testing Middleton <[EMAIL PROTECTED]> wrote:
> I am evaluating nutch+lucene as a crawl and search solution.
>
> However, I am finding major bugs in nutch right off the bat.
>
> In particular, NUTCH-119: nutch is not crawling relative URLs. I have some
> discussion of it here:
> http://www.mail-archive.com/[EMAIL PROTECTED]/msg08644.html
>
> Most of the links off www.variety.com, one of my main test sites, have
> relative URLs. It seems incredible that nutch, which is capable of mapreduce,
> cannot fetch these URLs.
>
> It could be that I would fix this bug if, for other reasons, I decide to go
> with nutch+lucene. Has anyone tried fixing this problem? Is it intractable? Or
> are the developers, who are just volunteers anyway, more interested in fixing
> other problems?
>
> Could someone outline the issue for me a bit more clearly so I would know how
> to evaluate it?

Both this site and the other one you mentioned (sf911truth) have
more than 100 outlinks. By default, Nutch stores only 100 outlinks
per page (db.max.outlinks.per.page). The link to about.html happens
to be the 105th or so, so Nutch doesn't store it. All you have to do
is either increase db.max.outlinks.per.page or set it to -1 (which
means store all outlinks).
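
As a sketch of the change described above, the override could look like this, assuming local settings live in conf/nutch-site.xml (the usual place for overriding nutch-default.xml; the file location is an assumption about your setup, and the description text is written for this example):

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>db.max.outlinks.per.page</name>
    <!-- -1 stores all outlinks; a positive value caps the number stored per page -->
    <value>-1</value>
    <description>Illustrative override: keep every outlink found on a page.</description>
  </property>
</configuration>
```

Pages that were already parsed with the old limit would presumably need to be re-fetched and re-parsed before the extra outlinks show up.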
--
Doğacan Güney