----- Original Message -----
From: "Andrzej Bialecki" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Monday, November 07, 2005 11:40 AM
Subject: Re: Outlinks?
Karen Church wrote:
Hi Andrzej,
Thanks for the reply. Regarding the outlink limit - I thought it was a
limit of 100 outlinks per page by default? And in these cases the first
100 outlinks are stored. I have a few pages like this in the crawl
database. The problem I'm having is that the outlink object is empty for
some pages when on previous days the outlink object wasn't empty and
contained outlinks.
Ok, it's clear now.
At the moment I'm using the following code in my FOR loop while reading
the segment to make sure that I ignore pages that couldn't be fetched and
pages that could not be parsed....
if (fetcherOutput.getStatus() != FetcherOutput.SUCCESS)
{
    continue;
}
I've also checked the status of a couple of pages whose outlinks are
missing and they all appear to have a SUCCESS status.
My point was that there is another status (ParseData.status) which you
should check - the absence of outlinks indicates that there were problems
in parsing the page. Can you see things like page title, metadata etc.
under ParseData section in the segread output? Can you also see the page
content, to confirm that it was fetched properly?
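
A minimal sketch of the two-level check described above: skip a record when the fetch failed, and also skip it when the fetch succeeded but the parse did not. The types here (FetchStatus, ParseStatus, SegmentRecord) are hypothetical stand-ins written for illustration, not the real Nutch classes, and the actual accessor names depend on the Nutch version in use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-ins for the records read from a segment (not the Nutch API).
class TwoLevelStatusCheck {
    enum FetchStatus { SUCCESS, RETRY, NOT_FOUND }
    enum ParseStatus { SUCCESS, FAILED }

    static final class SegmentRecord {
        final String url;
        final FetchStatus fetchStatus;
        final ParseStatus parseStatus;
        final List<String> outlinks;

        SegmentRecord(String url, FetchStatus fetchStatus,
                      ParseStatus parseStatus, List<String> outlinks) {
            this.url = url;
            this.fetchStatus = fetchStatus;
            this.parseStatus = parseStatus;
            this.outlinks = outlinks;
        }
    }

    // Keep only records that were both fetched and parsed successfully.
    static List<SegmentRecord> usable(List<SegmentRecord> records) {
        List<SegmentRecord> kept = new ArrayList<>();
        for (SegmentRecord r : records) {
            if (r.fetchStatus != FetchStatus.SUCCESS) {
                continue; // page could not be fetched
            }
            if (r.parseStatus != ParseStatus.SUCCESS) {
                continue; // fetched, but the parser reported a problem
            }
            kept.add(r);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<SegmentRecord> records = Arrays.asList(
            new SegmentRecord("http://example.com/a.wml", FetchStatus.SUCCESS,
                ParseStatus.SUCCESS, Arrays.asList("http://example.com/b.wml")),
            new SegmentRecord("http://example.com/c.wml", FetchStatus.SUCCESS,
                ParseStatus.FAILED, Collections.<String>emptyList()),
            new SegmentRecord("http://example.com/d.wml", FetchStatus.NOT_FOUND,
                ParseStatus.FAILED, Collections.<String>emptyList()));
        for (SegmentRecord r : usable(records)) {
            System.out.println(r.url + " outlinks=" + r.outlinks.size());
        }
    }
}
```

With only the fetch check in place, the second record above would pass through with an empty outlink list, which matches the symptom described in this thread.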
I didn't realize there was a ParseData.status. At the moment I'm not
checking the ParseData status but I've just checked and for the pages with
missing outlinks I can see the content (parsed text) and the metadata of the
page but the titles are blank when they previously were not. It definitely
points to a parsing error; however, I'm using version 6 of Nutch, which
doesn't support ParseData.status.
Also, this isn't a problem with the HTML parser provided with Nutch - this
is a parser I wrote for WML pages so it could well be a problem with this.
It's just strange that the title and outlinks are present on one day and
gone the next, even though the content and metadata remain untouched. This
obviously points to errors in my code - I'll have to look into this in more
detail....
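
Where no parse status is available, one possible workaround is to flag the symptom itself: a page whose parsed text is present but whose title and outlinks are both empty, compared against an earlier day's record for the same page. The sketch below is only a heuristic under that assumption. PageSnapshot and both method names are hypothetical, and a page that genuinely has no title and no links would be flagged too.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class ParseRegressionCheck {
    // Hypothetical snapshot of what one day's segment holds for a page.
    static final class PageSnapshot {
        final String title;
        final String text;
        final List<String> outlinks;

        PageSnapshot(String title, String text, List<String> outlinks) {
            this.title = title;
            this.text = text;
            this.outlinks = outlinks;
        }
    }

    static boolean isBlank(String s) {
        return s == null || s.trim().isEmpty();
    }

    // Parsed text is present, but the title and the outlinks are both missing.
    static boolean looksLikeParseFailure(PageSnapshot p) {
        return !isBlank(p.text) && isBlank(p.title) && p.outlinks.isEmpty();
    }

    // The earlier snapshot had a title or outlinks; the current one lost both.
    static boolean regressed(PageSnapshot previous, PageSnapshot current) {
        boolean hadParsedFields =
            !isBlank(previous.title) || !previous.outlinks.isEmpty();
        return hadParsedFields && looksLikeParseFailure(current);
    }

    public static void main(String[] args) {
        PageSnapshot monday = new PageSnapshot("News", "Top stories",
            Arrays.asList("http://example.com/story.wml"));
        PageSnapshot tuesday = new PageSnapshot("", "Top stories",
            Collections.<String>emptyList());
        System.out.println("regressed: " + regressed(monday, tuesday));
    }
}
```

Logging the pages for which regressed(...) is true, together with the raw content from both days, would give a concrete set of inputs to replay through the custom WML parser.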
Thanks and regards,
Karen
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com