Karen Church wrote:
Hi Andrzej,
Thanks for the reply. Regarding the outlink limit - I thought it was a
limit of 100 outlinks per page by default? And in these cases the
first 100 outlinks are stored. I have a few pages like this in the
crawl database. The problem I'm having is the outlink object is empty
for some pages when on previous days the outlink object wasn't empty
and contained outlinks.
Ok, it's clear now.
At the moment I'm using the following code in my for loop while
reading the segment to make sure that I ignore pages that couldn't be
fetched and pages that could not be parsed:
if (fetcherOutput.getStatus() != FetcherOutput.SUCCESS)
{
    continue;
}
I've also checked the status of a couple of pages whose outlinks are
missing and they all appear to have a SUCCESS status.
My point was that there is another status (ParseData.status) which you
should check - the absence of outlinks indicates that there were
problems in parsing the page. Can you see things like page title,
metadata etc. under the ParseData section in the segread output? Can you
also see the page content, to confirm that it was fetched properly?
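To illustrate the two checks, here is a minimal, self-contained sketch. The Entry class and its fetchSuccess / parseSuccess fields are hypothetical stand-ins for the fetch status and the ParseData status of a segment record. They are not the actual Nutch classes, so the real calls should be taken from the API of the version in use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SegmentStatusSketch {

    // Hypothetical stand-in for one segment record; NOT a real Nutch class.
    static final class Entry {
        final String url;
        final boolean fetchSuccess;   // stands in for the fetch status check
        final boolean parseSuccess;   // stands in for the ParseData status check
        final List<String> outlinks;

        Entry(String url, boolean fetchSuccess, boolean parseSuccess, List<String> outlinks) {
            this.url = url;
            this.fetchSuccess = fetchSuccess;
            this.parseSuccess = parseSuccess;
            this.outlinks = outlinks;
        }
    }

    // Keep only the records that were both fetched and parsed successfully.
    static List<Entry> usable(List<Entry> entries) {
        List<Entry> kept = new ArrayList<>();
        for (Entry e : entries) {
            if (!e.fetchSuccess) {
                continue; // page could not be fetched
            }
            if (!e.parseSuccess) {
                continue; // fetched, but parsing failed: outlinks will be missing
            }
            kept.add(e);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Entry> entries = Arrays.asList(
            new Entry("http://a.example/", true, true, Arrays.asList("http://b.example/")),
            new Entry("http://c.example/", true, false, new ArrayList<String>()),
            new Entry("http://d.example/", false, false, new ArrayList<String>()));
        for (Entry e : usable(entries)) {
            System.out.println(e.url + " outlinks=" + e.outlinks.size());
        }
    }
}
```

The second record is the case discussed here: the fetch succeeded, so a fetch-status check alone lets it through, and only the parse-status check explains the empty outlink list.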
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com