Hi everyone,
I have two questions that have confused me for weeks. If anyone here can
help me, I would be very grateful!
First, as far as I know, Nutch does not store the HTTP status code itself.
Instead, it encodes the fetch result as a single status byte. Suppose Nutch
fetches a bad link whose HTTP status is not 200 (e.g. 203, 307, 404, ...),
or a link that is denied by robots.txt, or one that the website throttles
because it is fetched too frequently. How can we distinguish between these
conditions from the status byte alone (e.g. db_gone, db_redir_temp)?
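To make the problem concrete, here is a rough sketch of what I think
happens. The class and method names are my own invention, not Nutch's API,
and the mapping is only my simplified guess. The point is that several
different causes end up with the same status, so the cause cannot be
recovered afterwards.

```java
// Hypothetical illustration only: these names are mine, not Nutch classes,
// and the mapping below is an assumed simplification of the real behaviour.
final class StatusCollapse {

    // Collapse a fetch outcome into a single CrawlDb-style status name.
    static String collapse(int httpCode, boolean robotsDenied) {
        if (robotsDenied) return "db_gone";              // assumed: robots denial looks "gone"
        if (httpCode == 200) return "db_fetched";
        if (httpCode == 304) return "db_notmodified";
        if (httpCode == 301) return "db_redir_perm";
        if (httpCode == 302 || httpCode == 307) return "db_redir_temp";
        if (httpCode == 404 || httpCode == 410) return "db_gone";
        return "db_unfetched";                           // assumed: anything else is retried later
    }

    public static void main(String[] args) {
        // A 404 and a robots denial produce the same status in this sketch.
        System.out.println(collapse(404, false));
        System.out.println(collapse(200, true));
        // A throttling response such as 503 is not distinguishable from other failures.
        System.out.println(collapse(503, false));
    }
}
```

In this sketch a 404 and a robots-denied page both come out as db_gone,
which is exactly the ambiguity I am asking about.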
Second, I know a little about the ranking and scoring mechanism in Nutch. I
understand that LinkRank is the main algorithm, but it contributes only a
single score factor to Nutch's index. What other factors affect indexing
and search in Nutch? Also, the webgraph has not yet been ported to the
GORA-based API in Nutch 2.0. What happens to the results if we index and
search with Nutch 2.0?
--
View this message in context:
http://lucene.472066.n3.nabble.com/Two-questions-about-Nutch-tp4002589.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.