Hello, I've had this draft lurking for a while now, and before archiving it for personal reference I wondered whether it's accurate, and whether you'd recommend posting it to the wiki.
Nutch maintains a crawldb (and a linkdb, for that matter) of the URLs it has crawled, together with their fetch status and fetch date. This data is kept after the fetch so that pages can be re-crawled once the re-crawl interval has elapsed. At the same time, Solr maintains an inverted index of all the fetched pages. It would seem more efficient if Nutch relied on that index instead of maintaining its own crawldb, so as not to store the same URL twice. [BUT THAT'S JUST A KEY/ID, NOT WASTE AT ALL, WOULD ALSO END UP THE SAME IN SOLR]
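
If this does go to the wiki, a small sketch might make the bracketed note concrete. The code below is only an illustration, not Nutch's actual classes or field names (CrawlDbEntry, FetchStatus, the example URL and the 30-day interval are all assumptions made up for the sketch): it shows the kind of per-URL bookkeeping a crawldb entry carries, and that the URL itself is just the record key, which would reappear as the document id in Solr anyway.

import java.time.Duration;
import java.time.Instant;

// Illustrative sketch only -- not Nutch's real crawldb API.
public class CrawlDbEntrySketch {

    enum FetchStatus { UNFETCHED, FETCHED, GONE, REDIRECT }

    // Hypothetical crawldb record: keyed by URL; everything else is
    // crawl bookkeeping that the Solr index has no reason to hold.
    record CrawlDbEntry(String url,
                        FetchStatus status,
                        Instant lastFetch,
                        Duration refetchInterval) {

        // A page is due for re-crawl once its interval has elapsed.
        boolean dueForRefetch(Instant now) {
            return status == FetchStatus.UNFETCHED
                    || now.isAfter(lastFetch.plus(refetchInterval));
        }
    }

    public static void main(String[] args) {
        CrawlDbEntry entry = new CrawlDbEntry(
                "http://example.org/",             // key: would also be the Solr doc id
                FetchStatus.FETCHED,
                Instant.parse("2010-09-20T00:00:00Z"),
                Duration.ofDays(30));              // assumed 30-day re-crawl period

        System.out.println("Refetch now? " + entry.dueForRefetch(Instant.now()));
    }
}

Seen this way, the only thing the crawldb and the Solr index would share is the URL key; the fetch status and timing fields have no natural place in an inverted index, so dropping the crawldb would not really save anything.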