P.S. I'm still modifying.

On Sun, Mar 27, 2011 at 2:34 PM, Julien Nioche <[email protected]> wrote:
> Gabriele,
>
> I think it is a good idea to have a script like this however your proposal
> could be improved. It currently works only on a single machine and uses
> commands such as mv, ls etc... which won't work on a pseudo or fully
> distributed cluster. You should use the 'hadoop fs' commands instead.
>

Okay, let's go for 3 editions:
1. abridged, works only with Solr (tersest script)
2. unabridged with local fs, for beginners
3. unabridged with Hadoop

> If I understand the script correctly, you then merge different crawldbs.
> Why do you do that? There should be one crawldb per crawl so I don't think
> this is at all necessary.
>

So that I get a single dump with info about all the URLs crawled. On the scale of the web this is probably a bad idea, isn't it? But then how else could you inspect all the crawled URLs at once?

> Having a script would definitely be a plus for beginners and would give
> more flexibility than the crawl command.
>

And I would be the first of those beginners. The crawl command is not recommended for whole-web crawling, I guess because it doesn't work incrementally. Why not add such an option to crawl? Shall I file a feature request or submit a patch for that?

Thanks

> Julien
>
> P.S. I'm still modifying the page.
> --
> *
> *Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>

--
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x) < Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email does not contain a valid code then the email is not received. A valid code starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈ L(-[a-z]+[0-9]X)).

