Nutch Crawl a Specific List Of URLs (150K)
Thanks for all the responses; they are very inspiring, and diving into the
logs is a very beneficial way to learn Nutch.
The fact is that I use Python BeautifulSoup to parse the sitemap of my targeted
website, which comes up with those 150K URLs; however, it
…tell you how much it crawled and why it likely stopped when it did.
Cheers
-----Original message-----
From: Bin Wang
Sent: Friday 27th December 2013 19:50
To: dev@nutch.apache.org
Subject: Nutch Crawl a Specific List Of URLs (150K)
Hi,
I have a very specific list of URLs, which is about 140K URLs.
Hi Bin Wang,
>> nohup bin/nutch crawl urls -dir result -depth 1 -topN 20 &
Were you creating a new crawldb or reusing an old one?
Were you running this on a cluster or in local mode?
Was there any failure due to which the fetch round got aborted? (See the logs
for this.)
I would like to rep…
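A sketch of how those questions could be checked from the shell, assuming a
local-mode Nutch 1.x run started from the Nutch home directory (in local mode
jobs log to logs/hadoop.log; the `result` directory name is taken from the
command quoted above):

```shell
# Count how many URLs the fetcher actually attempted (local-mode log).
grep -c "fetching" logs/hadoop.log

# Surface any errors/exceptions that may have aborted the fetch round.
grep -iE "error|exception" logs/hadoop.log | tail -20

# Summarize crawldb status counts (db_fetched, db_unfetched, ...):
# bin/nutch readdb result/crawldb -stats
```

The second grep is usually the quickest way to see why a round stopped early.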
Hi Bin,
You have an interesting error. I don't use 1.7, but I can try with the screen
command. I believe you will not get the same error.
Talat
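A minimal sketch of that suggestion, assuming GNU screen is installed and
reusing the crawl command from the original post; the session name
`nutch-crawl` is just an illustrative choice:

```shell
# Start the crawl in a detached screen session so it survives the
# terminal closing (unlike a plain `nohup ... &` in some environments).
screen -dmS nutch-crawl bin/nutch crawl urls -dir result -depth 1 -topN 20

# Later, reattach to watch progress (detach again with Ctrl-a d):
screen -r nutch-crawl
```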
2013/12/27 Bin Wang
Hi,
I have a very specific list of URLs, which is about 140K URLs.
I switched off `db.update.additions.allowed` so it will not update the
crawldb... and I was assuming I could feed all the URLs to Nutch, and after
one round of fetching, it would finish and leave all the raw HTML files in
the segments.
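For reference, that switch is a standard Nutch property (its default is
defined in conf/nutch-default.xml); a minimal override sketch for
conf/nutch-site.xml:

```xml
<!-- Keep updatedb from adding newly discovered outlinks, so the
     crawldb stays restricted to the injected seed URLs. -->
<property>
  <name>db.update.additions.allowed</name>
  <value>false</value>
</property>
```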