Depending on whether you have deletes/updates, and if this is an ad-hoc job, you 
might want to just read the SSTables directly.

> On Feb 9, 2015, at 12:56 PM, Kevin Burton <bur...@spinn3r.com> wrote:
> 
> I had considered using spark for this but:
> 
> 1.  We tried to deploy Spark, only to find out that it was missing a number of 
> key things we need.  
> 
> 2.  Our app needs to shut down to release threads and resources.  Spark 
> doesn’t have support for this, so all the workers would have stale threads 
> leaking afterwards.  Though I guess if I can get the workers to fork, then I 
> should be OK.
> 
> 3.  Spark SQL actually returned invalid data for our queries… so that was kind 
> of a red flag and a non-starter.
> 
> On Mon, Feb 9, 2015 at 2:24 AM, Marcelo Valle (BLOOMBERG/ LONDON) 
> <mvallemil...@bloomberg.net> wrote:
> Just for the record, I was doing the exact same thing in an internal 
> application at the startup I used to work for. We needed to write custom code 
> to process, in parallel, all rows of a column family. Normally we would use 
> Spark for the job, but in our case the logic was a little more complicated, so 
> we wrote custom code. 
> 
> What we did was run N processes on M machines (N cores on each), each one 
> processing tasks. The tasks were created by splitting the range -2^63 to 
> 2^63 - 1 into N*M*10 tasks. Even though the data was not evenly distributed 
> across the tasks, no machine was ever idle: whenever a task completed, another 
> one was taken from the task pool.
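> 
> A minimal sketch of that scheme in Python (the bounds are Murmur3's full 
> token range; process_range and the counts are illustrative placeholders, 
> not our actual code):
> 
>     from multiprocessing import Pool
> 
>     MIN_TOKEN = -2**63
>     MAX_TOKEN = 2**63 - 1
> 
>     def make_tasks(n_tasks):
>         # Split the full token range into n_tasks contiguous sub-ranges.
>         width = (MAX_TOKEN - MIN_TOKEN) // n_tasks
>         bounds = [MIN_TOKEN + i * width for i in range(n_tasks)] + [MAX_TOKEN]
>         return list(zip(bounds, bounds[1:]))
> 
>     def process_range(task):
>         lo, hi = task
>         # Custom per-row logic for this token range goes here.
>         pass
> 
>     if __name__ == '__main__':
>         n_cores, n_machines = 8, 4                    # N cores, M machines
>         tasks = make_tasks(n_cores * n_machines * 10)
>         with Pool(n_cores) as pool:
>             # chunksize=1 makes each worker pull the next task as soon as
>             # it finishes one, so no machine sits idle.
>             pool.map(process_range, tasks, chunksize=1)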
> 
> It was fast enough for us, but I am interested in knowing if there is a 
> better way of doing it.
> 
> For your specific case, here is a tool we released as open source that may be 
> useful for simpler tests: https://github.com/s1mbi0se/cql_record_processor
> 
> Also, I guess you probably already know this, but I would consider using 
> Spark for the job.
> 
> Best regards,
> Marcelo.
> 
> From: user@cassandra.apache.org 
> Subject: Re: Fastest way to map/parallel read all values in a table?
> What’s the fastest way to map/parallel read all values in a table?
> 
> Kind of like a mini map only job.
> 
> I’m doing this to compute stats across our entire corpus.
> 
> What I did to begin with was use token() and then split it into the number of 
> splits I needed.
> 
> So I just took the total key range space which is -2^63 to 2^63 - 1 and broke 
> it into N parts.
> 
> Then the queries come back as:
> 
> select * from mytable where token(primaryKey) >= x and token(primaryKey) < y
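> 
> In driver terms (DataStax Python driver; the contact point and keyspace are 
> just placeholders), one split looks roughly like:
> 
>     from cassandra.cluster import Cluster
> 
>     cluster = Cluster(['127.0.0.1'])           # placeholder contact point
>     session = cluster.connect('mykeyspace')    # placeholder keyspace
> 
>     def read_split(lo, hi):
>         # %s placeholders are bound by the driver, not Python formatting.
>         return session.execute(
>             "SELECT * FROM mytable "
>             "WHERE token(primaryKey) >= %s AND token(primaryKey) < %s",
>             (lo, hi))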
> 
> From reading on this list I thought this was the correct way to handle this 
> problem.
> 
> However, I’m seeing horrible performance doing this.  After about 1% it just 
> flat out locks up.
> 
> Could it be that I need to randomize the token order so that it’s not 
> contiguous?  Maybe it’s all mapping onto the first box to begin with.
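> 
> E.g., a cheap way to test that theory would be to shuffle the splits before 
> dispatching them (splits being the list of (lo, hi) ranges above):
> 
>     import random
> 
>     # Hand out token ranges in random order so consecutive workers
>     # don't all start on the same replica.
>     random.shuffle(splits)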
> 
> 
> 
> -- 
> 
> Founder/CEO Spinn3r.com
> Location: San Francisco, CA
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile: 
> https://plus.google.com/102718274791889610666/posts
> 
