Re:Fastest way to map/parallel read all values in a table?

Marcelo Valle (BLOOMBERG/ LONDON) Mon, 09 Feb 2015 02:25:08 -0800

Just for the record, I was doing the exact same thing in an internal 
application at the startup where I used to work. We needed to write custom 
code to process all rows of a column family in parallel. Normally we would 
use Spark for the job, but in our case the logic was a little more complicated, 
so we wrote custom code.


What we did was run N processes on each of M machines (N cores in each), each 
process working through tasks. The tasks were created by splitting the range 
-2^63 to 2^63 - 1 into N*M*10 tasks. Even if the data was not evenly 
distributed across the tasks, no machine sat idle, because as soon as a task 
was completed another one was taken from the task pool.
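
As a rough illustration of the splitting and task-pool idea described above, here is a minimal Python sketch. It is not the code we actually ran: the names split_token_range, run_tasks and process_range are hypothetical, and a single-machine thread pool stands in for the N processes on M machines. The ranges are inclusive on both ends so that the last one can stop exactly at 2^63 - 1.

```python
from concurrent.futures import ThreadPoolExecutor

MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1


def split_token_range(num_tasks):
    """Split [MIN_TOKEN, MAX_TOKEN] into num_tasks contiguous inclusive ranges."""
    span = MAX_TOKEN - MIN_TOKEN + 1  # 2**64 tokens in total
    starts = [MIN_TOKEN + (i * span) // num_tasks for i in range(num_tasks)]
    # Each range ends one token before the next one starts.
    ends = [s - 1 for s in starts[1:]] + [MAX_TOKEN]
    return list(zip(starts, ends))


def run_tasks(ranges, process_range, workers):
    """Feed ranges to a fixed-size pool; a free worker takes the next pending range.

    process_range(start, end) is a hypothetical callback that would read and
    process the rows whose token falls in [start, end].
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda r: process_range(*r), ranges))


if __name__ == "__main__":
    n_procs, m_machines = 4, 3  # assumed values for N and M
    tasks = split_token_range(n_procs * m_machines * 10)
    # Stand-in work: just report how many tokens each range covers.
    sizes = run_tasks(tasks, lambda start, end: end - start + 1, workers=n_procs)
    print(len(tasks), sum(sizes) == 2**64)
```

Creating about ten times more tasks than workers is what keeps the pool busy: a worker that lands on a sparse range finishes early and simply picks up the next one.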

It was fast enough for us, but I am interested in knowing if there is a better 
way of doing it.

For your specific case, here is a tool we released as open source that may be 
useful for simpler tests: https://github.com/s1mbi0se/cql_record_processor

Also, you probably know this already, but I would consider using Spark for 
this.

Best regards,
Marcelo.

From: user@cassandra.apache.org 
Subject: Re:Fastest way to map/parallel read all values in a table?

What’s the fastest way to map/parallel read all values in a table?

Kind of like a mini map-only job.

I’m doing this to compute stats across our entire corpus.

What I did to begin with was use token() and then split the range into the 
number of splits I needed.

So I just took the total key range space which is -2^63 to 2^63 - 1 and broke 
it into N parts.

Then the queries come back as:

select * from mytable where token(primaryKey) >= x and token(primaryKey) < y
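
For illustration, the per-split queries could be generated along these lines. This is only a sketch under assumptions: token_range_queries is a hypothetical helper, and the table and key names are taken from the query above. One detail it makes explicit is that with a strict "<" upper bound on every split, the very top token (2^63 - 1) would never be selected, so the last split here uses "<=" instead.

```python
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1


def token_range_queries(num_splits, table="mytable", key="primaryKey"):
    """Return one CQL string per split, together covering the full token range."""
    span = MAX_TOKEN - MIN_TOKEN + 1  # 2**64 tokens in total
    bounds = [MIN_TOKEN + (i * span) // num_splits for i in range(num_splits)]
    bounds.append(MAX_TOKEN)
    queries = []
    for i in range(num_splits):
        # Half-open ranges everywhere except the last, which must include MAX_TOKEN.
        upper_op = "<=" if i == num_splits - 1 else "<"
        queries.append(
            f"select * from {table} where token({key}) >= {bounds[i]} "
            f"and token({key}) {upper_op} {bounds[i + 1]}"
        )
    return queries


if __name__ == "__main__":
    for query in token_range_queries(4):
        print(query)
```

In a real client the bounds would more likely be passed as bound parameters of a prepared statement than formatted into the string; the literal form is used here only to show the resulting ranges.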

From reading on this list I thought this was the correct way to handle this 
problem.

However, I’m seeing horrible performance doing this.  After about 1% it just 
flat out locks up.

Could it be that I need to randomize the token order so that it’s not 
contiguous?  Maybe it’s all mapping on the first box to begin with.


-- 

Founder/CEO Spinn3r.com
Location: San Francisco, CA
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
