If this is an ad-hoc job, you might want to just read the SSTables directly. Whether that works depends on whether you have deletes/updates, since those leave tombstones and multiple versions of the same row spread across SSTables, and a raw scan won't merge them for you the way a CQL read does.
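For a quick look, the sstable2json tool that ships with Cassandra 2.x dumps an SSTable's contents as JSON (the data file path below is illustrative; actual file names vary by version):

    sstable2json /var/lib/cassandra/data/mykeyspace/mytable/mykeyspace-mytable-ka-1-Data.db

For anything heavier you'd open the Data.db files through the Cassandra internals yourself, but then tombstone handling and row merging are on you.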
> On Feb 9, 2015, at 12:56 PM, Kevin Burton <bur...@spinn3r.com> wrote:
>
> I had considered using Spark for this, but:
>
> 1. We tried to deploy Spark, only to find out that it was missing a
> number of key things we need.
>
> 2. Our app needs to shut down to release threads and resources. Spark
> doesn't have support for this, so all the workers would leak stale
> threads afterwards. Though I guess if I can get workers to fork then I
> should be OK.
>
> 3. Spark SQL actually returned invalid data to our queries… so that was
> kind of a red flag and a non-starter.
>
> On Mon, Feb 9, 2015 at 2:24 AM, Marcelo Valle (BLOOMBERG/ LONDON)
> <mvallemil...@bloomberg.net> wrote:
> Just for the record, I was doing the exact same thing in an internal
> application at the startup I used to work for. We needed custom code to
> process all rows of a column family in parallel. Normally we would use
> Spark for the job, but in our case the logic was a little more
> complicated, so we wrote our own.
>
> What we did was run N processes on M machines (N cores on each), each
> one processing tasks. The tasks were created by splitting the token
> range -2^63 to 2^63 - 1 into N*M*10 tasks. Even if data was not evenly
> distributed across the tasks, no machine sat idle: when a task
> completed, another one was taken from the task pool.
>
> It was fast enough for us, but I am interested in knowing if there is a
> better way of doing it.
>
> For your specific case, here is a tool we released as open source that
> can be useful for simpler tests:
> https://github.com/s1mbi0se/cql_record_processor
>
> Also, I guess you probably know this already, but I would consider
> using Spark for this.
>
> Best regards,
> Marcelo.
>
> From: user@cassandra.apache.org
> Subject: Re: Fastest way to map/parallel read all values in a table?
>
> What's the fastest way to map/parallel read all values in a table?
>
> Kind of like a mini map-only job.
>
> I'm doing this to compute stats across our entire corpus.
>
> What I did to begin with was use token() and then split it into the
> number of splits I needed.
>
> So I just took the total key range, which is -2^63 to 2^63 - 1, and
> broke it into N parts.
>
> Then the queries come back as:
>
> select * from mytable where token(primaryKey) >= x and token(primaryKey) < y
>
> From reading this list, I thought this was the correct way to handle
> this problem.
>
> However, I'm seeing horrible performance doing this. After about 1% it
> just flat out locks up.
>
> Could it be that I need to randomize the token order so that it's not
> contiguous? Maybe it's all mapping onto the first box to begin with.
>
> --
> Founder/CEO Spinn3r.com
> Location: San Francisco, CA
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile
> <https://plus.google.com/102718274791889610666/posts>
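For reference, the token-range splitting Marcelo describes might look roughly like the sketch below (Java; assumes Murmur3Partitioner, where tokens are signed 64-bit longs; class and variable names are illustrative). Shuffling the ranges also addresses the "randomize the token order" question at the end of the thread: it keeps consecutive tasks from all landing on the same replica.

    import java.math.BigInteger;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    // Split the full Murmur3 token range [-2^63, 2^63 - 1] into numSplits
    // inclusive [start, end] sub-ranges, then shuffle them so consecutive
    // tasks don't all map onto the same node.
    public class TokenSplitter {
        public static List<long[]> split(int numSplits) {
            BigInteger min = BigInteger.valueOf(Long.MIN_VALUE);
            BigInteger max = BigInteger.valueOf(Long.MAX_VALUE);
            // The full span is 2^64, which overflows a signed long,
            // so do the arithmetic in BigInteger.
            BigInteger step = max.subtract(min).add(BigInteger.ONE)
                                 .divide(BigInteger.valueOf(numSplits));
            List<long[]> ranges = new ArrayList<>();
            BigInteger start = min;
            for (int i = 0; i < numSplits; i++) {
                BigInteger end = (i == numSplits - 1)
                        ? max   // last range absorbs the division remainder
                        : start.add(step).subtract(BigInteger.ONE);
                // Each boundary fits in a long, only the span doesn't.
                ranges.add(new long[] { start.longValue(), end.longValue() });
                start = end.add(BigInteger.ONE);
            }
            Collections.shuffle(ranges); // randomize task order across the ring
            return ranges;
        }
    }

Each worker then pulls a range off a shared queue and runs a query of the form already shown above, with inclusive bounds to match the ranges:

    select * from mytable where token(primaryKey) >= ? and token(primaryKey) <= ?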