Re: An extremely fast cassandra table full scan utility

2016-10-03 Thread Edward Capriolo
I undertook a similar effort a while ago.

https://issues.apache.org/jira/browse/CASSANDRA-7014

Other than the fact that it was closed with no comments, I can tell you
that other efforts of mine to embed things in Cassandra did not go
swimmingly, although at the time ideas like Groovy UDFs were also rejected.



Re: An extremely fast cassandra table full scan utility

2016-10-03 Thread Bhuvan Rawal
Hi Jonathan,

If a full scan is a regular requirement, then setting up a Spark cluster
co-located with the Cassandra nodes makes perfect sense. But supposing it is
a one-off requirement, say a weekly or fortnightly task, a Spark cluster
could be an added overhead in terms of capacity and resource planning as
far as operations and maintenance are concerned.

So this could be thought of as a simple substitute for a single-threaded
scan, without the additional effort of setting up and maintaining another
technology.

Regards,
Bhuvan



Re: An extremely fast cassandra table full scan utility

2016-10-03 Thread siddharth verma
Hi Jon,
It wasn't allowed.
Moreover, someone who isn't familiar with Spark, and might be new to
map/filter/reduce operations, could also use the utility for simple
operations that assume a sequential scan of the Cassandra table.

Regards
Siddharth Verma



Re: An extremely fast cassandra table full scan utility

2016-10-03 Thread Jonathan Haddad
Couldn't set it up as in you couldn't get it working, or it's not allowed?


Re: An extremely fast cassandra table full scan utility

2016-10-03 Thread Siddharth Verma
Hi Jon,
We couldn't set up a Spark cluster.
For one use case, a Spark cluster was required, but for some reason we
couldn't create one. Hence, one may use this utility to iterate
through the entire table at very high speed.

We had to find a workaround that would be faster than paging on the result set.

Regards

Siddharth Verma
*Software Engineer I - CaMS*
*M*: +91 9013689856, *T*: 011 22791596 *EXT*: 14697
CA2125, 2nd Floor, ASF Centre-A, Jwala Mill Road,
Udyog Vihar Phase - IV, Gurgaon-122016, INDIA



Re: An extremely fast cassandra table full scan utility

2016-10-03 Thread Jonathan Haddad
It almost sounds like you're duplicating all the work of both Spark and the
connector. May I ask why you decided not to use the existing tools?



Re: An extremely fast cassandra table full scan utility

2016-10-03 Thread Bhuvan Rawal
It will be interesting to have a comparison with Spark here for basic use
cases.

From a naive observation it appears that this could be slower than Spark, as
a lot of data is streamed over the network.

On the other hand, with this approach we have seen that young-generation GC
takes nearly a full CPU (possibly because a lot of data is moved on and off
the heap; the young generation has been observed to empty and fill, sometimes
multiple times a second), and that overhead should exist with Spark as well,
since it also calls the Cassandra driver. On top of that, the Spark cluster
shares the same compute resources where it filters and operates on the data.
If we have an appropriately sized client machine with enough network
bandwidth, this could potentially work faster, of course only for basic
scanning use cases.

Which of these assumptions seems more appropriate?



Re: An extremely fast cassandra table full scan utility

2016-10-03 Thread siddharth verma
Hi DuyHai,
Thanks for your reply.
A few more features are planned for the next version (if there is one), such as:
a custom policy that keeps in mind the replication of token ranges on
specific nodes,
finer-grained token ranges (for more speedup),
and a few more.

On fine-graining a token range: if one token range is split further into,
say, 2-3 parts divided among threads, that would exploit the possible
parallelism on a large scaled-out cluster.
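
To make the fine-graining idea concrete, here is a rough Python sketch of my own (the utility itself is in Java; this is an illustration, not its actual code): each primary token range is split into k contiguous sub-ranges that can then be handed to separate threads.

```python
def subdivide_range(start, end, parts):
    """Split one (start, end] token range into `parts` contiguous sub-ranges."""
    width = (end - start) // parts
    subs = [(start + i * width, start + (i + 1) * width)
            for i in range(parts - 1)]
    # The last sub-range absorbs any remainder from integer division.
    subs.append((start + (parts - 1) * width, end))
    return subs
```

For example, subdividing the range (0, 10] into 3 parts yields (0, 3], (3, 6], and (6, 10], which could each be scanned by a different thread.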

And, regarding the JIRA you mentioned, streaming of requests would be of
huge help combined with further splitting the range.

Thanks once again for your valuable comments. :-)

Regards,
Siddharth Verma


Re: An extremely fast cassandra table full scan utility

2016-10-03 Thread DuyHai Doan
Hello Siddharth

I just had a look over the architecture diagram. The idea of using
multiple threads, one for each token range, is great. It helps max out
parallelism.

With https://issues.apache.org/jira/browse/CASSANDRA-11521 it would be even
faster.
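
The one-thread-per-token-range scheme can be sketched roughly like this (my own Python illustration, not the utility's code; `scan_range` is a hypothetical stand-in for a driver query restricted to one token range):

```python
from concurrent.futures import ThreadPoolExecutor

def scan_range(token_range):
    # Stand-in for a range-restricted driver query such as:
    #   SELECT ... WHERE token(pk) > ? AND token(pk) <= ?
    start, end = token_range
    return end - start  # pretend "rows read", for illustration only

def parallel_scan(token_ranges, num_threads):
    """Fan each token range out to a worker thread and combine the results."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return sum(pool.map(scan_range, token_ranges))
```

With real driver calls in `scan_range`, each worker pages through only its own slice of the ring, so the ranges are consumed concurrently rather than one after another.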



An extremely fast cassandra table full scan utility

2016-10-03 Thread siddharth verma
Hi,
I was working on a utility which can be used for cassandra full table scan,
at a tremendously high velocity, cassandra fast full table scan.
How fast?
The script dumped ~229 million rows in 116 seconds, on a cluster of 6 nodes.
Data transfer rates of up to 25 MB/s were observed on the Cassandra nodes.

For one use case, a Spark cluster was required, but for some reason we
couldn't create one. In its place, one may use this utility to iterate
through the entire table at very high speed.

Now, for any full scan, I use it freely in my ad hoc Java programs to
manipulate or aggregate Cassandra data.

You can customize the options, such as fetch size, consistency level, and
degree of parallelism (number of threads), according to your needs.
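
For context on how a full-ring scan is typically parallelized, here is a minimal sketch (my own illustration in Python, assuming the default Murmur3Partitioner token bounds; it is not code from the utility) of dividing the token ring into contiguous ranges, one per thread:

```python
MIN_TOKEN = -2**63      # Murmur3Partitioner minimum token
MAX_TOKEN = 2**63 - 1   # Murmur3Partitioner maximum token

def split_token_ring(num_splits):
    """Divide the full Murmur3 token ring into contiguous (start, end] ranges."""
    width = (MAX_TOKEN - MIN_TOKEN) // num_splits
    ranges = []
    start = MIN_TOKEN
    for _ in range(num_splits - 1):
        ranges.append((start, start + width))
        start += width
    ranges.append((start, MAX_TOKEN))  # last range absorbs the remainder
    return ranges
```

Each resulting range can then be queried independently with a token-restricted SELECT, which is what makes one-thread-per-range parallelism possible.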

You can visit https://github.com/siddv29/cfs to go through the code, see
the logic behind it, or try it in your program.
A sample program is also provided.

I coded this utility in Java.

Bhuvan Rawal (bhu1ra...@gmail.com) and I worked on this concept.
For Python, you may visit his blog (
http://casualreflections.io/tech/cassandra/python/Multiprocess-Producer-Cassandra-Python)
and gist (
https://gist.github.com/bhuvanrawal/93c5ae6cdd020de47e0981d36d2c0785)

Looking forward to your suggestions and comments.

P.S. Give it a try. Trust me, the iteration speed is awesome!
It is a bare-bones application, built in a hurry. If you would like to
contribute to the Java utility, or add to and build on it, do reach out:
sidd.verma29.li...@gmail.com

Thanks and Regards,
Siddharth Verma
(previous email id on this mailing list : verma.siddha...@snapdeal.com)