Re: unsubscribe

2015-02-22 Thread Mark Reddy
Raman,

To unsubscribe send a mail to user-unsubscr...@cassandra.apache.org


On 22 February 2015 at 03:59, Raman ra...@assetcorporation.net wrote:

 unsubscribe



Re: Running Cassandra + Spark on AWS - architecture questions

2015-02-22 Thread Eric Stevens
I'm not sure if this is a good use case for you, but you might also
consider setting up several keyspaces, one for any data you want available
for analytics (such as business object tables), and one for data you don't
want to do analytics on (such as custom secondary indices).  Maybe a third
one for data which should only exist in the analytics space, such as for
temporary rollup data.

This can reduce the amount of data you replicate into your analytics space,
and allow you to run a smaller analytics cluster than your production
cluster.
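
A minimal sketch of that layout, assuming NetworkTopologyStrategy and
made-up keyspace and data center names (not from the original setup):

    -- replicated to both DCs: usable from real time and from analytics
    CREATE KEYSPACE business_objects
      WITH replication = {'class': 'NetworkTopologyStrategy',
                          'realtime': 3, 'analytics': 2};

    -- real-time DC only, e.g. custom secondary index tables
    CREATE KEYSPACE custom_indices
      WITH replication = {'class': 'NetworkTopologyStrategy', 'realtime': 3};

    -- analytics DC only, e.g. temporary rollup data
    CREATE KEYSPACE rollups
      WITH replication = {'class': 'NetworkTopologyStrategy', 'analytics': 1};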

On Fri, Feb 20, 2015 at 2:43 PM, DuyHai Doan doanduy...@gmail.com wrote:

 Cassandra would take care of keeping the data synced between the two
 sets of five nodes.  Is that correct?

 Correct

 But doing so means that we need 2x as many nodes as we need for the
 real-time cluster alone

 Not necessarily. With multiple DCs you can configure the replication factor
 per DC, meaning that you can have RF=3 for the real-time DC and RF=1 or
 RF=2 for the analytics DC. Thus the number of nodes can be different for
 each DC.

 In addition, you can also tune the hardware. If the real-time DC mostly
 takes writes for incoming data and reads only from aggregated tables, it is
 less IO-intensive than the analytics DC, which does lots of reads and
 writes to compute aggregations.
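
 For instance, a single keyspace can declare a different RF per DC (the DC
 names below are just placeholders):

     CREATE KEYSPACE timeseries
       WITH replication = {'class': 'NetworkTopologyStrategy',
                           'realtime': 3, 'analytics': 2};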



 On Fri, Feb 20, 2015 at 10:17 PM, Clint Kelly clint.ke...@gmail.com
 wrote:

 Hi all,

 I read the DSE 4.6 documentation and I'm still not 100% sure what a mixed
 workload Cassandra + Spark installation would look like, especially on
 AWS.  What I gather is that you use OpsCenter to set up the following:


   - One virtual data center for real-time processing (e.g., ingestion of
     time-series data, replying to requests for an interactive application)
   - Another virtual data center for batch analytics (Spark, possibly for
     machine learning)


 If I understand this correctly, if I estimate that I need a five-node
 cluster to handle all of my data, under the system described above, I would
 have five nodes serving real-time traffic and all of the data replicated in
 another five nodes that I use for batch processing.  Cassandra would take
 care of keeping the data synced between the two sets of five nodes.  Is
 that correct?

 I assume the motivation for such a dual-virtual-data-center architecture
 is to prevent the Spark jobs (which are going to do lots of scans from
 Cassandra, and maybe run computation on the same machines hosting
 Cassandra) from disrupting the real-time performance.  But doing so means
 that we need 2x as many nodes as we need for the real-time cluster alone.

 *Could someone confirm that my interpretation above of what I read about
 in the DSE documentation is correct?*

 If my application needs to run analytics on Spark only a few hours a day,
 might we be better off spending our money to get a bigger Cassandra cluster
 and then just spin up Spark jobs on EMR for a few hours at night?  (I know
 this is a hard question to answer, since it all depends on the
 application---just curious if anyone else here has had to make similar
 tradeoffs.)  e.g., maybe instead of having a five-node real-time cluster,
 we would have an eight-node real-time cluster, and use our remaining budget
 on EMR jobs.

 I am curious if anyone has any thoughts / experience about this.

 Best regards,
 Clint





Efficient .net client for cassandra

2015-02-22 Thread Asit KAUSHIK
Hi All,

We have been able to get the case-specific full-text search we need working
using Stratio Cassandra. It has a modified secondary index API which uses
Lucene indices. The performance also seems good to me. Still, I wanted to
ask you gurus:

1) Has anybody used Stratio, and are there any drawbacks to it?
2) We are using .NET as the client to extract data, and it lacks
performance. I am using traditional connection pooling and then executing
prepared statements. If anybody is using a specific client for .NET, your
experience would help me on this.

Thanks in advance for the help

Thanks and Regards
Asit


Re: C* 2.1.3 - Incremental replacement of compacted SSTables

2015-02-22 Thread Marcus Eriksson
We had some issues with it right before we wanted to release 2.1.3, so we
temporarily(?) disabled it. It *might* get removed entirely in 2.1.4. If you
have any input, please comment on this ticket:
https://issues.apache.org/jira/browse/CASSANDRA-8833

/Marcus

On Sat, Feb 21, 2015 at 7:29 PM, Mark Greene green...@gmail.com wrote:

 I saw in the NEWS.txt that this has been disabled.

 Does anyone know why that was the case? Is it temporary just for the 2.1.3
 release?

 Thanks,
 Mark Greene



Re: run cassandra on a small instance

2015-02-22 Thread Ben Bromhead
You might also see some gains from setting in_memory_compaction_limit_in_mb
to something very low to force Cassandra to use on-disk compaction rather
than doing it in memory.
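
For example, in a 2.0-era cassandra.yaml (the exact value here is just an
illustration):

    in_memory_compaction_limit_in_mb: 1

Rows larger than this limit then fall back to the slower two-pass on-disk
compaction path instead of being compacted in memory.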

On 23 February 2015 at 14:12, Tim Dunphy bluethu...@gmail.com wrote:

 Nate,

  Definitely thank you for this advice. After leaving the new Cassandra
 node running on the 2GB instance for the past couple of days, I think I've
 had ample reason to report complete success in getting it stabilized on
 that instance! Here are the changes I've been able to make:

  I think manipulating the key cache, concurrent writes, and some of the
  other settings I worked on based on that thread from the cassandra list
  was definitely key in getting Cassandra to work on the new instance.

 Check out the before and after (before working/ after working):

 Before in cassandra-env.sh:
MAX_HEAP_SIZE=800M
HEAP_NEWSIZE=200M

 After:
 MAX_HEAP_SIZE=512M
 HEAP_NEWSIZE=100M

 And before in the cassandra.yaml file:

concurrent_writes: 32
compaction_throughput_mb_per_sec: 16
key_cache_size_in_mb:
key_cache_save_period: 14400
# native_transport_max_threads: 128

 And after:

 concurrent_writes: 2
 compaction_throughput_mb_per_sec: 8
 key_cache_size_in_mb: 4
 key_cache_save_period: 0
 native_transport_max_threads: 4


 That really made the difference. I'm a puppet user, so these changes are
 in puppet. So any new 2GB instances I should bring up on Digital Ocean
  should absolutely work the way the first 2GB node does there. But I was
 able to make enough sense of your chef recipe to adapt what you were
 showing me.

 Thanks again!
 Tim

 On Fri, Feb 20, 2015 at 10:31 PM, Tim Dunphy bluethu...@gmail.com wrote:

 The most important things to note:
 - don't include JNA (it needs to lock pages larger than what will be
 available)
 - turn down threadpools for transports
 - turn compaction throughput way down
 - make concurrent reads and writes very small
  I have used the above to run a healthy 5-node cluster locally in its own
  private network with a 6th monitoring server for light to moderate local
  testing in 16 GB of laptop RAM. YMMV, but it is possible.


 Thanks!! That was very helpful. I just tried applying your suggestions to
  my cassandra.yaml file. I used the info from your chef recipe. Well, like
  I've been saying, it typically takes about 5 hours or so for this situation
  to shake itself out. I'll provide an update to the list once I have a
 better idea of how this is working.

 Thanks again!
 Tim

 On Fri, Feb 20, 2015 at 9:37 PM, Nate McCall n...@thelastpickle.com
 wrote:

 I frequently test with multi-node vagrant-based clusters locally. The
 following chef attributes should give you an idea of what to turn down in
 cassandra.yaml and cassandra-env.sh to build a decent testing cluster:

    :cassandra => {'cluster_name' => 'VerifyCluster',
                   'package_name' => 'dsc20',
                   'version' => '2.0.11',
                   'release' => '1',
                   'setup_jna' => false,
                   'max_heap_size' => '512M',
                   'heap_new_size' => '100M',
                   'initial_token' => server['initial_token'],
                   'seeds' => '192.168.33.10',
                   'listen_address' => server['ip'],
                   'broadcast_address' => server['ip'],
                   'rpc_address' => server['ip'],
                   'concurrent_reads' => 2,
                   'concurrent_writes' => 2,
                   'memtable_flush_queue_size' => 2,
                   'compaction_throughput_mb_per_sec' => 8,
                   'key_cache_size_in_mb' => 4,
                   'key_cache_save_period' => 0,
                   'native_transport_min_threads' => 2,
                   'native_transport_max_threads' => 4,
                   'notify_restart' => true,
                   'reporter' => {
                     'riemann' => {
                       'enable' => true,
                       'host' => '192.168.33.51'
                     },
                     'graphite' => {
                       'enable' => true,
                       'host' => '192.168.33.51'
                     }
                   }
                  },

 The most important things to note:
 - don't include JNA (it needs to lock pages larger than what will be
 available)
 - turn down threadpools for transports
 - turn compaction throughput way down
 - make concurrent reads and writes very small

  I have used the above to run a healthy 5-node cluster locally in its own
  private network with a 6th monitoring server for light to moderate local
  testing in 16 GB of laptop RAM. YMMV, but it is possible.




 --
 GPG me!!

 gpg --keyserver pool.sks-keyservers.net --recv-keys F186197B




 --

Re: run cassandra on a small instance

2015-02-22 Thread Tim Dunphy
Nate,

 Definitely thank you for this advice. After leaving the new Cassandra node
running on the 2GB instance for the past couple of days, I think I've had
ample reason to report complete success in getting it stabilized on that
instance! Here are the changes I've been able to make:

 I think manipulating the key cache, concurrent writes, and some of the
other settings I worked on based on that thread from the cassandra list was
definitely key in getting Cassandra to work on the new instance.

Check out the before and after (before working/ after working):

Before in cassandra-env.sh:
   MAX_HEAP_SIZE=800M
   HEAP_NEWSIZE=200M

After:
MAX_HEAP_SIZE=512M
HEAP_NEWSIZE=100M

And before in the cassandra.yaml file:

   concurrent_writes: 32
   compaction_throughput_mb_per_sec: 16
   key_cache_size_in_mb:
   key_cache_save_period: 14400
   # native_transport_max_threads: 128

And after:

concurrent_writes: 2
compaction_throughput_mb_per_sec: 8
key_cache_size_in_mb: 4
key_cache_save_period: 0
native_transport_max_threads: 4


That really made the difference. I'm a puppet user, so these changes are in
puppet. So any new 2GB instances I should bring up on Digital Ocean should
absolutely work the way the first 2GB node does there. But I was able to
make enough sense of your chef recipe to adapt what you were showing me.

Thanks again!
Tim

On Fri, Feb 20, 2015 at 10:31 PM, Tim Dunphy bluethu...@gmail.com wrote:

 The most important things to note:
 - don't include JNA (it needs to lock pages larger than what will be
 available)
 - turn down threadpools for transports
 - turn compaction throughput way down
 - make concurrent reads and writes very small
  I have used the above to run a healthy 5-node cluster locally in its own
  private network with a 6th monitoring server for light to moderate local
  testing in 16 GB of laptop RAM. YMMV, but it is possible.


 Thanks!! That was very helpful. I just tried applying your suggestions to
 my cassandra.yaml file. I used the info from your chef recipe. Well, like
 I've been saying, it typically takes about 5 hours or so for this situation
 to shake itself out. I'll provide an update to the list once I have a
 better idea of how this is working.

 Thanks again!
 Tim

 On Fri, Feb 20, 2015 at 9:37 PM, Nate McCall n...@thelastpickle.com
 wrote:

 I frequently test with multi-node vagrant-based clusters locally. The
 following chef attributes should give you an idea of what to turn down in
 cassandra.yaml and cassandra-env.sh to build a decent testing cluster:

    :cassandra => {'cluster_name' => 'VerifyCluster',
                   'package_name' => 'dsc20',
                   'version' => '2.0.11',
                   'release' => '1',
                   'setup_jna' => false,
                   'max_heap_size' => '512M',
                   'heap_new_size' => '100M',
                   'initial_token' => server['initial_token'],
                   'seeds' => '192.168.33.10',
                   'listen_address' => server['ip'],
                   'broadcast_address' => server['ip'],
                   'rpc_address' => server['ip'],
                   'concurrent_reads' => 2,
                   'concurrent_writes' => 2,
                   'memtable_flush_queue_size' => 2,
                   'compaction_throughput_mb_per_sec' => 8,
                   'key_cache_size_in_mb' => 4,
                   'key_cache_save_period' => 0,
                   'native_transport_min_threads' => 2,
                   'native_transport_max_threads' => 4,
                   'notify_restart' => true,
                   'reporter' => {
                     'riemann' => {
                       'enable' => true,
                       'host' => '192.168.33.51'
                     },
                     'graphite' => {
                       'enable' => true,
                       'host' => '192.168.33.51'
                     }
                   }
                  },

 The most important things to note:
 - don't include JNA (it needs to lock pages larger than what will be
 available)
 - turn down threadpools for transports
 - turn compaction throughput way down
 - make concurrent reads and writes very small

  I have used the above to run a healthy 5-node cluster locally in its own
  private network with a 6th monitoring server for light to moderate local
  testing in 16 GB of laptop RAM. YMMV, but it is possible.




 --
 GPG me!!

 gpg --keyserver pool.sks-keyservers.net --recv-keys F186197B




-- 
GPG me!!

gpg --keyserver pool.sks-keyservers.net --recv-keys F186197B