Re: C* 2.1.2 invokes oom-killer

2015-02-23 Thread Michał Łowicki
After a couple of days it's still behaving fine. Case closed.

On Thu, Feb 19, 2015 at 11:15 PM, Michał Łowicki mlowi...@gmail.com wrote:

 Upgrading to 2.1.3 seems to have helped so far. After ~12 hours, total
 memory consumption grew from 10GB to 10.5GB.

 On Thu, Feb 19, 2015 at 2:02 PM, Carlos Rolo r...@pythian.com wrote:

 Then you are probably hitting a bug... I'm trying to find it in Jira. The
 bad news is that the fix is only due to be released in 2.1.4. Once I find
 it I will post it here.

 Regards,

 Carlos Juzarte Rolo
 Cassandra Consultant

 Pythian - Love your data

 rolo@pythian | Twitter: cjrolo | LinkedIn: linkedin.com/in/carlosjuzarterolo
 Tel: 1649
 www.pythian.com

 On Thu, Feb 19, 2015 at 12:16 PM, Michał Łowicki mlowi...@gmail.com
 wrote:

 trickle_fsync has been enabled for a long time in our settings (just
 noticed):

 trickle_fsync: true

 trickle_fsync_interval_in_kb: 10240

 On Thu, Feb 19, 2015 at 12:12 PM, Michał Łowicki mlowi...@gmail.com
 wrote:



 On Thu, Feb 19, 2015 at 11:02 AM, Carlos Rolo r...@pythian.com wrote:

 Do you have trickle_fsync enabled? Try to enable that and see if it
 solves your problem, since you are running out of non-heap memory.

 Another question: is it always the same nodes that die? Or is it 2 out of 4
 that die?


 Always the same nodes. Upgraded to 2.1.3 two hours ago, so we'll monitor
 whether the issue has been fixed there. If not, we will try to enable
 trickle_fsync.



 Regards,

 Carlos Juzarte Rolo
 Cassandra Consultant

 Pythian - Love your data

 rolo@pythian | Twitter: cjrolo | LinkedIn: linkedin.com/in/carlosjuzarterolo
 Tel: 1649
 www.pythian.com

 On Thu, Feb 19, 2015 at 10:49 AM, Michał Łowicki mlowi...@gmail.com
 wrote:



 On Thu, Feb 19, 2015 at 10:41 AM, Carlos Rolo r...@pythian.com
 wrote:

 So compaction doesn't seem to be your problem (you can check with
 nodetool compactionstats just to be sure).


 pending tasks: 0



 How much is your write latency on your column families? I had OOM
 related to this before, and there was a tipping point around 70ms.


 Write request latency is below 0.05 ms/op (avg). Checked with
 OpsCenter.
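
 For reference, the same figure should also be visible without OpsCenter via
 nodetool (the keyspace and table names here are made up):

     nodetool cfstats my_keyspace.my_table | grep -i latency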



-- 
BR,
Michał Łowicki


Commitlog activities

2015-02-23 Thread ssiv...@gmail.com

Hi!

I have the following keyspaces:

cqlsh> SELECT * FROM system.schema_keyspaces;

 keyspace_name | durable_writes | strategy_class                               | strategy_options
---------------+----------------+----------------------------------------------+----------------------------
 system        |           True | org.apache.cassandra.locator.LocalStrategy   | {}
 system_traces |          False | org.apache.cassandra.locator.SimpleStrategy  | {"replication_factor":"2"}
 a1_ks         |          False | org.apache.cassandra.locator.SimpleStrategy  | {"replication_factor":"1"}
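
For context, a1_ks was created with durable writes disabled, i.e. with
something like:

    CREATE KEYSPACE a1_ks
      WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
      AND durable_writes = false;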


I have two disks. Data directory is on sda. Commitlog is on sdb.
I do 100% writes into a1_ks.user_table.

Watching IO activity, I noticed that C* writes something (mutations) to the
commitlog. That's strange, because I disabled durable writes for a1_ks.
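
For reference, I'm watching the disks roughly like this:

    iostat -x sda sdb 5    # data directory on sda, commitlog on sdb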

Maybe it's system-keyspace activity that is being flushed to the commitlog?


--
Thanks,
Serj



Re: Why no virtual nodes for Cassandra on EC2?

2015-02-23 Thread Clint Kelly
Hi mck,

I'm not familiar with this ticket, but my understanding was that
performance of Hadoop jobs on C* clusters with vnodes was poor because a
given Hadoop input split has to run many individual scans (one for each
vnode) rather than just a single scan.  I've run C* and Hadoop in
production with a custom input format that used vnodes (and just combined
multiple vnodes in a single input split) and didn't have any issues (the
jobs had many other performance bottlenecks besides starting multiple scans
from C*).

This is one of the videos where I recall an off-hand mention of the Spark
connector working with vnodes: https://www.youtube.com/watch?v=1NtnrdIUlg0

Best regards,
Clint




On Sat, Feb 21, 2015 at 2:58 PM, mck m...@apache.org wrote:

 At least the problem of hadoop and vnodes described in CASSANDRA-6091
 doesn't apply to spark.
  (Spark already allows multiple token ranges per split).

 If this is the reason why DSE hasn't enabled vnodes then fingers crossed
 that'll change soon.


  Some of the DataStax videos that I watched discussed how the Cassandra
  Spark connector has optimizations to deal with vnodes.


 Are these videos public? If so, got any links to them?

 ~mck



Re: Running Cassandra + Spark on AWS - architecture questions

2015-02-23 Thread Clint Kelly
These are both good suggestions, thanks!

I thought I had remembered reading that different virtual datacenters
should always have the same number of nodes.  I think I was mistaken about
that.  In the past we had major issues running huge analytics jobs on data
stored in HBase (it would bring down our real-time performance), so this
capability of Cassandra is great!

Best regards,
Clint


On Sun, Feb 22, 2015 at 8:02 AM, Eric Stevens migh...@gmail.com wrote:

 I'm not sure if this is a good use case for you, but you might also
 consider setting up several keyspaces, one for any data you want available
 for analytics (such as business object tables), and one for data you don't
 want to do analytics on (such as custom secondary indices).  Maybe a third
 one for data which should only exist in the analytics space, such as for
 temporary rollup data.

 This can reduce the amount of data you replicate into your analytics
 space, and allow you to run a smaller analytics cluster than your
 production cluster.

 On Fri, Feb 20, 2015 at 2:43 PM, DuyHai Doan doanduy...@gmail.com wrote:

 Cassandra would take care of keeping the data synced between the two
 sets of five nodes.  Is that correct?

 Correct

 But doing so means that we need 2x as many nodes as we need for the
 real-time cluster alone

 Not necessarily. With multi-DC you can configure the replication factor
 per DC, meaning that you can have RF=3 for the real-time DC and RF=1 or
 RF=2 for the analytics DC. Thus the number of nodes can be different for
 each DC.
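
 For example (the keyspace and datacenter names are placeholders):

     CREATE KEYSPACE my_ks
       WITH replication = {'class': 'NetworkTopologyStrategy',
                           'DC_realtime': 3,
                           'DC_analytics': 2};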

 In addition, you can also tune the hardware. If the real-time DC is mostly
 write-only for incoming data and read-only from aggregated tables, it is
 less IO-intensive than the analytics DC, which does lots of reads & writes
 to compute aggregations.



 On Fri, Feb 20, 2015 at 10:17 PM, Clint Kelly clint.ke...@gmail.com
 wrote:

 Hi all,

 I read the DSE 4.6 documentation and I'm still not 100% sure what a
 mixed workload Cassandra + Spark installation would look like, especially
 on AWS.  What I gather is that you use OpsCenter to set up the following:


- One virtual data center for real-time processing (e.g.,
ingestion of time-series data, replying to requests for an interactive
application)
- Another virtual data center for batch analytics (Spark, possibly
for machine learning)


 If I understand this correctly, if I estimate that I need a five-node
 cluster to handle all of my data, under the system described above, I would
 have five nodes serving real-time traffic and all of the data replicated in
 another five nodes that I use for batch processing.  Cassandra would take
 care of keeping the data synced between the two sets of five nodes.  Is
 that correct?

 I assume the motivation for such a dual-virtual-data-center architecture
 is to prevent the Spark jobs (which are going to do lots of scans from
 Cassandra, and maybe run computation on the same machines hosting
 Cassandra) from disrupting the real-time performance.  But doing so means
 that we need 2x as many nodes as we need for the real-time cluster alone.

 Could someone confirm that my interpretation above of what I read in the
 DSE documentation is correct?

 If my application needs to run analytics on Spark only a few hours a day,
 might we be better off spending our money on a bigger Cassandra cluster and
 then just spinning up Spark jobs on EMR for a few hours at night?  (I know
 this is a hard question to answer, since it all depends on the
 application---just curious if anyone else here has had to make similar
 tradeoffs.)  E.g., maybe instead of having a five-node real-time cluster,
 we would have an eight-node real-time cluster, and use our remaining budget
 on EMR jobs.

 I am curious if anyone has any thoughts / experience about this.

 Best regards,
 Clint






Any notion of unions in C* user-defined types?

2015-02-23 Thread Clint Kelly
Hi all,

I am building an application that keeps a time-series record of clickstream
data (clicks, impressions, etc.).  The data model looks something like:

CREATE TABLE clickstream (
  userid text,
  event_time timestamp,
  interaction frozen<interaction_type>,
  PRIMARY KEY (userid, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

I would like to create a user-defined type interaction_type such that it
can be different depending on whether the interaction was a click, view,
etc.

Previously we encoded such data with Avro, using Avro's unions (
http://avro.apache.org/docs/1.7.5/idl.html#unions) and encoded the data as
blobs.  I was hoping to get away from blobs now that we have UDTs in
Cassandra 2.1, but I don't see any support for unions.

Does anyone have any suggestions?  I think I may be better off just sticking
with Avro serialization.  :(
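
For reference, the closest I can get with a plain UDT is a type
discriminator plus optional per-type fields. A sketch, with invented field
names:

    CREATE TYPE interaction_type (
      kind text,             -- 'click', 'view', 'impression', ...
      click_target text,     -- populated only when kind = 'click'
      view_duration_ms int   -- populated only when kind = 'view'
    );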

Best regards,
Clint


Re: Why no virtual nodes for Cassandra on EC2?

2015-02-23 Thread mck

 … my understanding was that
 performance of Hadoop jobs on C* clusters with vnodes was poor because a
 given Hadoop input split has to run many individual scans (one for each
 vnode) rather than just a single scan.  I've run C* and Hadoop in
 production with a custom input format that used vnodes (and just combined
 multiple vnodes in a single input split) and didn't have any issues (the
 jobs had many other performance bottlenecks besides starting multiple
 scans from C*).

You've described the ticket, and how it has been solved :-)

 This is one of the videos where I recall an off-hand mention of the Spark
 connector working with vnodes:
 https://www.youtube.com/watch?v=1NtnrdIUlg0

Thanks.

~mck


Re: Why no virtual nodes for Cassandra on EC2?

2015-02-23 Thread Eric Stevens
Vnodes are officially recommended against for DSE Solr integration (though a
small number isn't ruinous). That might be why they still aren't enabled by
default.
On Feb 21, 2015 3:58 PM, mck m...@apache.org wrote:

 At least the problem of hadoop and vnodes described in CASSANDRA-6091
 doesn't apply to spark.
  (Spark already allows multiple token ranges per split).

 If this is the reason why DSE hasn't enabled vnodes then fingers crossed
 that'll change soon.


  Some of the DataStax videos that I watched discussed how the Cassandra
 Spark connecter has
  optimizations to deal with vnodes.


 Are these videos public? if so got any link to them?

 ~mck



Re: Why no virtual nodes for Cassandra on EC2?

2015-02-23 Thread Jack Krupansky
DSE 4.6 improved Solr vnode performance dramatically, so vnodes for Search
workloads are now no longer officially discouraged. As per the official doc
on improvements: "Ability to use virtual nodes (vnodes) in Solr nodes.
Recommended range: 64 to 256 (overhead increases by approximately 30%)". A
vnode token count of 64 or 32 would reduce that overhead further. And... the
new 4.6 feature of being able to direct a Solr query to a specific partition
essentially eliminates that overhead entirely.

-- Jack Krupansky

On Mon, Feb 23, 2015 at 11:23 AM, Eric Stevens migh...@gmail.com wrote:

 Vnodes is officially disrecommended for DSE Solr integration (though a
 small number isn't ruinous). That might be why they still don't enable them
 by default.
 On Feb 21, 2015 3:58 PM, mck m...@apache.org wrote:

 At least the problem of hadoop and vnodes described in CASSANDRA-6091
 doesn't apply to spark.
  (Spark already allows multiple token ranges per split).

 If this is the reason why DSE hasn't enabled vnodes then fingers crossed
 that'll change soon.


  Some of the DataStax videos that I watched discussed how the Cassandra
 Spark connecter has
  optimizations to deal with vnodes.


 Are these videos public? if so got any link to them?

 ~mck




Re: Why no virtual nodes for Cassandra on EC2?

2015-02-23 Thread Jack Krupansky
Thanks for pointing out a mistake in the doc - that statement (for
Search/Solr) was simply a leftover from before 4.6. Besides, it's in the
Analytics section, which is not relevant for Search/Solr anyway.

-- Jack Krupansky

On Mon, Feb 23, 2015 at 11:54 AM, Eric Stevens migh...@gmail.com wrote:

 30% overhead is pretty brutal.  I think this is basic support for it, and
 not necessarily a recommendation to use it.

 From

 http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/ana/anaNdeOps.html?scroll=anaNdeOps__implicationsVnodes

 "DataStax does not recommend turning on vnodes for other Hadoop use
 cases or for Solr nodes, but you can use vnodes for any Cassandra-only
 cluster, or a Cassandra-only data center in a mixed Hadoop/Solr/Cassandra
 deployment. If you have enabled virtual nodes on Hadoop nodes, disable
 virtual nodes before using the cluster."


 On Mon, Feb 23, 2015 at 9:34 AM, Jack Krupansky jack.krupan...@gmail.com
 wrote:

 DSE 4.6 improved Solr vnode performance dramatically, so that vnodes for
 Search workloads is now no longer officially discouraged. As per the
 official doc for improvements, : *Ability to use virtual nodes (vnodes)
 in Solr nodes. Recommended range: 64 to 256 (overhead increases by
 approximately 30%)*. A vnode token count of 64 or 32 would reduce that
 overhead further. And... the new 4.6 feature of being able to direct a Solr
 query to a specific partition essentially eliminates that overhead entirely.

 -- Jack Krupansky

 On Mon, Feb 23, 2015 at 11:23 AM, Eric Stevens migh...@gmail.com wrote:

 Vnodes is officially disrecommended for DSE Solr integration (though a
 small number isn't ruinous). That might be why they still don't enable them
 by default.
 On Feb 21, 2015 3:58 PM, mck m...@apache.org wrote:

 At least the problem of hadoop and vnodes described in CASSANDRA-6091
 doesn't apply to spark.
  (Spark already allows multiple token ranges per split).

 If this is the reason why DSE hasn't enabled vnodes then fingers crossed
 that'll change soon.


  Some of the DataStax videos that I watched discussed how the
 Cassandra Spark connecter has
  optimizations to deal with vnodes.


 Are these videos public? if so got any link to them?

 ~mck






build failure with cassandra 2.0.12

2015-02-23 Thread Cheng Ren
Hi,
I am experiencing a build failure with Cassandra 2.0.12. I downloaded the
source from http://cassandra.apache.org/download/, ran "ant mvn-install", and
got the following error:

[artifact:dependencies] --
[artifact:dependencies] 1 required artifact is missing.
[artifact:dependencies]
[artifact:dependencies] for artifact:
[artifact:dependencies]
org.apache.cassandra:cassandra-coverage-deps:jar:2.0.12-SNAPSHOT
[artifact:dependencies]
[artifact:dependencies] from the specified remote repositories:
[artifact:dependencies]   central (http://repo1.maven.org/maven2)
[artifact:dependencies]
[artifact:dependencies]

BUILD FAILED
/Users/chengren/br/thirdparty/cassandra-2.0.12-br/build.xml:541: Unable to
resolve artifact: Missing:
--
1) com.sun:tools:jar:0

  Try downloading the file manually from the project website.

  Then, install it using the command:
  mvn install:install-file -DgroupId=com.sun -DartifactId=tools
-Dversion=0 -Dpackaging=jar -Dfile=/path/to/file

  Alternatively, if you host your own repository you can deploy the file
there:
  mvn deploy:deploy-file -DgroupId=com.sun -DartifactId=tools
-Dversion=0 -Dpackaging=jar -Dfile=/path/to/file -Durl=[url]
-DrepositoryId=[id]

  Path to dependency:
  1) org.apache.cassandra:cassandra-coverage-deps:jar:2.0.12-SNAPSHOT
  2) net.sourceforge.cobertura:cobertura:jar:2.0.3
  3) com.sun:tools:jar:0

--
1 required artifact is missing.

for artifact:
  org.apache.cassandra:cassandra-coverage-deps:jar:2.0.12-SNAPSHOT

from the specified remote repositories:
  central (http://repo1.maven.org/maven2)


This strikes me as really weird, since I could build successfully yesterday.
Did something change behind the scenes?
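
In case it helps anyone else, the workaround the error itself suggests is to
install the JDK's own tools.jar into the local Maven repository (path assumed
for a standard Linux JDK; on OS X the equivalent classes may live elsewhere):

    mvn install:install-file -DgroupId=com.sun -DartifactId=tools \
        -Dversion=0 -Dpackaging=jar -Dfile=$JAVA_HOME/lib/tools.jar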

Thanks!


Re: run cassandra on a small instance

2015-02-23 Thread Tim Dunphy

 You might also have some gains setting in_memory_compaction_limit_in_mb
 to something very low to force Cassandra to use on disk compaction rather
 than doing it in memory.


Cool, Ben... thanks, I'll add that to my config as well.

Glad that helped. Thanks for reporting back!


No problem, Nate! That's the least I can do. All I can hope is that this
thread adds to the overall fund of knowledge for the list.

Cheers,
Tim



On Mon, Feb 23, 2015 at 11:46 AM, Nate McCall n...@thelastpickle.com
wrote:

 Glad that helped. Thanks for reporting back!

 On Sun, Feb 22, 2015 at 9:12 PM, Tim Dunphy bluethu...@gmail.com wrote:

 Nate,

  Definitely thank you for this advice. After leaving the new Cassandra
 node running on the 2GB instance for the past couple of days, I think I've
 had ample reason to report complete success in getting it stabilized on
 that instance! Here are the changes I've been able to make:

  I think tuning the key cache, concurrent writes, and the other settings
 from that thread on the Cassandra list was key to getting Cassandra working
 on the new instance.

 Check out the before and after (before working/ after working):

 Before in cassandra-env.sh:
MAX_HEAP_SIZE=800M
HEAP_NEWSIZE=200M

 After:
 MAX_HEAP_SIZE=512M
 HEAP_NEWSIZE=100M

 And before in the cassandra.yaml file:

concurrent_writes: 32
compaction_throughput_mb_per_sec: 16
key_cache_size_in_mb:
key_cache_save_period: 14400
# native_transport_max_threads: 128

 And after:

 concurrent_writes: 2
 compaction_throughput_mb_per_sec: 8
 key_cache_size_in_mb: 4
 key_cache_save_period: 0
 native_transport_max_threads: 4


  That really made the difference. I'm a Puppet user, so these changes are
 in Puppet, and any new 2GB instances I bring up on DigitalOcean should work
 the way the first 2GB node does. I was able to make enough sense of your
 Chef recipe to adapt what you were showing me.

 Thanks again!
 Tim

 On Fri, Feb 20, 2015 at 10:31 PM, Tim Dunphy bluethu...@gmail.com
 wrote:

 The most important things to note:
 - don't include JNA (it needs to lock pages larger than what will be
 available)
 - turn down threadpools for transports
 - turn compaction throughput way down
 - make concurrent reads and writes very small
  I have used the above to run healthy 5-node clusters locally in their own
  private network, with a 6th monitoring server, for light to moderate local
  testing in 16GB of laptop RAM. YMMV, but it is possible.


  Thanks!! That was very helpful. I just tried applying your suggestions
  to my cassandra.yaml file, using the info from your Chef recipe. Well, like
  I've been saying, it typically takes about 5 hours or so for this situation
  to shake itself out. I'll provide an update to the list once I have a
  better idea of how this is working.

 Thanks again!
 Tim

 On Fri, Feb 20, 2015 at 9:37 PM, Nate McCall n...@thelastpickle.com
 wrote:

 I frequently test with multi-node vagrant-based clusters locally. The
 following chef attributes should give you an idea of what to turn down in
 cassandra.yaml and cassandra-env.sh to build a decent testing cluster:

   :cassandra => {'cluster_name' => 'VerifyCluster',
                  'package_name' => 'dsc20',
                  'version' => '2.0.11',
                  'release' => '1',
                  'setup_jna' => false,
                  'max_heap_size' => '512M',
                  'heap_new_size' => '100M',
                  'initial_token' => server['initial_token'],
                  'seeds' => '192.168.33.10',
                  'listen_address' => server['ip'],
                  'broadcast_address' => server['ip'],
                  'rpc_address' => server['ip'],
                  'concurrent_reads' => 2,
                  'concurrent_writes' => 2,
                  'memtable_flush_queue_size' => 2,
                  'compaction_throughput_mb_per_sec' => 8,
                  'key_cache_size_in_mb' => 4,
                  'key_cache_save_period' => 0,
                  'native_transport_min_threads' => 2,
                  'native_transport_max_threads' => 4,
                  'notify_restart' => true,
                  'reporter' => {
                    'riemann' => {
                      'enable' => true,
                      'host' => '192.168.33.51'
                    },
                    'graphite' => {
                      'enable' => true,
                      'host' => '192.168.33.51'
                    }
                  }
                 },

  The most important things to note:
  - don't include JNA (it needs to lock pages larger than what will be
  available)
  - turn down threadpools for transports
  - turn compaction throughput way down
  - make concurrent reads and writes very small

Re: Why no virtual nodes for Cassandra on EC2?

2015-02-23 Thread Eric Stevens
That link is the one from the 4.6 New Features page:
http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/newFeatures.html

   - Ability to use virtual nodes (vnodes)
     http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/ana/anaNdeOps.html#anaNdeOps__implicationsVnodes
     in Solr nodes. Recommended range: 64 to 256 (overhead increases by
     approximately 30%)

Anyway, thanks for clearing this up Jack.  This overhead is on queries
only, right?



On Mon, Feb 23, 2015 at 10:03 AM, Jack Krupansky jack.krupan...@gmail.com
wrote:

 Thanks for pointing out a mistake in the doc - that statement (for
 Search/Solr) was simply a leftover from before 4.6. Besides, it's in the
 Analytics section, which is not relevant for Search/Solr anyway.

 -- Jack Krupansky

 On Mon, Feb 23, 2015 at 11:54 AM, Eric Stevens migh...@gmail.com wrote:

 30% overhead is pretty brutal.  I think this is basic support for it, and
 not necessarily a recommendation to use it.

 From

 http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/ana/anaNdeOps.html?scroll=anaNdeOps__implicationsVnodes

 *DataStax does not recommend turning on vnodes *for other Hadoop use
 cases *or for Solr nodes*, but you can use vnodes for any Cassandra-only
 cluster, or a Cassandra-only data center in a mixed Hadoop/Solr/Cassandra
 deployment. If you have enabled virtual nodes on Hadoop nodes, disable
 virtual nodes before using the cluster.


 On Mon, Feb 23, 2015 at 9:34 AM, Jack Krupansky jack.krupan...@gmail.com
  wrote:

 DSE 4.6 improved Solr vnode performance dramatically, so that vnodes for
 Search workloads is now no longer officially discouraged. As per the
 official doc for improvements, : *Ability to use virtual nodes
 (vnodes) in Solr nodes. Recommended range: 64 to 256 (overhead increases by
 approximately 30%)*. A vnode token count of 64 or 32 would reduce that
 overhead further. And... the new 4.6 feature of being able to direct a Solr
 query to a specific partition essentially eliminates that overhead entirely.

 -- Jack Krupansky

 On Mon, Feb 23, 2015 at 11:23 AM, Eric Stevens migh...@gmail.com
 wrote:

 Vnodes is officially disrecommended for DSE Solr integration (though a
 small number isn't ruinous). That might be why they still don't enable them
 by default.
 On Feb 21, 2015 3:58 PM, mck m...@apache.org wrote:

 At least the problem of hadoop and vnodes described in CASSANDRA-6091
 doesn't apply to spark.
  (Spark already allows multiple token ranges per split).

 If this is the reason why DSE hasn't enabled vnodes then fingers
 crossed
 that'll change soon.


  Some of the DataStax videos that I watched discussed how the
 Cassandra Spark connecter has
  optimizations to deal with vnodes.


 Are these videos public? if so got any link to them?

 ~mck







Re: Why no virtual nodes for Cassandra on EC2?

2015-02-23 Thread Jack Krupansky
Right, and subject to the techniques for reducing that overhead that I
listed. In fact, I would recommend simply picking the largest number of
tokens for which the overhead is acceptable for your app, even if it is only
8 or 16 tokens, but 16, 32, or 64 may be sufficient for most apps.
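
Concretely, that's just the num_tokens setting in cassandra.yaml, e.g.:

    num_tokens: 32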

-- Jack Krupansky

On Mon, Feb 23, 2015 at 3:01 PM, Eric Stevens migh...@gmail.com wrote:

 That link is the one from the 4.6 New Features page:
 http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/newFeatures.html

- Ability to use virtual nodes (vnodes)

 http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/ana/anaNdeOps.html#anaNdeOps__implicationsVnodes
  in
Solr nodes. Recommended range: 64 to 256 (overhead increases by
approximately 30%)

 Anyway, thanks for clearing this up Jack.  This overhead is on queries
 only, right?



 On Mon, Feb 23, 2015 at 10:03 AM, Jack Krupansky jack.krupan...@gmail.com
  wrote:

 Thanks for pointing out a mistake in the doc - that statement (for
 Search/Solr) was simply a leftover from before 4.6. Besides, it's in the
 Analytics section, which is not relevant for Search/Solr anyway.

 -- Jack Krupansky

 On Mon, Feb 23, 2015 at 11:54 AM, Eric Stevens migh...@gmail.com wrote:

 30% overhead is pretty brutal.  I think this is basic support for it,
 and not necessarily a recommendation to use it.

 From

 http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/ana/anaNdeOps.html?scroll=anaNdeOps__implicationsVnodes

 *DataStax does not recommend turning on vnodes *for other Hadoop use
 cases *or for Solr nodes*, but you can use vnodes for any
 Cassandra-only cluster, or a Cassandra-only data center in a mixed
 Hadoop/Solr/Cassandra deployment. If you have enabled virtual nodes on
 Hadoop nodes, disable virtual nodes before using the cluster.


 On Mon, Feb 23, 2015 at 9:34 AM, Jack Krupansky 
 jack.krupan...@gmail.com wrote:

 DSE 4.6 improved Solr vnode performance dramatically, so that vnodes
 for Search workloads is now no longer officially discouraged. As per the
 official doc for improvements, : *Ability to use virtual nodes
 (vnodes) in Solr nodes. Recommended range: 64 to 256 (overhead increases by
 approximately 30%)*. A vnode token count of 64 or 32 would reduce
 that overhead further. And... the new 4.6 feature of being able to direct a
 Solr query to a specific partition essentially eliminates that overhead
 entirely.

 -- Jack Krupansky

 On Mon, Feb 23, 2015 at 11:23 AM, Eric Stevens migh...@gmail.com
 wrote:

 Vnodes is officially disrecommended for DSE Solr integration (though a
 small number isn't ruinous). That might be why they still don't enable 
 them
 by default.
 On Feb 21, 2015 3:58 PM, mck m...@apache.org wrote:

 At least the problem of hadoop and vnodes described in CASSANDRA-6091
 doesn't apply to spark.
  (Spark already allows multiple token ranges per split).

 If this is the reason why DSE hasn't enabled vnodes then fingers
 crossed
 that'll change soon.


  Some of the DataStax videos that I watched discussed how the
 Cassandra Spark connecter has
  optimizations to deal with vnodes.


 Are these videos public? if so got any link to them?

 ~mck








memtable_offheap_space_in_mb and memtable_cleanup_threshold

2015-02-23 Thread ssiv...@gmail.com

Hi everyone!

I have a write-only workload (into one column family) and am experimenting
with offheap_objects memtable space.


I set the parameters to:

    memtable_offheap_space_in_mb: 51200   # 50GB
    memtable_cleanup_threshold: 0.99

and expect that a flush will not be triggered until the memtable offheap
space reaches ~50GB. But flushes are triggered before that limit. The system
monitor shows only ~16GB in use at that moment (linux + jvm + heap + ...).


Why is this happening?

--
Thanks,
Serj



Problem with Cassandra 2.1 and Spark 1.2.1

2015-02-23 Thread Bosung Seo
Hi all,

I'm trying to use Spark and Cassandra.

I have two datacenters in different regions on AWS, and tried running a
simple table-count program.

However, I'm still getting "WARN TaskSchedulerImpl: Initial job has not
accepted any resources", and Spark can't finish processing.

The test table only has 571 rows and 2 small columns. I assume such a small
table doesn't require a lot of memory.

I also tried increasing cores and RAM in the Spark config files, but the
result is still the same.
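
One thing I still plan to try is explicitly capping what the job requests,
e.g. (the values are guesses meant to match what the workers advertise):

    val conf = new SparkConf(true)
      .set("spark.cassandra.connection.host", "172.17.10.44")
      .set("spark.cores.max", "2")           // no more cores than one worker offers
      .set("spark.executor.memory", "512m")  // match the 512.0 MB the workers grant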



scala> import com.datastax.spark.connector._
import com.datastax.spark.connector._

scala> import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.{SparkContext, SparkConf}

scala> val conf = new
SparkConf(true).set("spark.cassandra.connection.host",
"172.17.10.44").set("spark.cassandra.auth.username",
"masteruser").set("spark.cassandra.auth.password", "password")
conf: org.apache.spark.SparkConf = org.apache.spark.SparkConf@1cfffdf3

scala> val sc = new SparkContext("spark://172.17.10.182:7077", "test", conf)
15/02/23 21:56:21 INFO SecurityManager: Changing view acls to: root
15/02/23 21:56:21 INFO SecurityManager: Changing modify acls to: root
15/02/23 21:56:21 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(root); users
with modify permissions: Set(root)
15/02/23 21:56:21 INFO Slf4jLogger: Slf4jLogger started
15/02/23 21:56:21 INFO Remoting: Starting remoting
15/02/23 21:56:21 INFO Utils: Successfully started service 'sparkDriver' on
port 41709.
15/02/23 21:56:21 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://sparkDriver@ip-172-17-10-182:41709]
15/02/23 21:56:21 INFO SparkEnv: Registering MapOutputTracker
15/02/23 21:56:21 INFO SparkEnv: Registering BlockManagerMaster
15/02/23 21:56:21 INFO DiskBlockManager: Created local directory at
/srv/spark/tmp/spark-9f50ea1b-e8eb-4cb8-8f48-d04e3ec525a2/spark-61a2d7fa-697e-4a61-80af-c3d72149f244
15/02/23 21:56:21 INFO MemoryStore: MemoryStore started with capacity 534.5
MB
15/02/23 21:56:21 INFO HttpFileServer: HTTP File server directory is
/srv/spark/tmp/spark-1c34ed81-1ea9-45b1-81dd-184f12b975f6/spark-7c001536-1b70-40ea-9013-14551ad05a29
15/02/23 21:56:21 INFO HttpServer: Starting HTTP Server
15/02/23 21:56:21 INFO Utils: Successfully started service 'HTTP file
server' on port 51439.
15/02/23 21:56:21 INFO Utils: Successfully started service 'SparkUI' on
port 4040.
15/02/23 21:56:21 INFO SparkUI: Started SparkUI at http://52.10.105.190:4040
15/02/23 21:56:21 INFO SparkContext: Added JAR
file:/home/ubuntu/spark-cassandra-connector/spark-cassandra-connector/spark-cassandra-connector/target/scala-2.10/spark-cassandra-connector-assembly-1.2.0-SNAPSHOT.jar
at
http://172.17.10.182:51439/jars/spark-cassandra-connector-assembly-1.2.0-SNAPSHOT.jar
with timestamp 1424728581916
15/02/23 21:56:21 INFO AppClient$ClientActor: Connecting to master
spark://172.17.10.182:7077...
15/02/23 21:56:21 INFO SparkDeploySchedulerBackend: Connected to Spark
cluster with app ID app-20150223215621-0010
15/02/23 21:56:21 INFO NettyBlockTransferService: Server created on 45474
15/02/23 21:56:21 INFO BlockManagerMaster: Trying to register BlockManager
15/02/23 21:56:21 INFO BlockManagerMasterActor: Registering block manager
ip-172-17-10-182:45474 with 534.5 MB RAM, BlockManagerId(driver,
ip-172-17-10-182, 45474)
15/02/23 21:56:21 INFO BlockManagerMaster: Registered BlockManager
15/02/23 21:56:22 INFO AppClient$ClientActor: Executor added:
app-20150223215621-0010/0 on worker-20150223191054-ip-172-17-10-45-9000
(ip-172-17-10-45:9000) with 2 cores
15/02/23 21:56:22 INFO SparkDeploySchedulerBackend: Granted executor ID
app-20150223215621-0010/0 on hostPort ip-172-17-10-45:9000 with 2 cores,
512.0 MB RAM
15/02/23 21:56:22 INFO AppClient$ClientActor: Executor added:
app-20150223215621-0010/1 on worker-20150223191054-ip-172-17-10-47-9000
(ip-172-17-10-47:9000) with 2 cores
15/02/23 21:56:22 INFO SparkDeploySchedulerBackend: Granted executor ID
app-20150223215621-0010/1 on hostPort ip-172-17-10-47:9000 with 2 cores,
512.0 MB RAM
15/02/23 21:56:22 INFO AppClient$ClientActor: Executor added:
app-20150223215621-0010/2 on worker-20150223191055-ip-172-17-10-46-9000
(ip-172-17-10-46:9000) with 2 cores
15/02/23 21:56:22 INFO SparkDeploySchedulerBackend: Granted executor ID
app-20150223215621-0010/2 on hostPort ip-172-17-10-46:9000 with 2 cores,
512.0 MB RAM
15/02/23 21:56:22 INFO AppClient$ClientActor: Executor added:
app-20150223215621-0010/3 on worker-20150223191051-ip-172-17-10-44-9000
(ip-172-17-10-44:9000) with 2 cores
15/02/23 21:56:22 INFO SparkDeploySchedulerBackend: Granted executor ID
app-20150223215621-0010/3 on hostPort 

Re: AMI to use to launch a cluster with OpsCenter on AWS

2015-02-23 Thread Carlos Rolo
Regarding AWS, the only thing I normally do (besides the normal
installation, etc.) is set up the firewall zones so that the ports needed
for Cassandra are open.

You can follow this guide:
https://razvantudorica.com/02/create-a-cassandra-cluster-with-opscenter-on-amazon-ec2/a

Regards,

Carlos Juzarte Rolo
Cassandra Consultant

Pythian - Love your data

rolo@pythian | Twitter: cjrolo | LinkedIn: linkedin.com/in/carlosjuzarterolo
Tel: 1649
www.pythian.com

On Sat, Feb 21, 2015 at 4:48 AM, Clint Kelly clint.ke...@gmail.com wrote:

 BTW I was able to use this script:

 https://github.com/joaquincasares/cassandralauncher

 to get a cluster up and running pretty easily on AWS.  Cheers to the
 author for this.

 Still curious for answers to my questions above, but not as urgent.

 Best regards,
 Clint


 On Fri, Feb 20, 2015 at 5:36 PM, Clint Kelly clint.ke...@gmail.com
 wrote:

 Hi all,

 I am trying to follow the instructions here for installing DSE 4.6 on AWS:


 http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/install/installAMIOpsc.html

  I was successful in creating a single-node instance running OpsCenter,
  which I intended to use to bootstrap a larger cluster running Cassandra and
  Spark.

 During my first attempt, however, OpsCenter reported problems talking to
 agents in the new cluster I was creating.  I ssh'ed into one of the new
 instances that I created with OpsCenter and saw that this was the problem:

 

 DataStax AMI for DataStax Enterprise

 and DataStax Community

 AMI version 2.4



 

 DataStax AMI 2.5 released 02.25.2014

 http://goo.gl/g1RRd7


 This AMI (version 2.4) will be left

 available, but no longer updated.

 



 These notices occurred during the startup of this instance:

 [ERROR] 02/21/15-00:53:01 sudo chown -R cassandra:cassandra
 /mnt/cassandra:

 [WARN] Permissions not set correctly. Please run manually:

 [WARN] sudo chown -hR cassandra:cassandra /mnt/cassandra

 [WARN] sudo service dse restart


 It looks like by default, the OpsCenter GUI selects an out-of-date AMI
 (ami-4c32ba7c)
 when you click on Create Cluster and attempt to create a brand-new
 cluster on EC2.

  What is the recommended image to use here?  I found version 2.5.1 of
  the autoclustering AMI (
  http://thecloudmarket.com/image/ami-ada2b6c4--datastax-auto-clustering-ami-2-5-1-hvm).
  Is that correct?  Or should I be using one of the regular AMIs listed at
  http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/install/installAMIOpsc.html
  ?  Or just a standard Ubuntu image?

  FWIW, I tried just using one of the AMIs listed on the DSE 4.6 page
  (ami-32f7c977), and I still see the "Waiting for the agent to start"
  message, although if I log in, things look like they have kind of worked:

 Cluster started with these options:

 None


 Raiding complete.


 Waiting for nodetool...

 The cluster is now in it's finalization phase. This should only take a
 moment...


 Note: You can also use CTRL+C to view the logs if desired:

 AMI log: ~/datastax_ami/ami.log

 Cassandra log: /var/log/cassandra/system.log



 Datacenter: us-west-2

 =

 Status=Up/Down

 |/ State=Normal/Leaving/Joining/Moving

 --  Address  Load   Tokens  Owns (effective)  Host ID
   Rack

 UN  10.28.24.19  51.33 KB   256 100.0%
  ce0d365a-d58b-4700-b861-9f30af400476  2a



 Opscenter: http://ec2-54-71-102-180.us-west-2.compute.amazonaws.com:/

 Please wait 60 seconds if this is the cluster's first start...



 Tools:

 Run: datastax_tools

 Demos:

 Run: datastax_demos

 Support:

 Run: datastax_support



 

 DataStax AMI for DataStax Enterprise

 and DataStax Community

 AMI version 2.5

 DataStax Community version 2.1.3-1


 



 These notices occurred during the startup of this instance:

 [ERROR] 02/21/15-01:18:55 sudo chown opscenter-agent:opscenter-agent
 /var/lib/datastax-agent/conf:

 [ERROR] 02/21/15-01:19:04 sudo chown -R opscenter-agent:opscenter-agent
 /var/log/datastax-agent:

 [ERROR] 02/21/15-01:19:04 sudo chown -R opscenter-agent:opscenter-agent
 /mnt/datastax-agent:


 I would appreciate any help...  I assume what I'm trying to do here is
 pretty common.










Re: One node taking more resources than others in the ring

2015-02-23 Thread Jonathan Haddad
If you're not using prepared statements you won't get any token-aware
routing. That's an even better option than round-robin, since it reduces the
number of nodes involved.
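
A minimal sketch of what that looks like with the DataStax Java driver
(written in Scala here; the contact point, keyspace, and query are invented):

    import com.datastax.driver.core.Cluster
    import com.datastax.driver.core.policies.{DCAwareRoundRobinPolicy, TokenAwarePolicy}

    // Token-aware routing wraps another policy and, for prepared statements,
    // sends each request straight to a replica that owns the partition key.
    val cluster = Cluster.builder()
      .addContactPoint("10.0.0.1")
      .withLoadBalancingPolicy(new TokenAwarePolicy(new DCAwareRoundRobinPolicy()))
      .build()
    val session = cluster.connect("my_ks")

    val ps = session.prepare("SELECT * FROM users WHERE userid = ?")
    session.execute(ps.bind("some-user"))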
On Mon, Feb 23, 2015 at 4:48 PM Robert Coli rc...@eventbrite.com wrote:

 On Mon, Feb 23, 2015 at 3:42 PM, Jaydeep Chovatia 
 chovatia.jayd...@gmail.com wrote:

 I have created different tables and my test application reads/writes with
 CL=QUORUM. Under load I found that my one node is taking more
 resources (double CPU) than the other two. I have also verified that
 there is no other process causing this problem.


 My bold prediction is that you are sending all client connections to this
 node. Don't do that, round-robin them.

 =Rob




Efficient .net client for cassandra

2015-02-23 Thread Asit KAUSHIK
Hi All,

We have been evaluating Stratio Cassandra for our case-specific full-text
search. It has a modified secondary-index API which uses Lucene indices. The
performance also seems good to me. Still, I wanted to ask you gurus:

1) Has anybody used Stratio, and are there any drawbacks to it?
2) We are using .NET as the client to extract data, which lacks performance.
I am using traditional connection pooling and then executing prepared
statements. So anybody who is using any specific client for .NET, please
help me with this.

Thanks in advance for the help

Thanks and Regards
Asit

On Mon, Feb 23, 2015 at 1:09 PM, Asit KAUSHIK asitkaushikno...@gmail.com
wrote:

 Hi All,

  We have been evaluating Stratio Cassandra for our case-specific full-text
  search. It has a modified secondary-index API which uses Lucene indices.
  The performance also seems good to me. Still, I wanted to ask you gurus:

  1) Has anybody used Stratio, and are there any drawbacks to it?
  2) We are using .NET as the client to extract data, which lacks
  performance. I am using traditional connection pooling and then executing
  prepared statements. So anybody who is using any specific client for .NET,
  please help me with this.

 Thanks in advance for the help

 Thanks and Regards
 Asit



Re: One node taking more resources than others in the ring

2015-02-23 Thread Robert Coli
On Mon, Feb 23, 2015 at 3:42 PM, Jaydeep Chovatia 
chovatia.jayd...@gmail.com wrote:

 I have created different tables and my test application reads/writes with
 CL=QUORUM. Under load I found that my one node is taking more
 resources (double CPU) than the other two. I have also verified that there
 is no other process causing this problem.


My bold prediction is that you are sending all client connections to this
node. Don't do that, round-robin them.

=Rob


Re: One node taking more resources than others in the ring

2015-02-23 Thread Robert Coli
On Mon, Feb 23, 2015 at 5:18 PM, Jonathan Haddad j...@jonhaddad.com wrote:

 If you're not using prepared statements you won't get any token aware
 routing. That's an even better option than round robin since it reduces the
 number of nodes involved.


Fair statement. The thrust of my comment is: don't send all connections to
that node. :D

=Rob


One node taking more resources than others in the ring

2015-02-23 Thread Jaydeep Chovatia
Hi,

I have a three-node cluster with RF=1 (only one datacenter) with the
following layout:

Datacenter: DC1
===
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  AddressLoad   Tokens  Owns   Host ID
Rack
UN  IP1  4.02 GB1   33.3%  ID1  RAC1
UN  IP2  4.05 GB1   33.3%  ID2  RAC2
UN  IP3  4.05 GB1   33.3%  ID3  RAC3

I have created different tables, and my test application reads/writes with
CL=QUORUM. Under load I found that one of my nodes is using more resources
(double the CPU) than the other two. I have also verified that there is no
other process causing this.
My hardware configuration is the same on all nodes: Linux, 64-bit, 24 cores,
64GB RAM, 1TB disk.
My Cassandra version is 2.0 with JDK 1.7.

Jaydeep