Re: Cassandra 2.1.0 Crashes the JVM with OOM with heaps of memory free

2014-12-19 Thread 谢良
What's your vm.max_map_count setting?


Best Regards,

Liang


From: Leon Oosterwijk leon.oosterw...@macquarie.com
Sent: December 19, 2014 11:55
To: user@cassandra.apache.org
Subject: Cassandra 2.1.0 Crashes the JVM with OOM with heaps of memory free

All,

We have a Cassandra cluster which seems to be struggling a bit. I have one node 
which crashes continually, and others which crash sporadically. When they crash 
it's with a JVM "couldn't allocate memory" error, even though there's heaps of 
memory available. I suspect it's because of one table which is very big (500GB) 
and has on the order of 500K-700K files in its directory. When I deleted the 
directory contents on the crashing node and ran a repair, the nodes around this 
node crashed while streaming the data. Here are the relevant bits from the crash 
file and environment.

Any help would be appreciated.

#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 12288 bytes for committing 
reserved memory.
# Possible reasons:
#   The system is out of physical RAM or swap space
#   In 32 bit mode, the process size limit was hit
# Possible solutions:
#   Reduce memory load on the system
#   Increase physical memory or swap space
#   Check if swap backing store is full
#   Use 64 bit Java on a 64 bit OS
#   Decrease Java heap size (-Xmx/-Xms)
#   Decrease number of Java threads
#   Decrease Java thread stack sizes (-Xss)
#   Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
#
#  Out of Memory Error (os_linux.cpp:2671), pid=1104, tid=139950342317824
#
# JRE version: Java(TM) SE Runtime Environment (8.0_20-b26) (build 1.8.0_20-b26)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.20-b23 mixed mode linux-amd64 
compressed oops)
# Failed to write core dump. Core dumps have been disabled. To enable core 
dumping, try ulimit -c unlimited before starting Java again
#

---  T H R E A D  ---

Current thread (0x7f4acabb1800):  JavaThread Thread-13 [_thread_new, 
id=19171, stack(0x7f48ba6ca000,0x7f48ba70b000)]

Stack: [0x7f48ba6ca000,0x7f48ba70b000],  sp=0x7f48ba709a50,  free 
space=254k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0xa76cea]  VMError::report_and_die()+0x2ca
V  [libjvm.so+0x4e52fb]  report_vm_out_of_memory(char const*, int, unsigned 
long, VMErrorType, char const*)+0x8b
V  [libjvm.so+0x8e4ec3]  os::Linux::commit_memory_impl(char*, unsigned long, 
bool)+0x103
V  [libjvm.so+0x8e4f8c]  os::pd_commit_memory(char*, unsigned long, bool)+0xc
V  [libjvm.so+0x8dce4a]  os::commit_memory(char*, unsigned long, bool)+0x2a
V  [libjvm.so+0x8e33af]  os::pd_create_stack_guard_pages(char*, unsigned 
long)+0x7f
V  [libjvm.so+0xa21bde]  JavaThread::create_stack_guard_pages()+0x5e
V  [libjvm.so+0xa29954]  JavaThread::run()+0x34
V  [libjvm.so+0x8e75f8]  java_start(Thread*)+0x108
C  [libpthread.so.0+0x79d1]


Memory: 4k page, physical 131988232k(694332k free), swap 37748728k(37748728k 
free)

vm_info: Java HotSpot(TM) 64-Bit Server VM (25.20-b23) for linux-amd64 JRE 
(1.8.0_20-b26), built on Jul 30 2014 13:13:52 by java_re with gcc 4.3.0 
20080428 (Red Hat 4.3.0-8)

time: Fri Dec 19 14:37:29 2014
elapsed time: 2303 seconds (0d 0h 38m 23s)

OS:Red Hat Enterprise Linux Server release 6.5 (Santiago)

uname:Linux 2.6.32-431.5.1.el6.x86_64 #1 SMP Fri Jan 10 14:46:43 EST 2014 x86_64
libc:glibc 2.12 NPTL 2.12
rlimit: STACK 10240k, CORE 0k, NPROC 8192, NOFILE 65536, AS infinity
load average:4.18 4.79 4.54

/proc/meminfo:
MemTotal:   131988232 kB
MemFree:  694332 kB
Buffers:  837584 kB
Cached: 51002896 kB
SwapCached:0 kB
Active: 93953028 kB
Inactive:   32850628 kB
Active(anon):   70851112 kB
Inactive(anon):  4713848 kB
Active(file):   23101916 kB
Inactive(file): 28136780 kB
Unevictable:   0 kB
Mlocked:   0 kB
SwapTotal:  37748728 kB
SwapFree:   37748728 kB
Dirty: 75752 kB
Writeback: 0 kB
AnonPages:  74963768 kB
Mapped:   739884 kB
Shmem:601592 kB
Slab:3460252 kB
SReclaimable:3170124 kB
SUnreclaim:   290128 kB
KernelStack:   36224 kB
PageTables:   189772 kB
NFS_Unstable:  0 kB
Bounce:0 kB
WritebackTmp:  0 kB
CommitLimit:169736960 kB
Committed_AS:   92208740 kB
VmallocTotal:   34359738367 kB
VmallocUsed:  492032 kB
VmallocChunk:   34291733296 kB
HardwareCorrupted: 0 kB
AnonHugePages:  67717120 kB
HugePages_Total:   0
HugePages_Free:0
HugePages_Rsvd:0
HugePages_Surp:0
Hugepagesize:   2048 kB
DirectMap4k:5056 kB
DirectMap2M: 2045952 kB
DirectMap1G:132120576 kB

Before you say "it's a ulimit issue":
[501] ulimit -a
core file size  (blocks, -c) 0
data seg size   (kbytes, -d) unlimited
scheduling priority  

Re: In place vnode conversion possible?

2014-12-19 Thread Jonas Borgström
On 18/12/14 21:45, Robert Coli wrote:
 On Tue, Dec 16, 2014 at 12:38 AM, Jonas Borgström jo...@borgstrom.se wrote:
 
 That said, I've done some testing and it appears to be possible to
 perform an in place conversion as long as all nodes contain all data (3
 nodes and replication factor 3 for example) like this:
 
 
 I would expect this to work, but to stream up to RF x the data around.

Why would any streaming take place?

Simply changing the tokens and restarting a node does not seem to
trigger any streaming.

And if I manually trigger a nodetool repair I notice almost no
streaming since all nodes were already responsible for 100% of the data
(RF = NUM_NODES).

/ Jonas






Reset cfhistograms

2014-12-19 Thread nitin padalia
Hi,
I am using Cassandra 2.1.2 with a 5-node cluster in a single DC.
I've read that histograms are reset after a node restart or a rerun of the command,
but in my case they are not reset on every run.
Could someone point out what the issue could be, or how I could reset them without
restarting the node?
Thanks in advance!
-Nitin


Multi DC informations (sync)

2014-12-19 Thread Alain RODRIGUEZ
Hi guys,

We expanded our cluster to a multiple DC configuration.

Now I am wondering if there is any way to know:

1 - The replication lag between these 2 DC (Opscenter, nodetool, other ?)
2 - Make sure that sync is ok at any time

I guess big companies running Cassandra are interested in this kind of
info, so I think something exists but I am not aware of it.

Any other important information or advice you can give me about best
practices or tricks while running a multi DC setup (cross region US - EU) is
of course welcome!

cheers,

Alain


Re: 2014 nosql benchmark

2014-12-19 Thread Philo Yang
Today I've also seen this benchmark on Chinese websites. SequoiaDB seems to
come from a Chinese startup company, and in the db-engines ranking
http://db-engines.com/en/ranking its score is 0.00. So IMO I have to say
I think this benchmark is a soft sell. They compare three databases, two
written in C++ and one in Java, and use a very tricky test case so that
Cassandra cannot hold all of the data in memtables. After all, Java needs more
memory than C++. For an on-disk database, the data size of one node is
generally much larger than RAM, and its in-memory query performance is less
important than its on-disk query performance.

So I think this benchmark has no value at all.

2014-12-19 14:47 GMT+08:00 Wilm Schumacher wilm.schumac...@gmail.com:

  Hi,

 I'm always interested in such benchmark experiments, because the
 databases evolve so fast that the race is always open and there is a lot
 of motion in there.

 And of course I asked myself the same question. And I think that this
 publication is unreliable. For 4 reasons (from reading very fast, perhaps
 there is more):

 1.) It is unclear what this is all about. The title is NoSQL Performance
 Testing. The subtitle is In-Memory Performance Comparison of SequoiaDB,
 Cassandra, and MongoDB. However, in the introduction there is not one
 word about in-memory performance. The introduction could be a general
 introduction for a general on-disk NoSQL benchmark. So ... only the
 subtitle (and a short sentence in the Result Summary) says what this is
 actually about.

 2.) There are very important databases missing. For in-memory, e.g.
 redis. If redis is not a valid candidate in this race, why is this
 so? MySQL is capable of in-memory distributed databanking, too.

 3.) The methodology is unclear. Perhaps I'm the only one, but what does
 "Run workload for 30 minutes (workload file workload[1-5])" mean for mixed
 read/write ops? Why 30 min? Okay, I can imagine that the authors estimated
 the throughput, preset the number of 100 Mio rows and designed it to be
 larger than the estimated throughput in x minutes. However, all this
 information is missing. And why 45% and 22% of RAM? My first idea would be
 a VERY low ratio, like 2% or so, and a VERY large ratio, like 80-90%. And
 then everything in between. Is 22% or 45% somehow a magic number?
 Furthermore, in the Result Summary 1/2 and 1/4 of RAM are discussed.
 Okay, 22% is near 1/4 ... but where does the difference originate from? And
 btw. ... 22% of what? Stuff to insert? Stuff already inserted? It's all
 deducible, but it's strange that the description is so sloppy.

 4.) There is no repetition of the loads (as I understand it). It's one run,
 one result ... and it's done. I don't know a lot about Cassandra in in-memory
 use. But either the experiment should be repeated for quite some runs OR it
 should be explained why this is not necessary.

 Okay, perhaps 1 is a little picky, and 4 is a little fussy. But 3 is
 strange and 2 stinks.

 Well, just my first impression. And that is: Cassandra is very fast ;).

 Best regards

 Wilm


 On 19.12.2014 at 06:41, diwayou wrote:

  I just read this benchmark pdf, does anyone have some opinion
 about it?
 I think it's not fair to Cassandra.
 url:
 http://www.bankmark.de/wp-content/uploads/2014/12/bankmark-20141201-WP-NoSQLBenchmark.pdf
 http://msrg.utoronto.ca/papers/NoSQLBenchmark





Re: Multi DC informations (sync)

2014-12-19 Thread Jens Rantil
Alain,




AFAIK, the DC replication is not linearizable. That is, writes are not 
replicated according to a binlog or similar like in MySQL. They are replicated 
concurrently.




To answer your questions:

1 - Replication lag in Cassandra terms is probably “Hinted handoff”. You’d want 
to check the status of that.

2 - `nodetool status` is your friend. It will tell you whether the cluster 
considers other nodes reachable or not. Run it on a node in the datacenter that 
you’d like to test connectivity from.




Cheers,

Jens


———
Jens Rantil
Backend engineer
Tink AB

Email: jens.ran...@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se

Facebook Linkedin Twitter

On Fri, Dec 19, 2014 at 11:16 AM, Alain RODRIGUEZ arodr...@gmail.com
wrote:

 Hi guys,
 We expanded our cluster to a multiple DC configuration.
 Now I am wondering if there is any way to know:
 1 - The replication lag between these 2 DC (Opscenter, nodetool, other ?)
 2 - Make sure that sync is ok at any time
 I guess big companies running Cassandra are interested in these kind of
 info, so I think something exist but I am not aware of it.
 Any other important information or advice you can give me about best
 practices or tricks while running a multi DC (cross regions US - EU) is
 welcome of course !
 cheers,
 Alain

Re: Understanding tombstone WARN log output

2014-12-19 Thread Jens Rantil
Hi again,




A follow-up question (to my yet unanswered question):



How come the first localDeletion is Integer.MAX_VALUE above? Should it be?




Cheers,

Jens






———
Jens Rantil
Backend engineer
Tink AB

Email: jens.ran...@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se

Facebook Linkedin Twitter

On Thu, Dec 18, 2014 at 2:48 PM, Jens Rantil jens.ran...@tink.se wrote:

 Hi,
 I am occasionally seeing:
  WARN [ReadStage:9576] 2014-12-18 11:16:19,042 SliceQueryFilter.java (line
 225) Read 756 live and 17027 tombstoned cells in mykeyspace.mytable (see
 tombstone_warn_threshold). 5001 columns was requested,
 slices=[73c31274-f45c-4ba5-884a-6d08d20597e7:myfield-],
 delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647,
 ranges=[73f0b59e-7525-4a18-a84f-d2a2f0505503-73f0b59e-7525-4a18-a84f-d2a2f0505503:!,
 deletedAt=141872018676,
 localDeletion=1418720186][74374d72-2688-4e64-bb0b-f51a956b0529-74374d72-2688-4e64-bb0b-f51a956b0529:!,
 deletedAt=1418720184675000, localDeletion=1418720184] ...
 in system.log. My primary key is ((userid uuid), id uuid). Is it possible
 for me to see from this output which partition key and/or ranges have
 all of these tombstones?
 Thanks,
 Jens
 -- 
 Jens Rantil
 Backend engineer
 Tink AB
 Email: jens.ran...@tink.se
 Phone: +46 708 84 18 32
 Web: www.tink.se
 Facebook https://www.facebook.com/#!/tink.se Linkedin
 http://www.linkedin.com/company/2735919?trk=vsrp_companies_res_phototrkInfo=VSRPsearchId%3A1057023381369207406670%2CVSRPtargetId%3A2735919%2CVSRPcmpt%3Aprimary
  Twitter https://twitter.com/tink

Drivers performance

2014-12-19 Thread Svec, Michal
Hello,
I am in the middle of evaluating whether we should switch from Astyanax to the 
DataStax driver, and I did a simple benchmark that loads the same row by key 
10,000 times. I was surprised by the slowness of the DataStax driver. I uploaded 
it to GitHub.
https://github.com/michalsvec/astyanax-datastax-benchmark

It was tested against Cassandra 1.2 and 2.1. Testing conditions were naive 
(localhost, single node, ...) but still the difference is huge.

10 000 iterations:

* Astyanax:2734 ms

* Astyanax prepared:1997 ms

* Datastax:10230 ms

Is it really so slow or am I missing something?

Thank you for any advice.
Michal


NOTICE: This email and any attachments may contain confidential and proprietary 
information of NetSuite Inc. and is for the sole use of the intended recipient 
for the stated purpose. Any improper use or distribution is prohibited. If you 
are not the intended recipient, please notify the sender; do not review, copy 
or distribute; and promptly delete or destroy all transmitted information. 
Please note that all communications and information transmitted through this 
email system may be monitored by NetSuite or its agents and that all incoming 
email is automatically scanned by a third party spam and filtering service





Re: Multi DC informations (sync)

2014-12-19 Thread Alain RODRIGUEZ
Hi Jens, thanks for your insight.

Replication lag in Cassandra terms is probably “Hinted handoff” -- Well I
think hinted handoffs are only used when a node is down, and are not even
mandatorily enabled. I guess that cross-DC async replication is something
else, that has nothing to do with hinted handoffs, am I wrong ?

`nodetool status` is your friend. It will tell you whether the cluster
considers other nodes reachable or not. Run it on a node in the datacenter
that you’d like to test connectivity from. -- Connectivity ≠ write success

Basically the two questions can be rephrased this way:

1 - How to monitor the async cross-DC write latency ?
2 - What error should I look for when an async write fails (if any) ? Or is
there any other way to see that network throughput (for example) is too
small for a given traffic ?

Hope this is clearer.

C*heers,

Alain

2014-12-19 11:44 GMT+01:00 Jens Rantil jens.ran...@tink.se:

 Alain,

 AFAIK, the DC replication is not linearizable. That is, writes are are not
 replicated according to a binlog or similar like MySQL. They are replicated
 concurrently.

 To answer you questions:
 1 - Replication lag in Cassandra terms is probably “Hinted handoff”. You’d
 want to check the status of that.
 2 - `nodetool status` is your friend. It will tell you whether the cluster
 considers other nodes reachable or not. Run it on a node in the datacenter
 that you’d like to test connectivity from.

 Cheers,
 Jens

 ——— Jens Rantil Backend engineer Tink AB Email: jens.ran...@tink.se
 Phone: +46 708 84 18 32 Web: www.tink.se Facebook Linkedin Twitter


 On Fri, Dec 19, 2014 at 11:16 AM, Alain RODRIGUEZ arodr...@gmail.com
 wrote:

 Hi guys,

 We expanded our cluster to a multiple DC configuration.

 Now I am wondering if there is any way to know:

 1 - The replication lag between these 2 DC (Opscenter, nodetool, other ?)
 2 - Make sure that sync is ok at any time

 I guess big companies running Cassandra are interested in these kind of
 info, so I think something exist but I am not aware of it.

 Any other important information or advice you can give me about best
 practices or tricks while running a multi DC (cross regions US - EU) is
 welcome of course !

 cheers,

 Alain





Re: Drivers performance

2014-12-19 Thread Ryan Svihla
Better question for the java driver mailing list, but I see a number of
problems in your DataStax java driver code, and without knowing the way
Astyanax handles caching of prepared statements, I can tell you:

   1. You're re-preparing a statement on _every_ iteration, and these are
   not cached by the driver. This is not only expensive, it is slower than
   just using non-prepared statements. This is a substantial slowdown.
   Drivers are not necessarily implementing this the same way, so the code is
   not apples to apples. Change your code to prepare _once_ and I bet your
   numbers improve drastically.
   2. Your pooling options are CRAZY high, and I'm guessing you're running
   out of resources on the DataStax driver. Again, the code is different with
   different tradeoffs from Astyanax; a connection in thrift is not remotely
   the same as a connection in the native protocol. Just use the
   default pooling options and I bet your numbers improve greatly (if not,
   there is something deeply off about your cluster and/or app servers).
   3. A lot of the speedup in the java driver is in the async support and
   how the native protocol handles async. Since you're doing synchronous calls,
   this is the best case for thrift performance; however, that still does not
   explain your gap (in most synchronous cases thrift is comparable
   at best, but usually not faster).
   4. I haven't been able to figure out which version of the DataStax
   driver you're on from looking at the code; this can change performance
   drastically as there have been many improvements, especially for Cassandra
   2.1.

I suggest you reply to the java driver mailing list for more in depth
discussion
https://groups.google.com/a/lists.datastax.com/forum/#!forum/java-driver-user
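
As a rough illustration of points 1 and 2, here is a minimal sketch against the
2.x DataStax Java driver; the contact point, keyspace, table and key below are
made up and would need to be adapted to the benchmark's schema:

import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class PrepareOnceExample {
    public static void main(String[] args) {
        // Default pooling options: don't override them unless measurements say otherwise.
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("test_ks");

        // Prepare ONCE, outside the loop. Re-preparing on every iteration adds a
        // server round trip each time and is slower than plain statements.
        PreparedStatement ps = session.prepare("SELECT * FROM users WHERE id = ?");

        for (int i = 0; i < 10000; i++) {
            BoundStatement bound = ps.bind("key1");
            session.execute(bound);
        }

        cluster.close();
    }
}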

On Fri, Dec 19, 2014 at 7:26 AM, Svec, Michal ms...@netsuite.com wrote:


  Hello,

 I am in the middle of evaluating whether we should switch from Astyanax to
 datastax driver and I did simple benchmark that load 10 000 times the same
 row by key and I was surprised with the slowness of datastax driver. I
 uploaded it to github.

 https://github.com/michalsvec/astyanax-datastax-benchmark



 It was tested against Cassandra 1.2 and 2.1. Testing conditions were naive
 (localhost, single node, …) but still the difference is huge.



 10 000 iterations:

 · Astyanax:2734 ms

 · Astyanax prepared:1997 ms

 · Datastax:10230 ms



 Is it really so slow or do I miss something?



 Thank you for any advice.

 Michal




  NOTICE: This email and any attachments may contain confidential and
 proprietary information of NetSuite Inc. and is for the sole use of the
 intended recipient for the stated purpose. Any improper use or distribution
 is prohibited. If you are not the intended recipient, please notify the
 sender; do not review, copy or distribute; and promptly delete or destroy
 all transmitted information. Please note that all communications and
 information transmitted through this email system may be monitored and
 retained by NetSuite or its agents and that all incoming email is
 automatically scanned by a third party spam and filtering service which may
 result in deletion of a legitimate e-mail before it is read by the intended
 recipient.




-- 

[image: datastax_logo.png] http://www.datastax.com/

Ryan Svihla

Solution Architect

[image: twitter.png] https://twitter.com/foundev [image: linkedin.png]
http://www.linkedin.com/pub/ryan-svihla/12/621/727/

DataStax is the fastest, most scalable distributed database technology,
delivering Apache Cassandra to the world’s most innovative enterprises.
Datastax is built to be agile, always-on, and predictably scalable to any
size. With more than 500 customers in 45 countries, DataStax is the
database technology and transactional backbone of choice for the worlds
most innovative companies such as Netflix, Adobe, Intuit, and eBay.


Re: Cassandra 2.1.0 Crashes the JVM with OOM with heaps of memory free

2014-12-19 Thread Ryan Svihla
It does appear to be a ulimit issue to some degree, as some settings are
lower than recommended by a few factors (namely nproc).

http://www.datastax.com/documentation/cassandra/2.0/cassandra/install/installRecommendSettings.html

* - memlock unlimited
* - nofile 100000
* - nproc 32768
* - as unlimited

However, I'm also confident you have other issues as well that are going
to be problematic. Namely, what is your heap setting? Can you grep for
ERROR, WARN, dropped, and GCInspector in Cassandra's system.log and share
the results?


On Fri, Dec 19, 2014 at 2:23 AM, 谢良 xieli...@xiaomi.com wrote:

  What's your vm.max_map_count setting?


  Best Regards,

 Liang
  --
 *From:* Leon Oosterwijk leon.oosterw...@macquarie.com
 *Sent:* December 19, 2014 11:55
 *To:* user@cassandra.apache.org
 *Subject:* Cassandra 2.1.0 Crashes the JVM with OOM with heaps of memory free


 All,



 We have a Cassandra cluster which seems to be struggling a bit. I have one
 node which crashes continually, and others which crash sporadically. When
 they crash it’s with a JVM couldn’t allocate memory, even though there’s
 heaps available. I suspect it’s because one table which is very big.
 (500GB) which has on the order of 500K-700K files in its directory. When I
 delete the directory contents on the crashing node and ran a repair, the
 nodes around this node crashed while streaming the data. Here is the
 relevant bits from the crash file and environment.



 Any help would be appreciated.




Key Cache Questions

2014-12-19 Thread Batranut Bogdan
Hello all,

I just read that the default size of the key cache is 100 MB. Is it stored
in memory or on disk?

Re: Multi DC informations (sync)

2014-12-19 Thread Ryan Svihla
More accurately, the write path of Cassandra in a multi-DC sense is roughly
the following:

1. The write goes to a node which acts as coordinator.
2. Writes go out to all replicas in that DC, and then one write per remote
DC goes out to another node which takes responsibility for writing to all
replicas in its data center. The request blocks, however, until the CL is
satisfied.
3. If any of these writes fail, by default a hinted handoff is generated.

So as you can see, there is effectively no lag beyond raw network
latency + node speed and/or just failed writes and waiting on hint replay to
occur. Likewise, repairs can be used to bring the data centers back in sync,
and in the case of substantial outages you will need repairs to bring you
back in sync. You're running repairs already, right?

Think of Cassandra as a global write, and not a message queue, and you've
got the basic idea.


On Fri, Dec 19, 2014 at 7:54 AM, Alain RODRIGUEZ arodr...@gmail.com wrote:

 Hi Jens, thanks for your insight.

 Replication lag in Cassandra terms is probably “Hinted handoff” -- Well I
 think hinted handoff are only used when a node is down, and are not even
 mandatory enabled. I guess that cross DC async replication is something
 else, taht has nothing to see with hinted handoff, am I wrong ?

 `nodetool status` is your friend. It will tell you whether the cluster
 considers other nodes reachable or not. Run it on a node in the datacenter
 that you’d like to test connectivity from. -- Connectivity ≠ write success

 Basically the two question can be changed this way:

 1 - How to monitor the async cross dc write latency ?
 2 - What error should I look for when async write fails (if any) ? Or is
 there any other way to see that network throughput (for example) is too
 small for a given traffic.

 Hope this is clearer.

 C*heers,

 Alain

 2014-12-19 11:44 GMT+01:00 Jens Rantil jens.ran...@tink.se:

 Alain,

 AFAIK, the DC replication is not linearizable. That is, writes are are
 not replicated according to a binlog or similar like MySQL. They are
 replicated concurrently.

 To answer you questions:
 1 - Replication lag in Cassandra terms is probably “Hinted handoff”.
 You’d want to check the status of that.
 2 - `nodetool status` is your friend. It will tell you whether the
 cluster considers other nodes reachable or not. Run it on a node in the
 datacenter that you’d like to test connectivity from.

 Cheers,
 Jens

 ——— Jens Rantil Backend engineer Tink AB Email: jens.ran...@tink.se
 Phone: +46 708 84 18 32 Web: www.tink.se Facebook Linkedin Twitter


 On Fri, Dec 19, 2014 at 11:16 AM, Alain RODRIGUEZ arodr...@gmail.com
 wrote:

 Hi guys,

 We expanded our cluster to a multiple DC configuration.

 Now I am wondering if there is any way to know:

 1 - The replication lag between these 2 DC (Opscenter, nodetool, other ?)
 2 - Make sure that sync is ok at any time

 I guess big companies running Cassandra are interested in these kind of
 info, so I think something exist but I am not aware of it.

 Any other important information or advice you can give me about best
 practices or tricks while running a multi DC (cross regions US - EU) is
 welcome of course !

 cheers,

 Alain





-- 

[image: datastax_logo.png] http://www.datastax.com/

Ryan Svihla

Solution Architect

[image: twitter.png] https://twitter.com/foundev [image: linkedin.png]
http://www.linkedin.com/pub/ryan-svihla/12/621/727/

DataStax is the fastest, most scalable distributed database technology,
delivering Apache Cassandra to the world’s most innovative enterprises.
Datastax is built to be agile, always-on, and predictably scalable to any
size. With more than 500 customers in 45 countries, DataStax is the
database technology and transactional backbone of choice for the worlds
most innovative companies such as Netflix, Adobe, Intuit, and eBay.


Re: simple data movement ?

2014-12-19 Thread Langston, Jim
Thanks, this looks uglier. I double-checked my production cluster (I have a 
staging and a development cluster as well) and
production is on 1.2.8. A copy of the data resulted in a message:

Exception encountered during startup: Incompatible SSTable found. Current 
version ka is unable to read file: 
/cassandra/apache-cassandra-2.1.2/bin/../data/data/system/schema_keyspaces/system-schema_keyspaces-ic-150.
 Please run upgradesstables.

Is the move going to be 1.2.8 -> 1.2.9 -> 2.0.x -> 2.1.2 ??

Can I just dump the data and import it into 2.1.2 ??


Jim

From: Ryan Svihla rsvi...@datastax.com
Reply-To: user@cassandra.apache.org
Date: Thu, 18 Dec 2014 06:00:09 -0600
To: user@cassandra.apache.org
Subject: Re: simple data movement ?

I'm not sure that'll work with that many version moves in the middle, upgrades 
are to my knowledge only tested between specific steps, namely from 1.2.9 to 
the latest 2.0.x

http://www.datastax.com/documentation/upgrade/doc/upgrade/cassandra/upgradeC_c.html
Specifically:

Cassandra 2.0.x restrictions
(http://www.datastax.com/documentation/upgrade/doc/upgrade/cassandra/upgradeC_c.html?scroll=concept_ds_yqj_5xr_ck__section_ubt_nwr_54)

After downloading DataStax Community (http://planetcassandra.org/cassandra/), 
upgrade to Cassandra directly from Cassandra 1.2.9 or later. Cassandra 2.0 is 
not network- or SSTable-compatible with versions older than 1.2.9. If your 
version of Cassandra is earlier than 1.2.9 and you want to perform a rolling 
restart 
(http://www.datastax.com/documentation/cassandra/1.2/cassandra/glossary/gloss_rolling_restart.html), 
first upgrade the entire cluster to 1.2.9, and then to Cassandra 2.0.

Cassandra 2.1.x restrictions
(http://www.datastax.com/documentation/upgrade/doc/upgrade/cassandra/upgradeC_c.html?scroll=concept_ds_yqj_5xr_ck__section_qzx_pwr_54)

Upgrade to Cassandra 2.1 from Cassandra 2.0.7 or later.

Cassandra 2.1 is not compatible with Cassandra 1.x SSTables. First upgrade the 
nodes to Cassandra 2.0.7 or later, start the cluster, upgrade the SSTables, 
stop the cluster, and then upgrade to Cassandra 2.1.

On Wed, Dec 17, 2014 at 10:55 PM, Ben Bromhead b...@instaclustr.com wrote:
Just copy the data directory from each prod node to your test node (and 
relevant configuration files etc).

If your IP addresses are different between test and prod, follow 
https://engineering.eventbrite.com/changing-the-ip-address-of-a-cassandra-node-with-auto_bootstrapfalse/


On 18 December 2014 at 09:10, Langston, Jim jim.langs...@dynatrace.com wrote:
Hi all,

I have set up a test environment with C* 2.1.2, wanting to test our
applications against it. I currently have C* 1.2.9 in production and want
to use that data for testing. What would be a good approach for simply
taking a copy of the production data and moving it into the test env and
having the test env C* use that data ?

The test env. is identical is size, with the difference being the versions
of C*.

Thanks,

Jim
The contents of this e-mail are intended for the named addressee only. It 
contains information that may be confidential. Unless you are the named 
addressee or an authorized designee, you may not copy or use it, or disclose it 
to anyone else. If you received it in error please notify us immediately and 
then destroy it


--

Ben Bromhead

Instaclustr | www.instaclustr.com | @instaclustr (http://twitter.com/instaclustr) | +61 415 936 359


--

[datastax_logo.png]http://www.datastax.com/

Ryan Svihla

Solution Architect

[twitter.png]https://twitter.com/foundev [linkedin.png] 
http://www.linkedin.com/pub/ryan-svihla/12/621/727/


DataStax is the fastest, most scalable distributed database technology, 
delivering Apache Cassandra to the world’s most innovative enterprises. 
Datastax is built to be agile, always-on, and predictably scalable to any size. 
With more than 500 customers in 45 countries, DataStax is the database 
technology and transactional backbone of choice for the worlds most innovative 
companies such as Netflix, Adobe, Intuit, and eBay.

The contents of this e-mail are intended for the named addressee only. It 
contains information that may be confidential. Unless you are the named 
addressee or an authorized designee, you may not copy or use it, or disclose it 
to anyone else. If you received it in error please notify us immediately and 
then destroy it


Re: High Bloom Filter FP Ratio

2014-12-19 Thread Mark Greene
We're seeing similar behavior except our FP ratio is closer to 1.0 (100%).

We're using Cassandra 2.1.2.


Schema
---
CREATE TABLE contacts.contact (
id bigint,
property_id int,
created_at bigint,
updated_at bigint,
value blob,
PRIMARY KEY (id, property_id)
) WITH CLUSTERING ORDER BY (property_id ASC)
*AND bloom_filter_fp_chance = 0.001*
AND caching = '{keys:ALL, rows_per_partition:NONE}'
AND comment = ''
AND compaction = {'min_threshold': '4', 'class':
'org.apache.cassandra.db.compaction.LeveledCompactionStrategy',
'max_threshold': '32'}
AND compression = {'sstable_compression':
'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';

CF Stats Output:
-
Keyspace: contacts
Read Count: 2458375
Read Latency: 0.852844076675 ms.
Write Count: 10357
Write Latency: 0.1816912233272183 ms.
Pending Flushes: 0
Table: contact
SSTable count: 61
SSTables in each level: [1, 10, 50, 0, 0, 0, 0, 0, 0]
Space used (live): 9047112471
Space used (total): 9047112471
Space used by snapshots (total): 0
SSTable Compression Ratio: 0.34119240020241487
Memtable cell count: 24570
Memtable data size: 1299614
Memtable switch count: 2
Local read count: 2458290
Local read latency: 0.853 ms
Local write count: 10044
Local write latency: 0.186 ms
Pending flushes: 0
Bloom filter false positives: 11096
*Bloom filter false ratio: 0.99197*
Bloom filter space used: 3923784
Compacted partition minimum bytes: 373
Compacted partition maximum bytes: 152321
Compacted partition mean bytes: 9938
Average live cells per slice (last five minutes): 37.57851240677983
Maximum live cells per slice (last five minutes): 63.0
Average tombstones per slice (last five minutes): 0.0
Maximum tombstones per slice (last five minutes): 0.0

--
about.me http://about.me/markgreene

On Wed, Dec 17, 2014 at 1:32 PM, Chris Hart ch...@remilon.com wrote:

 Hi,

 I have created the following table with bloom_filter_fp_chance=0.01:

 CREATE TABLE logged_event (
   time_key bigint,
   partition_key_randomizer int,
   resource_uuid timeuuid,
   event_json text,
   event_type text,
    field_error_list map<text, text>,
   javascript_timestamp timestamp,
   javascript_uuid uuid,
   page_impression_guid uuid,
   page_request_guid uuid,
   server_received_timestamp timestamp,
   session_id bigint,
   PRIMARY KEY ((time_key, partition_key_randomizer), resource_uuid)
 ) WITH
   bloom_filter_fp_chance=0.01 AND
   caching='KEYS_ONLY' AND
   comment='' AND
   dclocal_read_repair_chance=0.00 AND
   gc_grace_seconds=864000 AND
   index_interval=128 AND
   read_repair_chance=0.00 AND
   replicate_on_write='true' AND
   populate_io_cache_on_flush='false' AND
   default_time_to_live=0 AND
   speculative_retry='99.0PERCENTILE' AND
   memtable_flush_period_in_ms=0 AND
   compaction={'class': 'SizeTieredCompactionStrategy'} AND
   compression={'sstable_compression': 'LZ4Compressor'};


 When I run cfstats, I see a much higher false positive ratio:

 Table: logged_event
 SSTable count: 15
 Space used (live), bytes: 104128214227
 Space used (total), bytes: 104129482871
 SSTable Compression Ratio: 0.3295840184239226
 Number of keys (estimate): 199293952
 Memtable cell count: 56364
 Memtable data size, bytes: 20903960
 Memtable switch count: 148
 Local read count: 1396402
 Local read latency: 0.362 ms
 Local write count: 2345306
 Local write latency: 0.062 ms
 Pending tasks: 0
 Bloom filter false positives: 147705
 Bloom filter false ratio: 0.49020
 Bloom filter space used, bytes: 249129040
 Compacted partition minimum bytes: 447
 Compacted partition maximum bytes: 315852
 Compacted partition mean bytes: 1636
 Average live cells per slice (last five minutes): 0.0
 Average tombstones per slice (last five minutes): 0.0

 Any idea what could be causing this?  This is timeseries data.  Every time
 we read from this table, we read a single row key with 1000
 partition_key_randomizer values.  I'm running cassandra 2.0.11.  I tried
 running an upgradesstables to rewrite 

Re: Multi DC informations (sync)

2014-12-19 Thread Alain RODRIGUEZ
All that you said match the idea I had of how it works except this part:

The request blocks however until all CL is satisfied -- Does this mean
that the client will see an error if the local DC writes the data correctly
(i.e. CL reached) but the remote DC fails ? This is not the idea I had of
something asynchronous...

If it doesn't fail on the client side (real asynchronous), is there a way to
make sure the remote DC has indeed received the information ? I mean, if the
cross-region throughput is too small, the write will fail and so will the
HH, potentially. How do we detect that we are lacking cross-DC throughput, for
example ?

Repairs are indeed a good thing (we run them as a weekly routine, GC grace
period 10 sec), but having inconsistency for a week without knowing it is
quite an issue.

Thanks for this detailed information Ryan, I hope I am clear enough while
expressing my doubts.

C*heers

Alain

2014-12-19 15:43 GMT+01:00 Ryan Svihla rsvi...@datastax.com:

 More accurately,the write path of Cassandra in a multi dc sense is kinda
 like the following

 1. write goes to a node which acts as coordinator
 2. writes go out to all replicas in that DC, and then one write per remote
 DC goes out to another node which takes responsibility for writing to all
 replicas in it's data center. The request blocks however until all CL is
 satisfied.
 3. if any of these writes fail by default a hinted handoff is generated..

 So as you can see..there is effectively not lag beyond either raw
 network latency+node speed and/or just failed writes and waiting on hint
 replay to occur. Likewise repairs can be used to make the data centers back
 in sync, and in the case of substantial outages you will need repairs to
 bring you back in sync, you're running repairs already right?

 Think of Cassandra as a global write, and not a message queue, and you've
 got the basic idea.


 On Fri, Dec 19, 2014 at 7:54 AM, Alain RODRIGUEZ arodr...@gmail.com
 wrote:

 Hi Jens, thanks for your insight.

 Replication lag in Cassandra terms is probably “Hinted handoff” -- Well
 I think hinted handoff are only used when a node is down, and are not even
 mandatory enabled. I guess that cross DC async replication is something
 else, taht has nothing to see with hinted handoff, am I wrong ?

 `nodetool status` is your friend. It will tell you whether the cluster
 considers other nodes reachable or not. Run it on a node in the datacenter
 that you’d like to test connectivity from. -- Connectivity ≠ write success

 Basically the two question can be changed this way:

 1 - How to monitor the async cross dc write latency ?
 2 - What error should I look for when async write fails (if any) ? Or is
 there any other way to see that network throughput (for example) is too
 small for a given traffic.

 Hope this is clearer.

 C*heers,

 Alain

 2014-12-19 11:44 GMT+01:00 Jens Rantil jens.ran...@tink.se:

 Alain,

 AFAIK, the DC replication is not linearizable. That is, writes are are
 not replicated according to a binlog or similar like MySQL. They are
 replicated concurrently.

 To answer you questions:
 1 - Replication lag in Cassandra terms is probably “Hinted handoff”.
 You’d want to check the status of that.
 2 - `nodetool status` is your friend. It will tell you whether the
 cluster considers other nodes reachable or not. Run it on a node in the
 datacenter that you’d like to test connectivity from.

 Cheers,
 Jens

 ——— Jens Rantil Backend engineer Tink AB Email: jens.ran...@tink.se
 Phone: +46 708 84 18 32 Web: www.tink.se Facebook Linkedin Twitter


 On Fri, Dec 19, 2014 at 11:16 AM, Alain RODRIGUEZ arodr...@gmail.com
 wrote:

 Hi guys,

 We expanded our cluster to a multiple DC configuration.

 Now I am wondering if there is any way to know:

 1 - The replication lag between these 2 DC (Opscenter, nodetool, other
 ?)
 2 - Make sure that sync is ok at any time

 I guess big companies running Cassandra are interested in these kind of
 info, so I think something exist but I am not aware of it.

 Any other important information or advice you can give me about best
 practices or tricks while running a multi DC (cross regions US - EU) is
 welcome of course !

 cheers,

 Alain





 --

 [image: datastax_logo.png] http://www.datastax.com/

 Ryan Svihla

 Solution Architect

 [image: twitter.png] https://twitter.com/foundev [image: linkedin.png]
 http://www.linkedin.com/pub/ryan-svihla/12/621/727/

 DataStax is the fastest, most scalable distributed database technology,
 delivering Apache Cassandra to the world’s most innovative enterprises.
 Datastax is built to be agile, always-on, and predictably scalable to any
 size. With more than 500 customers in 45 countries, DataStax is the
 database technology and transactional backbone of choice for the worlds
 most innovative companies such as Netflix, Adobe, Intuit, and eBay.




Re: simple data movement ?

2014-12-19 Thread Jonathan Haddad
It may be more valuable to set up your test cluster as the same version,
and make sure your tokens are the same. Then copy over your sstables.
You'll have an exact replica of prod and you can test your upgrade process.

On Fri Dec 19 2014 at 11:04:58 AM Ryan Svihla rsvi...@datastax.com wrote:

 In theory, you could always do a data dump (sstable to JSON and back, for
 example), but you'd have to have your schema set up, and I've not actually
 done this myself, so YMMV.

 I've helped a bunch of folks with that upgrade path, and while it's time
 consuming, it does work.

 On Fri, Dec 19, 2014 at 8:49 AM, Langston, Jim jim.langs...@dynatrace.com
  wrote:

  Thanks, this looks uglier , I double checked my production cluster ( I
 have a staging and development cluster as well ) and
 production is on 1.2.8. A copy of the data resulted in a mssage :

  Exception encountered during startup: Incompatible SSTable found.
 Current version ka is unable to read file:
 /cassandra/apache-cassandra-2.1.2/bin/../data/data/system/schema_keyspaces/system-schema_keyspaces-ic-150.
 Please run upgradesstables.

  Is the move going to to be 1.2.8 -- 1.2.9 -- 2.0.x -- 2.1.2 ??

  Can I just dump the data and import it into 2.1.2 ??


  Jim

   From: Ryan Svihla rsvi...@datastax.com
 Reply-To: user@cassandra.apache.org
 Date: Thu, 18 Dec 2014 06:00:09 -0600
 To: user@cassandra.apache.org
 Subject: Re: simple data movement ?

  I'm not sure that'll work with that many version moves in the middle,
 upgrades are to my knowledge only tested between specific steps, namely
 from 1.2.9 to the latest 2.0.x


 http://www.datastax.com/documentation/upgrade/doc/upgrade/cassandra/upgradeC_c.html
  Specifically:

   Cassandra 2.0.x restrictions¶
 http://www.datastax.com/documentation/upgrade/doc/upgrade/cassandra/upgradeC_c.html?scroll=concept_ds_yqj_5xr_ck__section_ubt_nwr_54

 After downloading DataStax Community
 http://planetcassandra.org/cassandra/, upgrade to Cassandra directly
 from Cassandra 1.2.9 or later. Cassandra 2.0 is not network- or
 SSTable-compatible with versions older than 1.2.9. If your version of
 Cassandra is earlier than 1.2.9 and you want to perform a rolling restart
 http://www.datastax.com/documentation/cassandra/1.2/cassandra/glossary/gloss_rolling_restart.html,
 first upgrade the entire cluster to 1.2.9, and then to Cassandra 2.0.
  Cassandra 2.1.x restrictions¶
 http://www.datastax.com/documentation/upgrade/doc/upgrade/cassandra/upgradeC_c.html?scroll=concept_ds_yqj_5xr_ck__section_qzx_pwr_54

 Upgrade to Cassandra 2.1 from Cassandra 2.0.7 or later.

 Cassandra 2.1 is not compatible with Cassandra 1.x SSTables. First
 upgrade the nodes to Cassandra 2.0.7 or later, start the cluster, upgrade
 the SSTables, stop the cluster, and then upgrade to Cassandra 2.1.

 On Wed, Dec 17, 2014 at 10:55 PM, Ben Bromhead b...@instaclustr.com
 wrote:

 Just copy the data directory from each prod node to your test node (and
 relevant configuration files etc).

  If your IP addresses are different between test and prod, follow
 https://engineering.eventbrite.com/changing-the-ip-address-of-a-cassandra-node-with-auto_bootstrapfalse/


 On 18 December 2014 at 09:10, Langston, Jim jim.langs...@dynatrace.com
 wrote:

  Hi all,

  I have set up a test environment with C* 2.1.2, wanting to test our
 applications against it. I currently have C* 1.2.9 in production and
 want
 to use that data for testing. What would be a good approach for simply
 taking a copy of the production data and moving it into the test env and
 having the test env C* use that data ?

  The test env. is identical is size, with the difference being the
 versions
 of C*.

  Thanks,

  Jim
  The contents of this e-mail are intended for the named addressee only.
 It contains information that may be confidential. Unless you are the named
 addressee or an authorized designee, you may not copy or use it, or
 disclose it to anyone else. If you received it in error please notify us
 immediately and then destroy it



   --

 Ben Bromhead

 Instaclustr | www.instaclustr.com | @instaclustr
 http://twitter.com/instaclustr | +61 415 936 359



  --

 [image: datastax_logo.png] http://www.datastax.com/

 Ryan Svihla

 Solution Architect

 [image: twitter.png] https://twitter.com/foundev [image: linkedin.png]
 http://www.linkedin.com/pub/ryan-svihla/12/621/727/

  DataStax is the fastest, most scalable distributed database technology,
 delivering Apache Cassandra to the world’s most innovative enterprises.
 Datastax is built to be agile, always-on, and predictably scalable to any
 size. With more than 500 customers in 45 countries, DataStax is the
 database technology and transactional backbone of choice for the worlds
 most innovative companies such as Netflix, Adobe, Intuit, and eBay.

 The contents of this e-mail are intended for the named addressee
 only. It contains information that may be confidential. Unless you are the
 named addressee or an authorized designee, you may not 

Node down during move

2014-12-19 Thread Jiri Horky
Hi list,

we added a new node to an existing 8-node cluster with C* 1.2.9 without
vnodes, and because we are almost totally out of space, we are shuffling
the tokens of one node after another (not in parallel). During one of these
move operations, the receiving node died and thus the streaming failed:

 WARN [Streaming to /X.Y.Z.18:2] 2014-12-19 19:25:56,227
StorageService.java (line 3703) Streaming to /X.Y.Z.18 failed
 INFO [RMI TCP Connection(12940)-X.Y.Z.17] 2014-12-19 19:25:56,233
ColumnFamilyStore.java (line 629) Enqueuing flush of
Memtable-local@433096244(70/70 serialized/live bytes, 2 ops)
 INFO [FlushWriter:3772] 2014-12-19 19:25:56,238 Memtable.java (line
461) Writing Memtable-local@433096244(70/70 serialized/live bytes, 2 ops)
ERROR [Streaming to /X.Y.Z.18:2] 2014-12-19 19:25:56,246
CassandraDaemon.java (line 192) Exception in thread Thread[Streaming to
/X.Y.Z.18:2,5,RMI Runtime]
java.lang.RuntimeException: java.io.IOException: Broken pipe
at com.google.common.base.Throwables.propagate(Throwables.java:160)
at
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)

After restart of the receiving node, we tried to perform the move again,
but it failed with:

Exception in thread main java.io.IOException: target token
113427455640312821154458202477256070486 is already owned by another node.
at
org.apache.cassandra.service.StorageService.move(StorageService.java:2930)

So we tried to move it with a token just 1 higher, to trigger the
movement. This didn't move anything, but finished successfully:

 INFO [Thread-5520] 2014-12-19 20:00:24,689 StreamInSession.java (line
199) Finished streaming session 4974f3c0-87b1-11e4-bf1b-97d9ac6bd256
from /X.Y.Z.18

Now, it is quite improbable that the first streaming was done and it
died just after copying everything, as the ERROR was the last message
about streaming in the logs. Is there any way to make sure the data
really has moved, and thus that running nodetool cleanup is safe?
   
Thank you.
Jiri Hoky


Re: Multi DC informations (sync)

2014-12-19 Thread Ryan Svihla
replies inline

On Fri, Dec 19, 2014 at 10:30 AM, Alain RODRIGUEZ arodr...@gmail.com
wrote:

 All that you said match the idea I had of how it works except this part:

 The request blocks however until all CL is satisfied -- Does this mean
 that the client will see an error if the local DC write the data correctly
 (i.e. CL reached) but the remote DC fails ? This is not the idea I had of
 something asynchronous...



Asynchronous just means all requests are sent out at once; the client response
is blocked until the CL is satisfied or a timeout occurs.

If CL is ONE, for example, the first response back will be a success on
the client, regardless of what has happened in the background. If it's, say,
ALL, then yes, it would wait for all responses to come back.



 If it doesn't fail on client side (real asynchronous), is there a way to
 make sure remote DC has indeed received the information ? I mean if the
 throughput cross regions is to small, the write will fail and so will the
 HH, potentially. How to detect we are lacking of throughput cross DC for
 example ?

Monitoring, logging, etc.

If an application needs EACH_QUORUM consistency across all data centers and
the performance penalty is worthwhile, then that's probably what you're
asking for. If LOCAL_QUORUM + regular repairs is fine, then do that; if CL
ONE is fine, then do that.
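
As a rough sketch of what those choices look like on the client side with the
2.x DataStax Java driver (the keyspace, table and values are made up for
illustration):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class WriteConsistencyExample {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        // Returns as soon as a quorum of replicas in the local DC acknowledges;
        // the remote DC is written to in the background (hints cover failures).
        SimpleStatement localWrite =
            new SimpleStatement("INSERT INTO ks.tbl (id, val) VALUES (1, 'x')");
        localWrite.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
        session.execute(localWrite);

        // Blocks until a quorum in EVERY data center acknowledges, so the client
        // is told about cross-DC failures, at the cost of cross-region latency.
        SimpleStatement globalWrite =
            new SimpleStatement("INSERT INTO ks.tbl (id, val) VALUES (2, 'y')");
        globalWrite.setConsistencyLevel(ConsistencyLevel.EACH_QUORUM);
        session.execute(globalWrite);

        cluster.close();
    }
}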

You SHOULD be monitoring dropped mutations and hints via JMX or something
like OpsCenter. Outages of substantial length should probably involve a
repair; if an outage lasts longer than your HH timeout, it DEFINITELY should
involve a repair. If you ever have a doubt, it should involve a repair.
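
For the monitoring piece, a minimal JMX sketch in Java follows; the MBean names
are assumptions based on the 2.x metrics layout and should be verified against
your version (OpsCenter reads the same counters):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class HintAndDropMonitor {
    public static void main(String[] args) throws Exception {
        // Cassandra's default JMX port is 7199; the host is an assumption.
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        MBeanServerConnection mbs = connector.getMBeanServerConnection();

        // Assumed metric names: dropped MUTATION messages and total hints written.
        ObjectName droppedMutations = new ObjectName(
            "org.apache.cassandra.metrics:type=DroppedMessage,scope=MUTATION,name=Dropped");
        ObjectName totalHints = new ObjectName(
            "org.apache.cassandra.metrics:type=Storage,name=TotalHints");

        System.out.println("Dropped mutations: " + mbs.getAttribute(droppedMutations, "Count"));
        System.out.println("Hints written:     " + mbs.getAttribute(totalHints, "Count"));

        connector.close();
    }
}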



 Repairs are indeed a good thing (we run them as a weekly routine, GC grace
 period 10 sec), but having inconsistency for a week without knowing it is
 quite an issue.


Then use a higher consistency level so that the client is not surprised,
knows the state of things, and doesn't consider a write successful
until it's consistent across the data centers (I'd argue this is probably
not what you really want, but different applications have different needs).
If you need only local data center level awareness, LOCAL_QUORUM reads
and writes will get you where you want; but complete multi-datacenter,
nearly immediate consistency that you know about on the client is not free,
and it isn't with any system.




 Thanks for this detailed information Ryan, I hope I am clear enough while
 expressing my doubts.


I think it's a bit of a misunderstanding of the tools available. If you
have a need for full, nearly immediate, cross data center consistency, my
suggestion is to size (from a network pipe and application design SLA
perspective) for a higher CL on writes and potentially reads; the tools are
there.



 C*heers

 Alain

 2014-12-19 15:43 GMT+01:00 Ryan Svihla rsvi...@datastax.com:

 More accurately,the write path of Cassandra in a multi dc sense is kinda
 like the following

 1. write goes to a node which acts as coordinator
 2. writes go out to all replicas in that DC, and then one write per
 remote DC goes out to another node which takes responsibility for writing
 to all replicas in it's data center. The request blocks however until all
 CL is satisfied.
 3. if any of these writes fail by default a hinted handoff is generated..

 So as you can see..there is effectively not lag beyond either raw
 network latency+node speed and/or just failed writes and waiting on hint
 replay to occur. Likewise repairs can be used to make the data centers back
 in sync, and in the case of substantial outages you will need repairs to
 bring you back in sync, you're running repairs already right?

 Think of Cassandra as a global write, and not a message queue, and you've
 got the basic idea.


 On Fri, Dec 19, 2014 at 7:54 AM, Alain RODRIGUEZ arodr...@gmail.com
 wrote:

 Hi Jens, thanks for your insight.

 Replication lag in Cassandra terms is probably “Hinted handoff” -- Well
 I think hinted handoff are only used when a node is down, and are not even
 mandatory enabled. I guess that cross DC async replication is something
 else, taht has nothing to see with hinted handoff, am I wrong ?

 `nodetool status` is your friend. It will tell you whether the cluster
 considers other nodes reachable or not. Run it on a node in the datacenter
 that you’d like to test connectivity from. -- Connectivity ≠ write success

 Basically the two question can be changed this way:

 1 - How to monitor the async cross dc write latency ?
 2 - What error should I look for when async write fails (if any) ? Or is
 there any other way to see that network throughput (for example) is too
 small for a given traffic.

 Hope this is clearer.

 C*heers,

 Alain

 2014-12-19 11:44 GMT+01:00 Jens Rantil jens.ran...@tink.se:

 Alain,

 AFAIK, the DC replication is not linearizable. That is, writes are are
 not replicated according to a binlog or similar like MySQL. They are
 replicated concurrently.

 To answer you questions:
 1 - Replication lag in Cassandra terms is 

Re: Key Cache Questions

2014-12-19 Thread Ryan Svihla
If you have JNA installed it's stored off-heap in RAM; without JNA it's
stored on-heap in RAM. The following should help explain in more depth:

http://www.datastax.com/dev/blog/maximizing-cache-benefit-with-cassandra

On Fri, Dec 19, 2014 at 8:35 AM, Batranut Bogdan batra...@yahoo.com wrote:

 Hello all,
 I just read that the default size of the Key cache is 100 MB. Is it stored
 in memory or disk?



-- 

[image: datastax_logo.png] http://www.datastax.com/

Ryan Svihla

Solution Architect

[image: twitter.png] https://twitter.com/foundev [image: linkedin.png]
http://www.linkedin.com/pub/ryan-svihla/12/621/727/

DataStax is the fastest, most scalable distributed database technology,
delivering Apache Cassandra to the world’s most innovative enterprises.
Datastax is built to be agile, always-on, and predictably scalable to any
size. With more than 500 customers in 45 countries, DataStax is the
database technology and transactional backbone of choice for the worlds
most innovative companies such as Netflix, Adobe, Intuit, and eBay.


Re: Multi DC informations (sync)

2014-12-19 Thread Jonathan Haddad
Your gc_grace_seconds should be longer than your repair schedule; otherwise you're
 likely going to have deleted data resurface.
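
As a sketch of that advice, with a weekly repair cycle the per-table setting
needs to stay well above seven days; the default of 10 days is already
comfortably above that (table name hypothetical):

-- 864000 s = 10 days, comfortably longer than a weekly repair schedule
ALTER TABLE my_ks.my_table WITH gc_grace_seconds = 864000;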

On Fri Dec 19 2014 at 8:31:13 AM Alain RODRIGUEZ arodr...@gmail.com wrote:

 All that you said matches the idea I had of how it works, except this part:

 The request blocks, however, until the CL is satisfied -- Does this mean
 that the client will see an error if the local DC writes the data correctly
 (i.e. CL reached) but the remote DC fails? This is not the idea I had of
 something asynchronous...

 If it doesn't fail on the client side (truly asynchronous), is there a way to
 make sure the remote DC has indeed received the information? I mean, if the
 cross-region throughput is too small, the write will fail and so will the
 HH, potentially. How do we detect that we are lacking cross-DC throughput, for
 example?

 Repairs are indeed a good thing (we run them as a weekly routine, GC grace
 period 10 sec), but having inconsistency for a week without knowing it is
 quite an issue.

 Thanks for this detailed information, Ryan. I hope I am being clear enough in
 expressing my doubts.

 C*heers

 Alain

 2014-12-19 15:43 GMT+01:00 Ryan Svihla rsvi...@datastax.com:

 More accurately, the write path of Cassandra in a multi-DC sense is kinda
 like the following:

 1. The write goes to a node which acts as coordinator.
 2. Writes go out to all replicas in that DC, and then one write per
 remote DC goes out to another node which takes responsibility for writing
 to all replicas in its data center. The request blocks, however, until the
 CL is satisfied.
 3. If any of these writes fail, by default a hinted handoff is generated.

 So as you can see, there is effectively no lag beyond either raw
 network latency + node speed and/or just failed writes waiting on hint
 replay to occur. Likewise, repairs can be used to bring the data centers back
 in sync, and in the case of substantial outages you will need repairs to
 bring you back in sync; you're running repairs already, right?

 Think of Cassandra as a global write, not a message queue, and you've
 got the basic idea.


 On Fri, Dec 19, 2014 at 7:54 AM, Alain RODRIGUEZ arodr...@gmail.com
 wrote:

 Hi Jens, thanks for your insight.

 Replication lag in Cassandra terms is probably “Hinted handoff” -- Well
 I think hinted handoffs are only used when a node is down, and they are not
 even necessarily enabled. I guess that cross-DC async replication is something
 else that has nothing to do with hinted handoff, am I wrong?

 `nodetool status` is your friend. It will tell you whether the cluster
 considers other nodes reachable or not. Run it on a node in the datacenter
 that you’d like to test connectivity from. -- Connectivity ≠ write success

 Basically the two questions can be rephrased this way:

 1 - How to monitor the async cross-DC write latency?
 2 - What error should I look for when an async write fails (if any)? Or is
 there any other way to see that the network throughput (for example) is too
 small for a given traffic?

 Hope this is clearer.

 C*heers,

 Alain

 2014-12-19 11:44 GMT+01:00 Jens Rantil jens.ran...@tink.se:

 Alain,

 AFAIK, the DC replication is not linearizable. That is, writes are
 not replicated according to a binlog or similar, as in MySQL. They are
 replicated concurrently.

 To answer your questions:
 1 - Replication lag in Cassandra terms is probably “Hinted handoff”.
 You’d want to check the status of that.
 2 - `nodetool status` is your friend. It will tell you whether the
 cluster considers other nodes reachable or not. Run it on a node in the
 datacenter that you’d like to test connectivity from.

 Cheers,
 Jens

 ——— Jens Rantil Backend engineer Tink AB Email: jens.ran...@tink.se
 Phone: +46 708 84 18 32 Web: www.tink.se Facebook Linkedin Twitter


 On Fri, Dec 19, 2014 at 11:16 AM, Alain RODRIGUEZ arodr...@gmail.com
 wrote:

 Hi guys,

 We expanded our cluster to a multiple DC configuration.

 Now I am wondering if there is any way to know:

 1 - The replication lag between these 2 DCs (OpsCenter, nodetool, other?)
 2 - How to make sure that the sync is OK at any time

 I guess big companies running Cassandra are interested in this kind
 of info, so I think something exists but I am not aware of it.

 Any other important information or advice you can give me about best
 practices or tricks while running a multi-DC setup (cross-region US - EU) is
 of course welcome!

 cheers,

 Alain






Re: High Bloom Filter FP Ratio

2014-12-19 Thread Tyler Hobbs
I took a look at the code where the bloom filter true/false positive
counters are updated and noticed that the true-positive count isn't being
updated on key cache hits:
https://issues.apache.org/jira/browse/CASSANDRA-8525. Reads served from the
key cache never bump the true-positive counter, so the ratio is computed
against an artificially small denominator and drifts toward 1.0. That may
explain your ratios.

Can you try querying for a few non-existent partition keys in cqlsh with
tracing enabled (just run TRACING ON) and see if you really do get that
high of a false-positive ratio?
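
A minimal sketch of that check against the contacts.contact schema quoted below,
assuming id = -1 is a partition key that does not exist:

TRACING ON;
SELECT * FROM contacts.contact WHERE id = -1;

The trace shows, for each SSTable touched, whether the bloom filter let the read
skip it; if most SSTables are skipped for keys that don't exist, the effective
false-positive rate is far lower than the cfstats ratio suggests.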

On Fri, Dec 19, 2014 at 9:59 AM, Mark Greene green...@gmail.com wrote:

 We're seeing similar behavior except our FP ratio is closer to 1.0 (100%).

 We're using Cassandra 2.1.2.


 Schema
 ---
 CREATE TABLE contacts.contact (
 id bigint,
 property_id int,
 created_at bigint,
 updated_at bigint,
 value blob,
 PRIMARY KEY (id, property_id)
 ) WITH CLUSTERING ORDER BY (property_id ASC)
 *AND bloom_filter_fp_chance = 0.001*
 AND caching = '{keys:ALL, rows_per_partition:NONE}'
 AND comment = ''
 AND compaction = {'min_threshold': '4', 'class':
 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy',
 'max_threshold': '32'}
 AND compression = {'sstable_compression':
 'org.apache.cassandra.io.compress.LZ4Compressor'}
 AND dclocal_read_repair_chance = 0.1
 AND default_time_to_live = 0
 AND gc_grace_seconds = 864000
 AND max_index_interval = 2048
 AND memtable_flush_period_in_ms = 0
 AND min_index_interval = 128
 AND read_repair_chance = 0.0
 AND speculative_retry = '99.0PERCENTILE';

 CF Stats Output:
 -
 Keyspace: contacts
 Read Count: 2458375
 Read Latency: 0.852844076675 ms.
 Write Count: 10357
 Write Latency: 0.1816912233272183 ms.
 Pending Flushes: 0
 Table: contact
 SSTable count: 61
 SSTables in each level: [1, 10, 50, 0, 0, 0, 0, 0, 0]
 Space used (live): 9047112471
 Space used (total): 9047112471
 Space used by snapshots (total): 0
 SSTable Compression Ratio: 0.34119240020241487
 Memtable cell count: 24570
 Memtable data size: 1299614
 Memtable switch count: 2
 Local read count: 2458290
 Local read latency: 0.853 ms
 Local write count: 10044
 Local write latency: 0.186 ms
 Pending flushes: 0
 Bloom filter false positives: 11096
 *Bloom filter false ratio: 0.99197*
 Bloom filter space used: 3923784
 Compacted partition minimum bytes: 373
 Compacted partition maximum bytes: 152321
 Compacted partition mean bytes: 9938
 Average live cells per slice (last five minutes): 37.57851240677983
 Maximum live cells per slice (last five minutes): 63.0
 Average tombstones per slice (last five minutes): 0.0
 Maximum tombstones per slice (last five minutes): 0.0

 --
 about.me http://about.me/markgreene

 On Wed, Dec 17, 2014 at 1:32 PM, Chris Hart ch...@remilon.com wrote:

 Hi,

 I have created the following table with bloom_filter_fp_chance=0.01:

 CREATE TABLE logged_event (
   time_key bigint,
   partition_key_randomizer int,
   resource_uuid timeuuid,
   event_json text,
   event_type text,
   field_error_list map<text, text>,
   javascript_timestamp timestamp,
   javascript_uuid uuid,
   page_impression_guid uuid,
   page_request_guid uuid,
   server_received_timestamp timestamp,
   session_id bigint,
   PRIMARY KEY ((time_key, partition_key_randomizer), resource_uuid)
 ) WITH
   bloom_filter_fp_chance=0.01 AND
   caching='KEYS_ONLY' AND
   comment='' AND
   dclocal_read_repair_chance=0.00 AND
   gc_grace_seconds=864000 AND
   index_interval=128 AND
   read_repair_chance=0.00 AND
   replicate_on_write='true' AND
   populate_io_cache_on_flush='false' AND
   default_time_to_live=0 AND
   speculative_retry='99.0PERCENTILE' AND
   memtable_flush_period_in_ms=0 AND
   compaction={'class': 'SizeTieredCompactionStrategy'} AND
   compression={'sstable_compression': 'LZ4Compressor'};


 When I run cfstats, I see a much higher false positive ratio:

 Table: logged_event
 SSTable count: 15
 Space used (live), bytes: 104128214227
 Space used (total), bytes: 104129482871
 SSTable Compression Ratio: 0.3295840184239226
 Number of keys (estimate): 199293952
 Memtable cell count: 56364
 Memtable data size, bytes: 20903960
 Memtable switch count: 148
 Local read count: 1396402
 Local read latency: 0.362 ms
 Local write count: 2345306
 Local write latency: 0.062 ms
 Pending tasks: 0
 Bloom filter false positives: 147705
 Bloom filter false ratio: 0.49020
 Bloom filter space used, bytes: 

Re: High Bloom Filter FP Ratio

2014-12-19 Thread Chris Hart
Hi Tyler,

I tried what you said and false positives look much more reasonable there.  
Thanks for looking into this.

-Chris

- Original Message -
From: Tyler Hobbs ty...@datastax.com
To: user@cassandra.apache.org
Sent: Friday, December 19, 2014 1:25:29 PM
Subject: Re: High Bloom Filter FP Ratio


Re: In place vnode conversion possible?

2014-12-19 Thread Robert Coli
On Fri, Dec 19, 2014 at 12:25 AM, Jonas Borgström jo...@borgstrom.se
wrote:

 Why would any streaming take place?

 Simply changing the tokens and restarting a node does not seem to
 trigger any streaming.


Oh, sorry for not reading the whole mail, I figured you were going to do
something less low-level and hacky. :)

That method seems like it would work. Basically in this case (RF=N) shotgun
range movements are safe, because nothing's actually moving.

=Rob


Re: Practical use of counters in the industry

2014-12-19 Thread Robert Coli
On Thu, Dec 18, 2014 at 7:19 PM, Rajath Subramanyam rajat...@gmail.com
wrote:

 Thanks Ken. Any other use cases where counters are used apart from
 Rainbird?


Disqus use(d? s?) them behind an in-memory accumulator which batches and
periodically flushes. This is the best way to use old counters. New
counters should be usable in more cases without something in front of
them.

=Rob
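
A minimal sketch of that accumulator pattern on the Cassandra side (keyspace,
table and values hypothetical): the application sums increments in memory and
periodically flushes one UPDATE per key instead of one per event.

-- hypothetical counter table
CREATE TABLE metrics.page_views (
  page text PRIMARY KEY,
  views counter
);

-- flush of the in-memory accumulator: one increment carrying the whole batched delta
UPDATE metrics.page_views SET views = views + 42 WHERE page = '/home';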