Re: Decommissioned nodes and FailureDetector

2018-01-26 Thread Oleksandr Shulgin
On Fri, Jan 19, 2018 at 6:53 PM, Tom van der Woerdt <
tom.vanderwoe...@booking.com> wrote:

>
> Here's the code I use, hope it helps:
> ...
>

Thanks Tom, that really does the trick!

--
Alex


Re: Decommissioned nodes and FailureDetector

2018-01-19 Thread Tom van der Woerdt
Hi Oleksandr,

Here's the code I use, hope it helps:

ownership = jolokia_read("org.apache.cassandra.db:type=StorageService",
                         "Ownership")
unreachable = jolokia_read("org.apache.cassandra.db:type=StorageService",
                           "UnreachableNodes")

# Ownership keys are split on '/'; the part after it is used as the IP.
ownership_by_ip = {}
for nodeinfo, ownership_ratio in ownership.items():
    ownership_by_ip[nodeinfo.split('/')[1]] = ownership_ratio

# Skip unreachable nodes that own no data, such as decommissioned ones.
unreachable_and_has_data = []
for node in set(unreachable):
    if node not in ownership_by_ip or ownership_by_ip[node] == 0:
        continue
    unreachable_and_has_data.append(node)

# Collect the distinct rack/datacenter pairs of the remaining nodes.
unreachable_racks = {}
for node in unreachable_and_has_data:
    its_rack = jolokia_exec("org.apache.cassandra.db:type=EndpointSnitchInfo",
                            "getRack/%s" % node)
    its_dc = jolokia_exec("org.apache.cassandra.db:type=EndpointSnitchInfo",
                          "getDatacenter/%s" % node)
    rack_name = "%s %s" % (its_rack, its_dc)
    unreachable_racks[rack_name] = 1

racks_unreachable = len(unreachable_racks.keys())
nodes_unreachable = len(unreachable_and_has_data)

This also looks at the number of unreachable racks, so if you only care
about nodes you should be able to get rid of most code here.
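
For the nodes-only case, the filtering step can be pulled out into a small function. This is a rough sketch, not part of the code above: the function name and the sample data are made up for illustration, and it assumes the Ownership keys look like "hostname/ip" or "/ip" while UnreachableNodes is a list of plain IP strings.

```python
def unreachable_nodes_with_data(ownership, unreachable):
    """Return unreachable IPs that still own a non-zero share of the ring."""
    ownership_by_ip = {}
    for nodeinfo, ratio in ownership.items():
        # Assumed key shape: "hostname/ip" or "/ip"; keep the part after the last '/'.
        ownership_by_ip[nodeinfo.rsplit('/', 1)[-1]] = ratio
    # Nodes missing from the ownership map are treated like nodes owning nothing.
    return sorted(
        node for node in set(unreachable)
        if ownership_by_ip.get(node, 0) != 0
    )


# Made-up sample data, shaped like the two StorageService attributes.
sample_ownership = {"/10.0.0.1": 0.5, "/10.0.0.2": 0.5, "/10.0.0.3": 0.0}
sample_unreachable = ["10.0.0.2", "10.0.0.3", "10.0.0.4"]

nodes_down = unreachable_nodes_with_data(sample_ownership, sample_unreachable)
print(len(nodes_down), nodes_down)
```

In this sample only 10.0.0.2 is reported: 10.0.0.3 owns nothing and 10.0.0.4 is not in the ownership map at all, so both are skipped.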

Tom van der Woerdt
Site Reliability Engineer

Booking.com B.V.
Vijzelstraat 66-80 Amsterdam 1017HL Netherlands


Re: Decommissioned nodes and FailureDetector

2018-01-19 Thread Oleksandr Shulgin
On Fri, Jan 19, 2018 at 11:17 AM, Nicolas Guyomar wrote:

> Hi,
>
> Not sure if StorageService should be accessed, but you can check node
> movement here :
> 'org.apache.cassandra.db:type=StorageService/LeavingNodes',
> 'org.apache.cassandra.db:type=StorageService/LiveNodes',
> 'org.apache.cassandra.db:type=StorageService/UnreachableNodes',
>

Checking the list of UnreachableNodes doesn't help, unfortunately, since
it contains a mix of decommissioned nodes and nodes that are just DOWN. So
the total number of addresses in this list equals DownEndpointCount, as
seen from the node you query.

--
Alex


Re: Decommissioned nodes and FailureDetector

2018-01-19 Thread Nicolas Guyomar
Hi,

Not sure if StorageService should be accessed, but you can check node
movement here:
'org.apache.cassandra.db:type=StorageService/LeavingNodes',
'org.apache.cassandra.db:type=StorageService/LiveNodes',
'org.apache.cassandra.db:type=StorageService/UnreachableNodes',
'org.apache.cassandra.db:type=StorageService/MovingNodes'
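
If these are read through a Jolokia agent, each "MBean/Attribute" pair above maps directly onto a read URL. A minimal sketch, assuming an agent at http://localhost:8778/jolokia (host and port are assumptions, adjust them for your setup); the helper names are illustrative only:

```python
import json
from urllib.request import urlopen

# Assumption: a Jolokia agent is reachable at this address.
JOLOKIA_BASE = "http://localhost:8778/jolokia"
STORAGE_SERVICE = "org.apache.cassandra.db:type=StorageService"


def jolokia_read_url(mbean, attribute, base=JOLOKIA_BASE):
    """Build the GET URL for a Jolokia read request: <base>/read/<mbean>/<attribute>."""
    return "%s/read/%s/%s" % (base.rstrip("/"), mbean, attribute)


def jolokia_read(mbean, attribute, base=JOLOKIA_BASE, timeout=5):
    """Fetch one attribute; Jolokia wraps the result in a JSON object under "value"."""
    with urlopen(jolokia_read_url(mbean, attribute, base), timeout=timeout) as resp:
        return json.load(resp)["value"]


# Print the URLs only; call jolokia_read() against a live agent to get the values.
for attribute in ("LeavingNodes", "LiveNodes", "UnreachableNodes", "MovingNodes"):
    print(jolokia_read_url(STORAGE_SERVICE, attribute))
```

There is no error handling here: an error response from the agent carries no "value" key, so jolokia_read() would raise a KeyError in that case.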


Decommissioned nodes and FailureDetector

2018-01-19 Thread Oleksandr Shulgin
Hello,

Is there a better way to monitor for Cassandra nodes going Down than
querying via JMX for a condition like FailureDetector.DownEndpointCount > 0?

The problem for us is that when a node is decommissioned, it keeps counting
toward DownEndpointCount for another ~3 days (the famous 72 hours of gossip).

Is there a similar metric to be observed which doesn't include nodes which
are expected to be down?

Regards,
-- 
Oleksandr "Alex" Shulgin | Database Engineer | Zalando SE | Tel: +49 176
127-59-707