[
https://issues.apache.org/jira/browse/CASSANDRA-15637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Aleksey Yeschenko updated CASSANDRA-15637:
------------------------------------------
Reviewers: Aleksey Yeschenko, Brandon Williams (was: Brandon Williams)
> CqlInputFormat regression going from 2.1 to 3.x caused by semantic difference
> between thrift and the new system.size_estimates table when dealing with
> multiple dc deployments
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-15637
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15637
> Project: Cassandra
> Issue Type: Bug
> Components: Legacy/Tools
> Reporter: David Capwell
> Assignee: David Capwell
> Priority: Normal
> Labels: pull-request-available
> Fix For: 4.0-alpha
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> In 3.0 CqlInputFormat switched away from thrift in favor of a new
> system.size_estimates table, but the semantics changed when dealing with
> multiple DCs or when Cassandra is not collocated with Hadoop.
> The core issues are:
> * system.size_estimates uses the primary range, in a multi-dc setup this
> could lead to uneven ranges
> example:
> {code}
> DC1: [0, 10, 20, 30]
> DC2: [1, 11, 21, 31]
> DC3: [2, 12, 22, 32]
> {code}
> Using NetworkTopologyStrategy the primary ranges are: [0, 1), [1, 2), [2,
> 10), [10, 11), [11, 12), [12, 20), [20, 21), [21, 22), [22, 30), [30, 31),
> [31, 32), [32, 0).
> Given this the only ranges that are more than one token are: [2, 10), [12,
> 20), [22, 30).
> * system.size_estimates is not replicated so need to hit every node in the
> cluster to get estimates, if nodes are down in the DC with non-size-1 ranges
> there is no way to get a estimate.
> * CqlInputFormat used to call describe_local_ring so all interactions were
> with a single DC, the java driver doesn't filter the DC so looks to allow
> cross DC traffic and includes nodes from other DCs in the replica set; in the
> example above, the amount of splits went from 4 to 12.
> * CqlInputFormat used to call describe_splits_ex to dynamically calculate the
> estimates, this was on the "local primary range" and was able to hit replicas
> to create estimates if the primary was down. With system.size_estimates we no
> longer have backup and no longer expose the "local primary range" in multi-dc.
> * CqlInputFormat had a config cassandra.input.keyRange which let you define
> your own range. If the range doesn't perfectly match the local range then
> the intersectWith calls will produce ranges with no estimates. Example: [0,
> 10, 20], cassandra.input.keyRange=5,15. This won't find any estimates so
> will produce 2 splits with 128 estimate (default when not found).
> * CqlInputFormat special cases Cassandra being collocated with Hadoop and
> assumes this when querying system.size_estimates as it doesn't filter to the
> specific host, this means that non-collocated deployments randomly select the
> nodes and create splits with ranges the hosts do not have locally.
> The problems are deterministic to replicate, the following test will show it
> 1) deploy a 3 DC cluster with 3 nodes each
> 2) create DC2 tokens are +1 of DC1 and DC3 are +1 of DC2
> 3) CREATE KEYSPACE simpleuniform0 WITH replication = {‘class’:
> ‘NetworkTopologyStrategy’, ‘DC1’: 3, ‘DC2’: 3, ‘DC3’: 3};
> 4) CREATE TABLE simpletable0 (pk bigint, ck bigint, value blob, PRIMARY KEY
> (pk, ck))
> 5) insert 500k partitions uniformly: [0, 500,000)
> 6) wait until estimates catch up to writes
> 7) for all nodes, SELECT * FROM system.size_estimates
> You will get the following
> {code}
> keyspace_name | table_name | range_start | range_end
> | mean_partition_size | partitions_count
> ----------------+--------------+----------------------+----------------------+---------------------+------------------
> simpleuniform0 | simpletable0 | -9223372036854775808 | -6148914691236517206
> | 87 | 122240
> simpleuniform0 | simpletable0 | 6148914691236517207 | -9223372036854775808
> | 87 | 121472
> (2 rows)
> keyspace_name | table_name | range_start | range_end |
> mean_partition_size | partitions_count
> ----------------+--------------+-------------+---------------------+---------------------+------------------
> simpleuniform0 | simpletable0 | 2 | 6148914691236517205 |
> 87 | 243072
> (1 rows)
> keyspace_name | table_name | range_start | range_end
> | mean_partition_size | partitions_count
> ----------------+--------------+----------------------+----------------------+---------------------+------------------
> simpleuniform0 | simpletable0 | -6148914691236517206 | -6148914691236517205
> | 87 | 1
> (1 rows)
> keyspace_name | table_name | range_start | range_end |
> mean_partition_size | partitions_count
> ----------------+--------------+-------------+-----------+---------------------+------------------
> simpleuniform0 | simpletable0 | 0 | 1 |
> 87 | 1
> (1 rows)
> keyspace_name | table_name | range_start | range_end |
> mean_partition_size | partitions_count
> ----------------+--------------+---------------------+---------------------+---------------------+------------------
> simpleuniform0 | simpletable0 | 6148914691236517205 | 6148914691236517206 |
> 87 | 1
> (1 rows)
> keyspace_name | table_name | range_start | range_end
> | mean_partition_size | partitions_count
> ----------------+--------------+----------------------+----------------------+---------------------+------------------
> simpleuniform0 | simpletable0 | -6148914691236517205 | -6148914691236517204
> | 87 | 1
> (1 rows)
> keyspace_name | table_name | range_start | range_end |
> mean_partition_size | partitions_count
> ----------------+--------------+-------------+-----------+---------------------+------------------
> simpleuniform0 | simpletable0 | 1 | 2 |
> 87 | 1
> (1 rows)
> keyspace_name | table_name | range_start | range_end |
> mean_partition_size | partitions_count
> ----------------+--------------+---------------------+---------------------+---------------------+------------------
> simpleuniform0 | simpletable0 | 6148914691236517206 | 6148914691236517207 |
> 87 | 1
> (1 rows)
> {code}
> 8) create a MR job against simpleuniform0. simpletable0, you will get 10
> splits where as 2.1 was 4
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]