[ 
https://issues.apache.org/jira/browse/CASSANDRA-8098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mck updated CASSANDRA-8098:
---------------------------
    Description: 
Today, using CqlInputFormat, it's only possible to 
 - enforce data-locality to one specific data-center, or
 - disable it by changing CL from LOCAL_ONE to ONE.

We need a way to enforce data-locality to specific *data-centers*, and would 
like to contribute a solution.

Suggested ideas
 - having CqlInputFormat (gently) call describeLocalRing against all the listed 
connection addresses and merge the results into one masterRangeNodes list, or 
 - changing the signature of describeLocalRing(..) to describeRings(String 
keyspace, String[] dc) and having the job specify which DCs it will be running 
within.


*Long description*
A lot has changed since CASSANDRA-2388 that has made integrating C* and Hadoop 
a lot easier, for example: CqlInputFormat, CL.LOCAL_ONE, 
LimitedLocalNodeFirstLocalBalancingPolicy, vnodes, and describe_local_ring.

When using CqlInputFormat, if you don't want to be stuck within 
datacenter-locality you can, for example, change the consistency level from 
LOCAL_ONE to ONE. That's great, but describe_local_ring + CL.LOCAL_ONE in its 
current implementation isn't enough for us. We have multiple datacenters for 
offline and multiple for online, because we still want the availability 
advantages that come from aligning virtual datacenters to physical datacenters 
for the offline stuff too. That is, using Hadoop for aggregation purposes on 
top of C* doesn't always imply one can settle for a CP solution.

Some of our jobs have their own InputFormat implementation that uses 
describe_ring, LOCAL_ONE, and data with replicas only in the offline 
datacenters. This works very well, except that the last point kinda sucks 
because we have online clients that want to read this data and then have to do 
so through nodes in the offline datacenters. Underlying performance 
improvements (e.g. cross_node_timeout and speculative requests) have helped, 
but there's still the need to separate online and offline. If we wanted to push 
replicas out onto the online nodes, I think the best approach for us would be 
to filter out those splits/locations in getRangeMap(..).
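
To illustrate the kind of filtering meant here, a minimal sketch follows. It is 
not the actual getRangeMap(..) code: the class name, the method name, the 
string-based range keys and the host-to-datacenter lookup are all hypothetical 
stand-ins chosen to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Cassandra: drops replica locations that
// live outside the datacenters a job is allowed to run in.
final class SplitLocationFilter {

    /**
     * For every token range, keep only the replica hosts whose datacenter is
     * in allowedDcs. A range with no allowed replica is kept with an empty
     * location list, so no data is skipped; it just loses its locality hint.
     */
    static Map<String, List<String>> filterByDatacenter(
            Map<String, List<String>> rangeToHosts, // assumed shape: range -> replica hosts
            Map<String, String> hostToDc,           // assumed lookup: host -> datacenter name
            Set<String> allowedDcs) {
        Map<String, List<String>> filtered = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : rangeToHosts.entrySet()) {
            List<String> kept = new ArrayList<>();
            for (String host : entry.getValue()) {
                String dc = hostToDc.get(host);
                if (dc != null && allowedDcs.contains(dc)) {
                    kept.add(host);
                }
            }
            filtered.put(entry.getKey(), kept);
        }
        return filtered;
    }
}
```

With the offline datacenters passed as allowedDcs, replicas on the online nodes 
would no longer be offered as split locations.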

Back to this issue: we also have jobs using CqlInputFormat. Specifying multiple 
client input addresses doesn't help take advantage of the multiple offline 
datacenters, because the Cassandra.Client only makes one call to 
describe_local_ring, and StorageService.describeLocalRing(..) only checks 
against its own address. It would work to have either a) CqlInputFormat call 
describeLocalRing against all the listed connection addresses and merge the 
results into one masterRangeNodes list, or b) something along the lines of 
changing the signature of describeLocalRing(..) to describeRings(String 
keyspace, String[] dc) and having the job specify which DCs it will be running 
within.
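
As a rough sketch of option a), the merge could look something like the 
following. Everything here is an assumption made for illustration: RangeNodes 
is a simplified stand-in for one entry of a describe_local_ring result, the 
ring lookup is passed in as a plain function rather than a real thrift client 
call, and the ranges returned from each datacenter are assumed to share the 
same boundaries.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch, not existing Cassandra code.
final class LocalRingMerger {

    /** Simplified stand-in for one token range and its local-DC endpoints. */
    static final class RangeNodes {
        final String startToken;
        final String endToken;
        final List<String> endpoints;

        RangeNodes(String startToken, String endToken, List<String> endpoints) {
            this.startToken = startToken;
            this.endToken = endToken;
            this.endpoints = endpoints;
        }
    }

    /**
     * Asks every listed address for its local ring and unions the endpoints
     * per (start, end) range into one master list.
     */
    static List<RangeNodes> mergeLocalRings(
            List<String> addresses,
            Function<String, List<RangeNodes>> describeLocalRing) {
        Map<List<String>, Set<String>> merged = new LinkedHashMap<>();
        for (String address : addresses) {
            List<RangeNodes> ring;
            try {
                ring = describeLocalRing.apply(address);
            } catch (RuntimeException e) {
                // "gently": one unreachable address should not fail the job
                continue;
            }
            for (RangeNodes r : ring) {
                merged.computeIfAbsent(List.of(r.startToken, r.endToken),
                                       k -> new LinkedHashSet<>())
                      .addAll(r.endpoints);
            }
        }
        List<RangeNodes> masterRangeNodes = new ArrayList<>();
        for (Map.Entry<List<String>, Set<String>> e : merged.entrySet()) {
            masterRangeNodes.add(new RangeNodes(
                    e.getKey().get(0), e.getKey().get(1), new ArrayList<>(e.getValue())));
        }
        return masterRangeNodes;
    }
}
```

Option b) would instead move the datacenter selection to the server side, so 
the client would make a single call and no merge would be needed.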

  was:
Today, using CqlInputFormat, it's only possible to 
 - enforce data-locality to one specific data-center, or
 - disable it by changing CL from LOCAL_ONE to ONE.

We need a way to enforce data-locality to specific *data-centers*, and would 
like to contribute a solution.

Suggested ideas
 - having CqlInputFormat (gently) call describeLocalRing against all the listed 
connection addresses and merge the results into one masterRangeNodes list, or 
 - changing the signature of describeLocalRing(..) to describeRings(String 
keyspace, String[] dc) and having the job specify which DCs it will be running 
within.


*Long description*
A lot has changed since CASSANDRA-2388 that has made integrating C* and Hadoop 
a lot easier, for example: CqlInputFormat, CL.LOCAL_ONE, 
LimitedLocalNodeFirstLocalBalancingPolicy, vnodes, and describe_local_ring.

When using CqlInputFormat, if you don't want to be stuck within 
datacenter-locality you can, for example, change the consistency level from 
LOCAL_ONE to ONE. That's great, but describe_local_ring + CL.LOCAL_ONE in its 
current implementation isn't enough for us. We have multiple datacenters for 
offline and multiple for online, because we still want the availability 
advantages that come from aligning virtual datacenters to physical datacenters 
for the offline stuff too. 

Some of our jobs have their own InputFormat implementation that uses 
describe_ring, LOCAL_ONE, and data with replicas only in the offline 
datacenters. This works very well, except that the last point kinda sucks 
because we have online clients that want to read this data and then have to do 
so through nodes in the offline datacenters. Underlying performance 
improvements (e.g. cross_node_timeout and speculative requests) have helped, 
but there's still the need to separate online and offline. If we wanted to push 
replicas out onto the online nodes, I think the best approach for us would be 
to filter out those splits/locations in getRangeMap(..).

Back to this issue: we also have jobs using CqlInputFormat. Specifying multiple 
client input addresses doesn't help take advantage of the multiple offline 
datacenters, because the Cassandra.Client only makes one call to 
describe_local_ring, and StorageService.describeLocalRing(..) only checks 
against its own address. It would work to have either a) CqlInputFormat call 
describeLocalRing against all the listed connection addresses and merge the 
results into one masterRangeNodes list, or b) something along the lines of 
changing the signature of describeLocalRing(..) to describeRings(String 
keyspace, String[] dc) and having the job specify which DCs it will be running 
within.


> Allow CqlInputFormat to be restricted to more than one data-center
> ------------------------------------------------------------------
>
>                 Key: CASSANDRA-8098
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8098
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>            Reporter: mck
>            Assignee: mck


