Re: How to bulkload into a specific data center?
On Fri, Jan 9, 2015 at 3:55 PM, Robert Coli rc...@eventbrite.com wrote:

> On Fri, Jan 9, 2015 at 11:38 AM, Benyi Wang bewang.t...@gmail.com wrote:
>
>> Is it possible to modify SSTableLoader to allow it to access only one data center?
>
> Even if you only write to nodes in DC A, if you replicate that data to DC B, it will have to travel over the WAN anyway. What are you trying to avoid?

I'm lucky in that those are virtual data centers on a LAN. I just don't want a load burst in the service virtual data center, because it may degrade the REST service. I'm trying to load data into the analytics virtual data center and then let Cassandra slowly replicate the data into the service virtual data center. It is OK for the REST service to read some stale data while replication is in progress. I'm wondering if I should just use "Throttle speed in Mbits" to solve my problem.

>> Because I may load ~100 million rows, I think spark-cassandra-connector might be too slow. I'm wondering whether the "copy-the-sstables / nodetool refresh" method described in http://www.pythian.com/blog/bulk-loading-options-for-cassandra/ would be a good choice. I'm still a newbie to Cassandra and could not fully understand what the author said on that page.
>
> The author of that post is as wise as he is modest... ;D
>
>> One of my questions is:
>>
>> * When I run a Spark job in YARN mode, the SSTables are created in the YARN working directory.
>> * Assume I have a way to copy those files into the Cassandra data directory on the same node.
>> * Because the data are distributed across all of the analytics data center's nodes, each node has only a part of the SSTables: node A has part A, node B has part B. If I run refresh on each node, will node A eventually have parts A and B, and node B parts A and B too? Am I right?
>
> I'm not sure I fully understand your question, but...
>
> In order to run refresh without having to immediately run cleanup, you need SSTables which contain data only for the ranges owned by the node you are loading them onto. So for an RF=3, N=3 cluster without vnodes (the simple case), data naturally belongs on every node. For an RF=3, N=6 cluster with nodes A B C D E F, node C contains:
>
> - The third replica for A.
> - The second replica for B.
> - The first replica for C.
>
> In order to generate the correct SSTables, you need to account for all three replica ranges that should live on each node. With vnodes, and with nodes joining and parting, this becomes more difficult. That's why people tend to use sstableloader and the streaming interface: with sstableloader, Cassandra takes input which might belong on any replica and sends it to the appropriate nodes.
>
> =Rob
> http://twitter.com/rcolidba

I'd better stay with SSTableLoader. Thanks for your explanation.
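The "Throttle speed in Mbits" Benyi mentions is sstableloader's -t flag. A minimal sketch of a throttled load (the contact host and SSTable path below are hypothetical):

    # -d: initial contact node(s); -t: throttle in Mbits (default is unlimited)
    sstableloader -d 10.0.1.10 -t 100 /tmp/load/my_keyspace/my_table

Streaming can also be capped on the server side via stream_throughput_outbound_megabits_per_sec in cassandra.yaml, which limits what each receiving node will stream out during replication.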
Re: How to bulkload into a specific data center?
On Fri, Jan 9, 2015 at 11:38 AM, Benyi Wang bewang.t...@gmail.com wrote:

> Is it possible to modify SSTableLoader to allow it to access only one data center?

Even if you only write to nodes in DC A, if you replicate that data to DC B, it will have to travel over the WAN anyway. What are you trying to avoid?

> Because I may load ~100 million rows, I think spark-cassandra-connector might be too slow. I'm wondering whether the "copy-the-sstables / nodetool refresh" method described in http://www.pythian.com/blog/bulk-loading-options-for-cassandra/ would be a good choice. I'm still a newbie to Cassandra and could not fully understand what the author said on that page.

The author of that post is as wise as he is modest... ;D

> One of my questions is:
>
> * When I run a Spark job in YARN mode, the SSTables are created in the YARN working directory.
> * Assume I have a way to copy those files into the Cassandra data directory on the same node.
> * Because the data are distributed across all of the analytics data center's nodes, each node has only a part of the SSTables: node A has part A, node B has part B. If I run refresh on each node, will node A eventually have parts A and B, and node B parts A and B too? Am I right?

I'm not sure I fully understand your question, but...

In order to run refresh without having to immediately run cleanup, you need SSTables which contain data only for the ranges owned by the node you are loading them onto. So for an RF=3, N=3 cluster without vnodes (the simple case), data naturally belongs on every node. For an RF=3, N=6 cluster with nodes A B C D E F, node C contains:

- The third replica for A.
- The second replica for B.
- The first replica for C.

In order to generate the correct SSTables, you need to account for all three replica ranges that should live on each node. With vnodes, and with nodes joining and parting, this becomes more difficult. That's why people tend to use sstableloader and the streaming interface: with sstableloader, Cassandra takes input which might belong on any replica and sends it to the appropriate nodes.

=Rob
http://twitter.com/rcolidba
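To make Rob's point concrete: before dropping SSTables into a node's data directory and running refresh, you would need to check which ranges that node replicates. A hedged sketch using stock nodetool commands (the keyspace and table names are hypothetical):

    # show each token range and the endpoints that replicate it
    nodetool describering my_keyspace

    # after copying SSTables into the table's data directory on this node,
    # load them without a restart
    nodetool refresh my_keyspace my_table

If the loaded SSTables contain rows outside the node's owned ranges, a subsequent nodetool cleanup is required to drop them, which is exactly the overhead Rob warns about.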
Re: How to bulkload into a specific data center?
Hi Ryan,

Thanks for your reply. Now I understand how SSTableLoader works.

- If I understand correctly, the current o.a.c.io.sstable.SSTableLoader doesn't use LOCAL_ONE or LOCAL_QUORUM. Is that right?
- Is it possible to modify SSTableLoader to allow it to access only one data center?

Because I may load ~100 million rows, I think spark-cassandra-connector might be too slow. I'm wondering whether the "copy-the-sstables / nodetool refresh" method described in http://www.pythian.com/blog/bulk-loading-options-for-cassandra/ would be a good choice. I'm still a newbie to Cassandra and could not fully understand what the author said on that page.

One of my questions is:

* When I run a Spark job in YARN mode, the SSTables are created in the YARN working directory.
* Assume I have a way to copy those files into the Cassandra data directory on the same node.
* Because the data are distributed across all of the analytics data center's nodes, each node has only a part of the SSTables: node A has part A, node B has part B. If I run refresh on each node, will node A eventually have parts A and B, and node B parts A and B too? Am I right?

Thanks.

On Thu, Jan 8, 2015 at 6:34 AM, Ryan Svihla r...@foundev.pro wrote:

> Just noticed you'd sent this to the dev list. This is a question for the user list only; please do not send questions of this type to the developer list.
>
> On Thu, Jan 8, 2015 at 8:33 AM, Ryan Svihla r...@foundev.pro wrote:
>
>> The nature of replication factor is such that writes will go wherever there is replication. If you want responses to be faster, and to not involve the REST data center in the Spark job's response path, I suggest using a CQL driver with the LOCAL_ONE or LOCAL_QUORUM consistency level (look at the Spark Cassandra connector here: https://github.com/datastax/spark-cassandra-connector). While write traffic will still be replicated to the REST service data center, because you do want those results available there, you will not be waiting on the remote data center to respond successfully.
>>
>> Final point: bulk loading sends a copy per replica across the wire. So let's say you have RF=3 in each data center; that means bulk loading will send out 6 copies from the client at once, whereas normal mutations via Thrift or CQL writes go out between data centers as 1 copy, and that node then forwards them on to the other replicas. This means inter-data-center traffic in this case would be 3x higher with the bulk loader than with a traditional CQL or Thrift based client.
>>
>> On Wed, Jan 7, 2015 at 6:32 PM, Benyi Wang bewang.t...@gmail.com wrote:
>>
>>> I set up two virtual data centers, one for analytics and one for a REST service. The analytics data center sits on top of a Hadoop cluster. I want to bulk load my ETL results into the analytics data center so that the REST service won't bear the heavy load. I'm using CQLTableInputFormat in my Spark application, and I gave the nodes in the analytics data center as the initial addresses. However, I found my jobs were connecting to the REST service data center. How can I specify the data center?
>>
>> --
>> Thanks,
>> Ryan Svihla
>
> --
> Thanks,
> Ryan Svihla
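For the LOCAL_ONE/LOCAL_QUORUM route Ryan suggests, a minimal Spark Cassandra connector sketch in Scala, using connector 1.x-era property names (the DC name, contact point, keyspace, and table are hypothetical, and the local_dc property may vary by connector version):

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._

    val conf = new SparkConf()
      .setAppName("etl-bulk-load")
      // contact point inside the analytics virtual data center (hypothetical)
      .set("spark.cassandra.connection.host", "10.0.1.10")
      // keep driver traffic pinned to the analytics DC
      .set("spark.cassandra.connection.local_dc", "analytics")
      // acknowledge a write once a quorum of local-DC replicas respond
      .set("spark.cassandra.output.consistency.level", "LOCAL_QUORUM")

    val sc = new SparkContext(conf)
    sc.parallelize(Seq((1, "a"), (2, "b")))
      .saveToCassandra("my_keyspace", "my_table", SomeColumns("id", "value"))

Writes are still replicated to the service DC in the background, but the job only blocks on analytics-DC replicas.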
Re: How to bulkload into a specific data center?
Just noticed you'd sent this to the dev list. This is a question for the user list only; please do not send questions of this type to the developer list.

On Thu, Jan 8, 2015 at 8:33 AM, Ryan Svihla r...@foundev.pro wrote:

> The nature of replication factor is such that writes will go wherever there is replication. If you want responses to be faster, and to not involve the REST data center in the Spark job's response path, I suggest using a CQL driver with the LOCAL_ONE or LOCAL_QUORUM consistency level (look at the Spark Cassandra connector here: https://github.com/datastax/spark-cassandra-connector). While write traffic will still be replicated to the REST service data center, because you do want those results available there, you will not be waiting on the remote data center to respond successfully.
>
> Final point: bulk loading sends a copy per replica across the wire. So let's say you have RF=3 in each data center; that means bulk loading will send out 6 copies from the client at once, whereas normal mutations via Thrift or CQL writes go out between data centers as 1 copy, and that node then forwards them on to the other replicas. This means inter-data-center traffic in this case would be 3x higher with the bulk loader than with a traditional CQL or Thrift based client.
>
> On Wed, Jan 7, 2015 at 6:32 PM, Benyi Wang bewang.t...@gmail.com wrote:
>
>> I set up two virtual data centers, one for analytics and one for a REST service. The analytics data center sits on top of a Hadoop cluster. I want to bulk load my ETL results into the analytics data center so that the REST service won't bear the heavy load. I'm using CQLTableInputFormat in my Spark application, and I gave the nodes in the analytics data center as the initial addresses. However, I found my jobs were connecting to the REST service data center. How can I specify the data center?
>
> --
> Thanks,
> Ryan Svihla

--
Thanks,
Ryan Svihla
Re: How to bulkload into a specific data center?
The nature of replication factor is such that writes will go wherever there is replication. If you want responses to be faster, and to not involve the REST data center in the Spark job's response path, I suggest using a CQL driver with the LOCAL_ONE or LOCAL_QUORUM consistency level (look at the Spark Cassandra connector here: https://github.com/datastax/spark-cassandra-connector). While write traffic will still be replicated to the REST service data center, because you do want those results available there, you will not be waiting on the remote data center to respond successfully.

Final point: bulk loading sends a copy per replica across the wire. So let's say you have RF=3 in each data center; that means bulk loading will send out 6 copies from the client at once, whereas normal mutations via Thrift or CQL writes go out between data centers as 1 copy, and that node then forwards them on to the other replicas. This means inter-data-center traffic in this case would be 3x higher with the bulk loader than with a traditional CQL or Thrift based client.

On Wed, Jan 7, 2015 at 6:32 PM, Benyi Wang bewang.t...@gmail.com wrote:

> I set up two virtual data centers, one for analytics and one for a REST service. The analytics data center sits on top of a Hadoop cluster. I want to bulk load my ETL results into the analytics data center so that the REST service won't bear the heavy load. I'm using CQLTableInputFormat in my Spark application, and I gave the nodes in the analytics data center as the initial addresses. However, I found my jobs were connecting to the REST service data center. How can I specify the data center?

--
Thanks,
Ryan Svihla
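For context on Ryan's "RF=3 in each data center" arithmetic: with NetworkTopologyStrategy and a replication factor of 3 per DC, every row has 6 replicas in total, which is why the bulk loader streams 6 copies. A sketch of such a keyspace definition (the keyspace and DC names are hypothetical; DC names must match what the snitch reports):

    CREATE KEYSPACE my_keyspace WITH replication = {
      'class': 'NetworkTopologyStrategy',
      'analytics': 3,
      'service': 3
    };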