Re: How to bulkload into a specific data center?

2015-01-10 Thread Benyi Wang
On Fri, Jan 9, 2015 at 3:55 PM, Robert Coli rc...@eventbrite.com wrote:

 On Fri, Jan 9, 2015 at 11:38 AM, Benyi Wang bewang.t...@gmail.com wrote:


- Is it possible to modify SSTableLoader to allow it to access only one
data center?

 Even if you only write to nodes in DC A, if you replicate that data to DC
 B, it will have to travel over the WAN anyway? What are you trying to avoid?



Luckily, those are virtual data centers on the same LAN.

I just don't want a load burst in the service virtual data center,
because it may degrade the REST service. I'm trying to load data into the
analytics virtual data center and then let Cassandra slowly replicate the
data into the service virtual data center. It is OK for the REST service
to read some stale data while the replication catches up.

I'm wondering if I should just use sstableloader's throttle option
("throttle speed in Mbits") to solve my problem.
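
For what it's worth, the two knobs I'm looking at (a sketch based on the
command-line help; flag names and units may differ between Cassandra
versions, and the host names are placeholders) are:

    # throttle sstableloader itself to roughly 100 Mbit/s
    sstableloader --throttle 100 -d analytics-node-1,analytics-node-2 \
        /path/to/keyspace/table

    # or cap streaming throughput on the server side (Mbit/s)
    nodetool setstreamthroughput 100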

 Because I may load ~100 million rows, I think spark-cassandra-connector
 might be too slow. I'm wondering whether the copy-the-sstables / "nodetool
 refresh" method described at
 http://www.pythian.com/blog/bulk-loading-options-for-cassandra/ would be a
 good choice. I'm still a newbie to Cassandra and could not understand what
 the author said on that page.


 The author of that post is as wise as he is modest... ;D


 One of my questions is:

 * When I run a Spark job in YARN mode, the SSTables are created in the
 YARN working directory.
 * Assume I have a way to copy those files into the Cassandra data
 directory on the same node.
 * Because the data are distributed across all of the analytics data
 center's nodes, each one has only part of the SSTables: node A has part A,
 node B has part B. If I run refresh on each node, will node A eventually
 have parts A and B, and node B have parts A and B too? Am I right?


 I'm not sure I fully understand your question, but...

 In order to run refresh without having to immediately run cleanup, you
 need to have SSTables which contain data only for ranges owned by the node
 you are loading them onto.

 So for a RF=3, N=3 cluster without vnodes (simple case), data is naturally
 on every node.

 For RF=3, N=6 cluster A B C D E F, node C contains :

 - Third replica for A.
 - Second replica for B.
 - First replica for C.

 In order to generate the correct SSTables for a node, you need to know all
 3 replica ranges that should live on it. With vnodes and nodes joining and
 leaving, this becomes more difficult.

 That's why people tend to use sstableloader and the streaming interface:
 with sstableloader, Cassandra takes input which might live on any replica
 and sends it to the appropriate nodes.

 =Rob
 http://twitter.com/rcolidba


I'd better stay with SSTableLoader. Thanks for your explanation.


Re: How to bulkload into a specific data center?

2015-01-09 Thread Robert Coli
On Fri, Jan 9, 2015 at 11:38 AM, Benyi Wang bewang.t...@gmail.com wrote:


- Is it possible to modify SSTableLoader to allow it to access only one
data center?

 Even if you only write to nodes in DC A, if you replicate that data to DC
B, it will have to travel over the WAN anyway? What are you trying to avoid?


 Because I may load ~100 million rows, I think spark-cassandra-connector
 might be too slow. I'm wondering whether the copy-the-sstables / "nodetool
 refresh" method described at
 http://www.pythian.com/blog/bulk-loading-options-for-cassandra/ would be a
 good choice. I'm still a newbie to Cassandra and could not understand what
 the author said on that page.


The author of that post is as wise as he is modest... ;D


 One of my questions is:

 * When I run a Spark job in YARN mode, the SSTables are created in the
 YARN working directory.
 * Assume I have a way to copy those files into the Cassandra data
 directory on the same node.
 * Because the data are distributed across all of the analytics data
 center's nodes, each one has only part of the SSTables: node A has part A,
 node B has part B. If I run refresh on each node, will node A eventually
 have parts A and B, and node B have parts A and B too? Am I right?


I'm not sure I fully understand your question, but...

In order to run refresh without having to immediately run cleanup, you need
to have SSTables which contain data only for ranges owned by the node you
are loading them onto.

So for a RF=3, N=3 cluster without vnodes (simple case), data is naturally
on every node.

For RF=3, N=6 cluster A B C D E F, node C contains :

- Third replica for A.
- Second replica for B.
- First replica for C.

In order to generate the correct SSTables for a node, you need to know all
3 replica ranges that should live on it. With vnodes and nodes joining and
leaving, this becomes more difficult.
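
If you want to see that placement for a concrete partition, nodetool can
print the replicas that own a given key (keyspace, table and key here are
just placeholders):

    nodetool getendpoints my_keyspace my_table 42

With vnodes each node owns many small ranges, so the owning replicas change
from key to key, which is what makes hand-building per-node SSTables hard.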

That's why people tend to use sstableloader and the streaming interface:
with sstableloader, Cassandra takes input which might live on any replica
and sends it to the appropriate nodes.
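
For reference, a minimal sketch of that workflow in Scala: write SSTables
with CQLSSTableWriter, then point the sstableloader tool at the output
directory and let it stream each row to whichever replicas own it. The
keyspace, table, schema and paths are hypothetical, and depending on your
Cassandra version you may also need to set a partitioner on the builder.

    import org.apache.cassandra.io.sstable.CQLSSTableWriter

    object GenerateSSTables {
      def main(args: Array[String]): Unit = {
        // Hypothetical schema; it must match the table you will load into.
        val schema =
          "CREATE TABLE analytics.events (id int PRIMARY KEY, value text)"
        val insert =
          "INSERT INTO analytics.events (id, value) VALUES (?, ?)"

        // sstableloader expects the last two path components to be
        // <keyspace>/<table>.
        val writer = CQLSSTableWriter.builder()
          .inDirectory("/tmp/bulk/analytics/events")
          .forTable(schema)
          .using(insert)
          .build()

        // Bound values are passed in the order of the INSERT statement.
        writer.addRow(Int.box(1), "first")
        writer.addRow(Int.box(2), "second")
        writer.close()

        // Afterwards: sstableloader -d <contact host> /tmp/bulk/analytics/events
      }
    }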

=Rob
http://twitter.com/rcolidba


Re: How to bulkload into a specific data center?

2015-01-09 Thread Benyi Wang
Hi Ryan,

Thanks for your reply. Now I understand how SSTableLoader works.

   - If I understand correctly, the current o.a.c.io.sstable.SSTableLoader
   doesn't use LOCAL_ONE or LOCAL_QUORUM. Is that right?
   - Is it possible to modify SSTableLoader to allow it to access only one
   data center?

Because I may load ~100 million rows, I think spark-cassandra-connector
might be too slow. I'm wondering whether the copy-the-sstables / "nodetool
refresh" method described at
http://www.pythian.com/blog/bulk-loading-options-for-cassandra/ would be a
good choice. I'm still a newbie to Cassandra and could not understand what
the author said on that page. One of my questions is:

* When I run a Spark job in YARN mode, the SSTables are created in the YARN
working directory.
* Assume I have a way to copy those files into the Cassandra data directory
on the same node.
* Because the data are distributed across all of the analytics data
center's nodes, each one has only part of the SSTables: node A has part A,
node B has part B. If I run refresh on each node, will node A eventually
have parts A and B, and node B have parts A and B too? Am I right?

Thanks.

On Thu, Jan 8, 2015 at 6:34 AM, Ryan Svihla r...@foundev.pro wrote:

 Just noticed you'd sent this to the dev list. This is a question for the
 user list only; please do not send questions of this type to the
 developer list.

 On Thu, Jan 8, 2015 at 8:33 AM, Ryan Svihla r...@foundev.pro wrote:

  The nature of replication factor is such that writes will go wherever
  there is replication. If you want responses to be faster and don't want
  the Spark job to wait on the REST data center, I suggest using a CQL
  driver with the LOCAL_ONE or LOCAL_QUORUM consistency level (look at the
  Spark Cassandra connector here:
  https://github.com/datastax/spark-cassandra-connector). Write traffic
  will still be replicated to the REST service data center, because you do
  want those results available, but you will not be waiting on the remote
  data center to respond successfully.
 
  Final point: bulk loading sends a copy per replica across the wire. Say
  you have RF=3 in each data center; bulk loading will then send out 6
  copies from that client at once. With normal mutations via Thrift or CQL,
  a write goes between data centers as 1 copy, and that node then forwards
  it on to the other replicas. This means cross-data-center traffic in this
  case would be 3x more with the bulk loader than with a traditional CQL or
  Thrift based client.
 
 
 
  On Wed, Jan 7, 2015 at 6:32 PM, Benyi Wang bewang.t...@gmail.com
 wrote:
 
  I set up two virtual data centers, one for analytics and one for the REST
  service. The analytics data center sits on top of the Hadoop cluster. I
  want to bulk load my ETL results into the analytics data center so that
  the REST service won't take the heavy load. I'm using CQLTableInputFormat
  in my Spark application, and I gave the nodes in the analytics data
  center as the initial addresses.
 
  However, I found my jobs were connecting to the REST service data
 center.
 
  How can I specify the data center?
 
 
 
 
  --
 
  Thanks,
  Ryan Svihla
 
 


 --

 Thanks,
 Ryan Svihla



Re: How to bulkload into a specific data center?

2015-01-08 Thread Ryan Svihla
Just noticed you'd sent this to the dev list. This is a question for the
user list only; please do not send questions of this type to the
developer list.

On Thu, Jan 8, 2015 at 8:33 AM, Ryan Svihla r...@foundev.pro wrote:

 The nature of replication factor is such that writes will go wherever
 there is replication. If you want responses to be faster and don't want
 the Spark job to wait on the REST data center, I suggest using a CQL
 driver with the LOCAL_ONE or LOCAL_QUORUM consistency level (look at the
 Spark Cassandra connector here:
 https://github.com/datastax/spark-cassandra-connector). Write traffic
 will still be replicated to the REST service data center, because you do
 want those results available, but you will not be waiting on the remote
 data center to respond successfully.

 Final point: bulk loading sends a copy per replica across the wire. Say
 you have RF=3 in each data center; bulk loading will then send out 6
 copies from that client at once. With normal mutations via Thrift or CQL,
 a write goes between data centers as 1 copy, and that node then forwards
 it on to the other replicas. This means cross-data-center traffic in this
 case would be 3x more with the bulk loader than with a traditional CQL or
 Thrift based client.



 On Wed, Jan 7, 2015 at 6:32 PM, Benyi Wang bewang.t...@gmail.com wrote:

 I set up two virtual data centers, one for analytics and one for the REST
 service. The analytics data center sits on top of the Hadoop cluster. I
 want to bulk load my ETL results into the analytics data center so that
 the REST service won't take the heavy load. I'm using CQLTableInputFormat
 in my Spark application, and I gave the nodes in the analytics data center
 as the initial addresses.

 However, I found my jobs were connecting to the REST service data center.

 How can I specify the data center?




 --

 Thanks,
 Ryan Svihla




-- 

Thanks,
Ryan Svihla


Re: How to bulkload into a specific data center?

2015-01-08 Thread Ryan Svihla
The nature of replication factor is such that writes will go wherever there
is replication. If you want responses to be faster and don't want the Spark
job to wait on the REST data center, I suggest using a CQL driver with the
LOCAL_ONE or LOCAL_QUORUM consistency level (look at the Spark Cassandra
connector here: https://github.com/datastax/spark-cassandra-connector).
Write traffic will still be replicated to the REST service data center,
because you do want those results available, but you will not be waiting on
the remote data center to respond successfully.
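
For example, a minimal sketch with the Spark Cassandra connector (contact
points, keyspace, table and column names are placeholders, and the exact
property names can differ between connector releases, so check the docs for
the version you use):

    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}

    object BulkWriteToAnalyticsDc {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("etl-to-cassandra")
          // Contact points in the analytics data center only.
          .set("spark.cassandra.connection.host",
               "analytics-node-1,analytics-node-2")
          // Acknowledge each write after one replica in the local DC responds.
          .set("spark.cassandra.output.consistency.level", "LOCAL_ONE")

        val sc = new SparkContext(conf)

        // Placeholder for the real ETL output RDD.
        val etlResults = sc.parallelize(Seq((1, "first"), (2, "second")))

        // Writes are acknowledged by the local DC; Cassandra replicates them
        // to the REST data center asynchronously.
        etlResults.saveToCassandra("analytics", "events",
                                   SomeColumns("id", "value"))

        sc.stop()
      }
    }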

Final point: bulk loading sends a copy per replica across the wire. Say you
have RF=3 in each data center; bulk loading will then send out 6 copies
from that client at once. With normal mutations via Thrift or CQL, a write
goes between data centers as 1 copy, and that node then forwards it on to
the other replicas. This means cross-data-center traffic in this case would
be 3x more with the bulk loader than with a traditional CQL or Thrift based
client.



On Wed, Jan 7, 2015 at 6:32 PM, Benyi Wang bewang.t...@gmail.com wrote:

 I set up two virtual data centers, one for analytics and one for the REST
 service. The analytics data center sits on top of the Hadoop cluster. I
 want to bulk load my ETL results into the analytics data center so that
 the REST service won't take the heavy load. I'm using CQLTableInputFormat
 in my Spark application, and I gave the nodes in the analytics data center
 as the initial addresses.

 However, I found my jobs were connecting to the REST service data center.

 How can I specify the data center?




-- 

Thanks,
Ryan Svihla