Re: [PR] CASSANDRA-19563: Support bulk write via S3 [cassandra-analytics]

2024-04-22 Thread via GitHub


yifan-c merged PR #53:
URL: https://github.com/apache/cassandra-analytics/pull/53


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



Re: [PR] CASSANDRA-19563: Support bulk write via S3 [cassandra-analytics]

2024-04-18 Thread via GitHub


yifan-c commented on code in PR #53:
URL: 
https://github.com/apache/cassandra-analytics/pull/53#discussion_r1571206932


##
scripts/build-sidecar.sh:
##
@@ -24,7 +24,7 @@ else
   SCRIPT_DIR=$( dirname -- "$( readlink -f -- "$0"; )"; )
   
SIDECAR_REPO="${SIDECAR_REPO:-https://github.com/apache/cassandra-sidecar.git};
   SIDECAR_BRANCH="${SIDECAR_BRANCH:-trunk}"
-  SIDECAR_COMMIT="${SIDECAR_COMMIT:-20795db4d708b9287e0a2281695923bfb6fa9138}"
+  SIDECAR_COMMIT="${SIDECAR_COMMIT:-686d9e8699bd0a56857e24523d883120023b9841}"

Review Comment:
   Updated to `fd6f7ac5f9f19dbbeeb9e7f80ca1fcbf60d5a4c6`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



Re: [PR] CASSANDRA-19563: Support bulk write via S3 [cassandra-analytics]

2024-04-18 Thread via GitHub


yifan-c commented on code in PR #53:
URL: 
https://github.com/apache/cassandra-analytics/pull/53#discussion_r1571175753


##
cassandra-analytics-core/src/main/java/org/apache/cassandra/spark/bulkwriter/token/ConsistencyLevel.java:
##
@@ -19,16 +19,42 @@
 
 package org.apache.cassandra.spark.bulkwriter.token;
 
+import java.util.Collection;
+import java.util.Objects;
 import java.util.Set;
 
-import org.slf4j.Logger;
-import org.slf4j.LoggerFactory;
+import com.google.common.base.Preconditions;
+
+import org.apache.cassandra.spark.common.model.CassandraInstance;
+import org.apache.cassandra.spark.data.ReplicationFactor;
 
 public interface ConsistencyLevel
 {
+/**
+ * Whether the consistency level only considers replicas in the local data 
center.
+ *
+ * @return true if only considering the local replicas; otherwise, return 
false
+ */
 boolean isLocal();
 
-Logger LOGGER = LoggerFactory.getLogger(ConsistencyLevel.class);
+/**
+ * Check consistency level with the collection of the succeeded instances
+ *
+ * @param succeededInstances the succeeded instances in the replica set
+ * @param replicationFactor replication factor to check with
+ * @param localDC the local data center name if required for the check
+ * @return true means the consistency level is _definitively_ satisfied.
+ * Meanwhile, returning false means no conclusion can be drawn
+ */
+boolean canBeSatisfied(Collection 
succeededInstances,

Review Comment:
   Adding test in this patch.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



Re: [PR] CASSANDRA-19563: Support bulk write via S3 [cassandra-analytics]

2024-04-17 Thread via GitHub


yifan-c commented on code in PR #53:
URL: 
https://github.com/apache/cassandra-analytics/pull/53#discussion_r1569813654


##
scripts/build-sidecar.sh:
##
@@ -24,7 +24,7 @@ else
   SCRIPT_DIR=$( dirname -- "$( readlink -f -- "$0"; )"; )
   
SIDECAR_REPO="${SIDECAR_REPO:-https://github.com/apache/cassandra-sidecar.git};
   SIDECAR_BRANCH="${SIDECAR_BRANCH:-trunk}"
-  SIDECAR_COMMIT="${SIDECAR_COMMIT:-20795db4d708b9287e0a2281695923bfb6fa9138}"
+  SIDECAR_COMMIT="${SIDECAR_COMMIT:-686d9e8699bd0a56857e24523d883120023b9841}"

Review Comment:
   will do



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



Re: [PR] CASSANDRA-19563: Support bulk write via S3 [cassandra-analytics]

2024-04-17 Thread via GitHub


frankgh commented on code in PR #53:
URL: 
https://github.com/apache/cassandra-analytics/pull/53#discussion_r1569627320


##
cassandra-analytics-core/src/main/java/org/apache/cassandra/spark/bulkwriter/token/ConsistencyLevel.java:
##
@@ -19,16 +19,42 @@
 
 package org.apache.cassandra.spark.bulkwriter.token;
 
+import java.util.Collection;
+import java.util.Objects;
 import java.util.Set;
 
-import org.slf4j.Logger;
-import org.slf4j.LoggerFactory;
+import com.google.common.base.Preconditions;
+
+import org.apache.cassandra.spark.common.model.CassandraInstance;
+import org.apache.cassandra.spark.data.ReplicationFactor;
 
 public interface ConsistencyLevel
 {
+/**
+ * Whether the consistency level only considers replicas in the local data 
center.
+ *
+ * @return true if only considering the local replicas; otherwise, return 
false
+ */
 boolean isLocal();
 
-Logger LOGGER = LoggerFactory.getLogger(ConsistencyLevel.class);
+/**
+ * Check consistency level with the collection of the succeeded instances
+ *
+ * @param succeededInstances the succeeded instances in the replica set
+ * @param replicationFactor replication factor to check with
+ * @param localDC the local data center name if required for the check
+ * @return true means the consistency level is _definitively_ satisfied.
+ * Meanwhile, returning false means no conclusion can be drawn
+ */
+boolean canBeSatisfied(Collection 
succeededInstances,

Review Comment:
   it feels like this class should have unit tests. Maybe a follow up?



##
cassandra-analytics-core/src/test/java/org/apache/cassandra/spark/utils/XXHash32DigestAlgorithmTest.java:
##
@@ -49,7 +49,7 @@ class XXHash32DigestAlgorithmTest
 // $ xxh32sum file2.txt # -> ef976cbe
 // $ xxh32sum file3.txt # -> 08321e1e
 
-@ParameterizedTest(name = "{index} fileName={0} expectedMd5={1}")
+@ParameterizedTest(name = "{index} fileName={0} expectedDigest={1}")

Review Comment:
    
   



##
gradle.properties:
##
@@ -16,7 +16,7 @@
 # under the License.
 
 group=org.apache.cassandra.spark
-version=1.0.0
+version=1.0.0.36

Review Comment:
   this shouldn't change
   ```suggestion
   version=1.0.0
   ```



##
scripts/build-sidecar.sh:
##
@@ -47,22 +47,27 @@ else
   cd "${SIDECAR_BUILD_DIR}"
   echo "branch ${SIDECAR_BRANCH} sha ${SIDECAR_COMMIT}"
   # check out the correct cassandra version:
+  # if the SIDECAR_BRANCH directory does not exist; we initialize from the 
scratch

Review Comment:
   great comments!



##
cassandra-analytics-core/src/main/java/org/apache/cassandra/spark/bulkwriter/CancelJobEvent.java:
##
@@ -0,0 +1,38 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.cassandra.spark.bulkwriter;
+
+public class CancelJobEvent

Review Comment:
   can we add some javadocs?



##
gradle.properties:
##
@@ -31,5 +31,6 @@ spark=3
 # force version 4.5.1 of vertx to prevent issues initializing 
io.vertx.core.json.jackson.JacksonCodec,
 # which requires a newer version of jackson, which is not available in spark 2
 vertxVersion=4.5.1
+aswSdkVersion=2.20.43

Review Comment:
   this is trailing very behind in the current dependency. Should we bump this 
to the latest or a more recent version?



##
scripts/build-sidecar.sh:
##
@@ -24,7 +24,7 @@ else
   SCRIPT_DIR=$( dirname -- "$( readlink -f -- "$0"; )"; )
   
SIDECAR_REPO="${SIDECAR_REPO:-https://github.com/apache/cassandra-sidecar.git};
   SIDECAR_BRANCH="${SIDECAR_BRANCH:-trunk}"
-  SIDECAR_COMMIT="${SIDECAR_COMMIT:-20795db4d708b9287e0a2281695923bfb6fa9138}"
+  SIDECAR_COMMIT="${SIDECAR_COMMIT:-686d9e8699bd0a56857e24523d883120023b9841}"

Review Comment:
   It looks like there's a new SHA, should we bump it ?



##
cassandra-analytics-core/src/main/java/org/apache/cassandra/spark/bulkwriter/CassandraTopologyMonitor.java:
##
@@ -0,0 +1,107 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding 

[PR] CASSANDRA-19563: Support bulk write via S3 [cassandra-analytics]

2024-04-16 Thread via GitHub


yifan-c opened a new pull request, #53:
URL: https://github.com/apache/cassandra-analytics/pull/53

   This commit adds a configuration (writer) option to pick a transport other 
than the previously-implemented "direct upload to all sidecars" (now known as 
the "Direct" transport).  The second transport, now being implemented, is the 
"S3_COMPAT" transport, which allows the job to upload the generated SSTables to 
an S3-compatible storage system, and then inform the Cassandra Sidecar that 
those files are available for download & commit.
   
   Additionally, a plug-in system was added to allow communications between 
custom transport hooks and the job, so the custom hook can provide updated 
credentials and out-of-band status updates on S3-related issues.
   
   Co-Authored-By: Yifan Cai 
   Co-Authored-By: Doug Rohrer 
   Co-Authored-By: Francisco Guerrero 
   Co-Authored-By: Saranya Krishnakumar 
   
   Patch by Yifan Cai, Doug Rohrer, Francisco Guerrero, Saranya Krishnakumar; 
Reviewed by TBD for CASSANDRA-19563


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org