Re: [PR] CASSANDRA-19563: Support bulk write via S3 [cassandra-analytics]
yifan-c merged PR #53: URL: https://github.com/apache/cassandra-analytics/pull/53 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
Re: [PR] CASSANDRA-19563: Support bulk write via S3 [cassandra-analytics]
yifan-c commented on code in PR #53: URL: https://github.com/apache/cassandra-analytics/pull/53#discussion_r1571206932 ## scripts/build-sidecar.sh: ## @@ -24,7 +24,7 @@ else SCRIPT_DIR=$( dirname -- "$( readlink -f -- "$0"; )"; ) SIDECAR_REPO="${SIDECAR_REPO:-https://github.com/apache/cassandra-sidecar.git}; SIDECAR_BRANCH="${SIDECAR_BRANCH:-trunk}" - SIDECAR_COMMIT="${SIDECAR_COMMIT:-20795db4d708b9287e0a2281695923bfb6fa9138}" + SIDECAR_COMMIT="${SIDECAR_COMMIT:-686d9e8699bd0a56857e24523d883120023b9841}" Review Comment: Updated to `fd6f7ac5f9f19dbbeeb9e7f80ca1fcbf60d5a4c6` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
Re: [PR] CASSANDRA-19563: Support bulk write via S3 [cassandra-analytics]
yifan-c commented on code in PR #53: URL: https://github.com/apache/cassandra-analytics/pull/53#discussion_r1571175753 ## cassandra-analytics-core/src/main/java/org/apache/cassandra/spark/bulkwriter/token/ConsistencyLevel.java: ## @@ -19,16 +19,42 @@ package org.apache.cassandra.spark.bulkwriter.token; +import java.util.Collection; +import java.util.Objects; import java.util.Set; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; +import com.google.common.base.Preconditions; + +import org.apache.cassandra.spark.common.model.CassandraInstance; +import org.apache.cassandra.spark.data.ReplicationFactor; public interface ConsistencyLevel { +/** + * Whether the consistency level only considers replicas in the local data center. + * + * @return true if only considering the local replicas; otherwise, return false + */ boolean isLocal(); -Logger LOGGER = LoggerFactory.getLogger(ConsistencyLevel.class); +/** + * Check consistency level with the collection of the succeeded instances + * + * @param succeededInstances the succeeded instances in the replica set + * @param replicationFactor replication factor to check with + * @param localDC the local data center name if required for the check + * @return true means the consistency level is _definitively_ satisfied. + * Meanwhile, returning false means no conclusion can be drawn + */ +boolean canBeSatisfied(Collection succeededInstances, Review Comment: Adding test in this patch. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
Re: [PR] CASSANDRA-19563: Support bulk write via S3 [cassandra-analytics]
yifan-c commented on code in PR #53: URL: https://github.com/apache/cassandra-analytics/pull/53#discussion_r1569813654 ## scripts/build-sidecar.sh: ## @@ -24,7 +24,7 @@ else SCRIPT_DIR=$( dirname -- "$( readlink -f -- "$0"; )"; ) SIDECAR_REPO="${SIDECAR_REPO:-https://github.com/apache/cassandra-sidecar.git}; SIDECAR_BRANCH="${SIDECAR_BRANCH:-trunk}" - SIDECAR_COMMIT="${SIDECAR_COMMIT:-20795db4d708b9287e0a2281695923bfb6fa9138}" + SIDECAR_COMMIT="${SIDECAR_COMMIT:-686d9e8699bd0a56857e24523d883120023b9841}" Review Comment: will do -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
Re: [PR] CASSANDRA-19563: Support bulk write via S3 [cassandra-analytics]
frankgh commented on code in PR #53: URL: https://github.com/apache/cassandra-analytics/pull/53#discussion_r1569627320 ## cassandra-analytics-core/src/main/java/org/apache/cassandra/spark/bulkwriter/token/ConsistencyLevel.java: ## @@ -19,16 +19,42 @@ package org.apache.cassandra.spark.bulkwriter.token; +import java.util.Collection; +import java.util.Objects; import java.util.Set; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; +import com.google.common.base.Preconditions; + +import org.apache.cassandra.spark.common.model.CassandraInstance; +import org.apache.cassandra.spark.data.ReplicationFactor; public interface ConsistencyLevel { +/** + * Whether the consistency level only considers replicas in the local data center. + * + * @return true if only considering the local replicas; otherwise, return false + */ boolean isLocal(); -Logger LOGGER = LoggerFactory.getLogger(ConsistencyLevel.class); +/** + * Check consistency level with the collection of the succeeded instances + * + * @param succeededInstances the succeeded instances in the replica set + * @param replicationFactor replication factor to check with + * @param localDC the local data center name if required for the check + * @return true means the consistency level is _definitively_ satisfied. + * Meanwhile, returning false means no conclusion can be drawn + */ +boolean canBeSatisfied(Collection succeededInstances, Review Comment: it feels like this class should have unit tests. Maybe a follow up? ## cassandra-analytics-core/src/test/java/org/apache/cassandra/spark/utils/XXHash32DigestAlgorithmTest.java: ## @@ -49,7 +49,7 @@ class XXHash32DigestAlgorithmTest // $ xxh32sum file2.txt # -> ef976cbe // $ xxh32sum file3.txt # -> 08321e1e -@ParameterizedTest(name = "{index} fileName={0} expectedMd5={1}") +@ParameterizedTest(name = "{index} fileName={0} expectedDigest={1}") Review Comment: ## gradle.properties: ## @@ -16,7 +16,7 @@ # under the License. group=org.apache.cassandra.spark -version=1.0.0 +version=1.0.0.36 Review Comment: this shouldn't change ```suggestion version=1.0.0 ``` ## scripts/build-sidecar.sh: ## @@ -47,22 +47,27 @@ else cd "${SIDECAR_BUILD_DIR}" echo "branch ${SIDECAR_BRANCH} sha ${SIDECAR_COMMIT}" # check out the correct cassandra version: + # if the SIDECAR_BRANCH directory does not exist; we initialize from the scratch Review Comment: great comments! ## cassandra-analytics-core/src/main/java/org/apache/cassandra/spark/bulkwriter/CancelJobEvent.java: ## @@ -0,0 +1,38 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.cassandra.spark.bulkwriter; + +public class CancelJobEvent Review Comment: can we add some javadocs? ## gradle.properties: ## @@ -31,5 +31,6 @@ spark=3 # force version 4.5.1 of vertx to prevent issues initializing io.vertx.core.json.jackson.JacksonCodec, # which requires a newer version of jackson, which is not available in spark 2 vertxVersion=4.5.1 +aswSdkVersion=2.20.43 Review Comment: this is trailing very behind in the current dependency. Should we bump this to the latest or a more recent version? ## scripts/build-sidecar.sh: ## @@ -24,7 +24,7 @@ else SCRIPT_DIR=$( dirname -- "$( readlink -f -- "$0"; )"; ) SIDECAR_REPO="${SIDECAR_REPO:-https://github.com/apache/cassandra-sidecar.git}; SIDECAR_BRANCH="${SIDECAR_BRANCH:-trunk}" - SIDECAR_COMMIT="${SIDECAR_COMMIT:-20795db4d708b9287e0a2281695923bfb6fa9138}" + SIDECAR_COMMIT="${SIDECAR_COMMIT:-686d9e8699bd0a56857e24523d883120023b9841}" Review Comment: It looks like there's a new SHA, should we bump it ? ## cassandra-analytics-core/src/main/java/org/apache/cassandra/spark/bulkwriter/CassandraTopologyMonitor.java: ## @@ -0,0 +1,107 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding
[PR] CASSANDRA-19563: Support bulk write via S3 [cassandra-analytics]
yifan-c opened a new pull request, #53: URL: https://github.com/apache/cassandra-analytics/pull/53 This commit adds a configuration (writer) option to pick a transport other than the previously-implemented "direct upload to all sidecars" (now known as the "Direct" transport). The second transport, now being implemented, is the "S3_COMPAT" transport, which allows the job to upload the generated SSTables to an S3-compatible storage system, and then inform the Cassandra Sidecar that those files are available for download & commit. Additionally, a plug-in system was added to allow communications between custom transport hooks and the job, so the custom hook can provide updated credentials and out-of-band status updates on S3-related issues. Co-Authored-By: Yifan Cai Co-Authored-By: Doug Rohrer Co-Authored-By: Francisco Guerrero Co-Authored-By: Saranya Krishnakumar Patch by Yifan Cai, Doug Rohrer, Francisco Guerrero, Saranya Krishnakumar; Reviewed by TBD for CASSANDRA-19563 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org