[GitHub] [beam] vmarquez commented on a change in pull request #10546: [BEAM-9008] Add CassandraIO readAll method

GitBox Tue, 02 Jun 2020 22:37:45 -0700


vmarquez commented on a change in pull request #10546:
URL: https://github.com/apache/beam/pull/10546#discussion_r434318532




##########
File path: sdks/java/io/cassandra/src/main/java/org/apache/beam/sdk/io/cassandra/CassandraIO.java
##########
@@ -326,7 +371,78 @@ private CassandraIO() {}
       checkArgument(entity() != null, "withEntity() is required");
       checkArgument(coder() != null, "withCoder() is required");
 
-      return input.apply(org.apache.beam.sdk.io.Read.from(new CassandraSource<>(this, null)));
+      ReadAll<T> readAll = CassandraIO.<T>readAll().withCoder(this.coder());
+
+      return input
+          .apply(Create.of(this))
+          .apply(ParDo.of(new SplitFn()))
+          .setCoder(SerializableCoder.of(new TypeDescriptor<Read<T>>() {}))
+          // .apply(Reshuffle.viaRandomKey())
+          .apply(readAll);
+    }
+
+    private class SplitFn extends DoFn<Read<T>, Read<T>> {
+
+      @ProcessElement
+      public void process(
+          @Element CassandraIO.Read<T> read, OutputReceiver<Read<T>> outputReceiver) {
+
+        try (Cluster cluster =
+            getCluster(
+                read.hosts(),
+                read.port(),
+                read.username(),
+                read.password(),
+                read.localDc(),
+                read.consistencyLevel())) {
+          if (isMurmur3Partitioner(cluster)) {
+            LOG.info("Murmur3Partitioner detected, splitting");
+
+            List<BigInteger> tokens =
+                cluster.getMetadata().getTokenRanges().stream()
+                    .map(tokenRange -> new BigInteger(tokenRange.getEnd().getValue().toString()))
+                    .collect(Collectors.toList());
+            Integer splitCount = cluster.getMetadata().getAllHosts().size();
+            if (read.minNumberOfSplits() != null && read.minNumberOfSplits().get() != null) {
+              splitCount = read.minNumberOfSplits().get();
+            }
+
+            SplitGenerator splitGenerator =
+                new SplitGenerator(cluster.getMetadata().getPartitioner());
+            splitGenerator
+                .generateSplits(splitCount, tokens)
+                .forEach(
+                    rr ->
+                        outputReceiver.output(
+                            CassandraIO.<T>read()

Review comment:
       I think we still need to create a new `Read<T>` here, because 
`SplitGenerator` returns a `List<List<RingRange>>`: each element of the outer 
list becomes a separate `Read`, and the inner list is set as that `Read`'s 
ring ranges.
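
To make the shape of that mapping concrete, here is a minimal, self-contained sketch. It does not use the real Beam or Cassandra classes: `RingRange` and `ReadSpec` below are hypothetical stand-ins for the actual `RingRange` and `CassandraIO.Read<T>`, and `toReads` is an illustrative helper name, not something in the PR. It only shows the one-`Read`-per-outer-element idea.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class SplitToReadsSketch {

  /** Hypothetical stand-in for a token range on the Cassandra ring. */
  static final class RingRange {
    final BigInteger start;
    final BigInteger end;

    RingRange(long start, long end) {
      this.start = BigInteger.valueOf(start);
      this.end = BigInteger.valueOf(end);
    }
  }

  /** Hypothetical stand-in for CassandraIO.Read<T>; it carries only its ring ranges. */
  static final class ReadSpec {
    final List<RingRange> ringRanges;

    ReadSpec(List<RingRange> ringRanges) {
      // Defensive copy so each ReadSpec owns an immutable view of its ranges.
      this.ringRanges = Collections.unmodifiableList(new ArrayList<>(ringRanges));
    }
  }

  /** One ReadSpec per outer list element; the inner list becomes that spec's ranges. */
  static List<ReadSpec> toReads(List<List<RingRange>> splits) {
    List<ReadSpec> reads = new ArrayList<>();
    for (List<RingRange> split : splits) {
      reads.add(new ReadSpec(split));
    }
    return reads;
  }

  public static void main(String[] args) {
    // Two splits: the first covers two ranges, the second covers one.
    List<List<RingRange>> splits =
        Arrays.asList(
            Arrays.asList(new RingRange(0, 100), new RingRange(100, 200)),
            Arrays.asList(new RingRange(200, 300)));

    List<ReadSpec> reads = toReads(splits);
    for (int i = 0; i < reads.size(); i++) {
      System.out.println("read " + i + " covers " + reads.get(i).ringRanges.size() + " range(s)");
    }
  }
}
```

In the actual `SplitFn`, the equivalent of `new ReadSpec(split)` would be a fresh `Read<T>` built from the incoming `read` with the inner list applied, and each one would be emitted through `outputReceiver.output(...)` rather than collected into a list.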




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
