rdblue commented on a change in pull request #1663:
URL: https://github.com/apache/iceberg/pull/1663#discussion_r528029018



##########
File path: core/src/main/java/org/apache/iceberg/io/BaseDeltaWriter.java
##########
@@ -0,0 +1,248 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.io;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.function.Function;
+import org.apache.iceberg.DataFile;
+import org.apache.iceberg.DeleteFile;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.StructLike;
+import org.apache.iceberg.deletes.PositionDelete;
+import org.apache.iceberg.relocated.com.google.common.base.MoreObjects;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.relocated.com.google.common.collect.Sets;
+import org.apache.iceberg.types.TypeUtil;
+import org.apache.iceberg.util.StructLikeMap;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class BaseDeltaWriter<T> implements DeltaWriter<T> {
+  private static final Logger LOG = LoggerFactory.getLogger(BaseDeltaWriter.class);
+
+  private final RollingContentFileWriter<DataFile, T> dataWriter;
+  private final RollingContentFileWriter<DeleteFile, T> equalityDeleteWriter;
+  private final RollingContentFileWriter<DeleteFile, PositionDelete<T>> posDeleteWriter;
+
+  private final PositionDelete<T> positionDelete = new PositionDelete<>();
+  private final StructLikeMap<List<FilePos>> insertedRowMap;
+
+  // Function to convert the generic data to a StructLike.
+  private final Function<T, StructLike> structLikeFun;
+
+  public BaseDeltaWriter(RollingContentFileWriter<DataFile, T> dataWriter) {
+    this(dataWriter, null);
+  }
+
+  public BaseDeltaWriter(RollingContentFileWriter<DataFile, T> dataWriter,
+                         RollingContentFileWriter<DeleteFile, PositionDelete<T>> posDeleteWriter) {
+    this(dataWriter, posDeleteWriter, null, null, null, null);
+  }
+
+  public BaseDeltaWriter(RollingContentFileWriter<DataFile, T> dataWriter,
+                         RollingContentFileWriter<DeleteFile, PositionDelete<T>> posDeleteWriter,
+                         RollingContentFileWriter<DeleteFile, T> equalityDeleteWriter,
+                         Schema tableSchema,
+                         List<Integer> equalityFieldIds,
+                         Function<T, StructLike> structLikeFun) {
+
+    Preconditions.checkNotNull(dataWriter, "Data writer shouldn't be null.");
+
+    if (posDeleteWriter == null) {
+      // Only accept INSERT records.
+      Preconditions.checkArgument(equalityDeleteWriter == null);
+    }
+
+    if (posDeleteWriter != null && equalityDeleteWriter == null) {
+      // Only accept INSERT records and position deletion.
+      Preconditions.checkArgument(tableSchema == null);
+      Preconditions.checkArgument(equalityFieldIds == null);
+    }
+
+    if (equalityDeleteWriter != null) {
+      // Accept insert records, position deletion, equality deletions.
+      Preconditions.checkNotNull(posDeleteWriter,
+          "Position delete writer shouldn't be null when writing equality deletions.");
+      Preconditions.checkNotNull(tableSchema, "Iceberg table schema shouldn't be null");
+      Preconditions.checkNotNull(equalityFieldIds, "Equality field ids shouldn't be null");
+      Preconditions.checkNotNull(structLikeFun, "StructLike function shouldn't be null");
+
+      Schema deleteSchema = TypeUtil.select(tableSchema, Sets.newHashSet(equalityFieldIds));
+      this.insertedRowMap = StructLikeMap.create(deleteSchema.asStruct());
+      this.structLikeFun = structLikeFun;
+    } else {
+      this.insertedRowMap = null;
+      this.structLikeFun = null;
+    }
+
+    this.dataWriter = dataWriter;
+    this.equalityDeleteWriter = equalityDeleteWriter;
+    this.posDeleteWriter = posDeleteWriter;
+  }
+
+  @Override
+  public void writeRow(T row) throws IOException {
+    if (enableEqualityDelete()) {
+      FilePos filePos = FilePos.create(dataWriter.currentPath(), dataWriter.currentPos());
+      insertedRowMap.compute(structLikeFun.apply(row), (k, v) -> {
+        if (v == null) {
+          return Lists.newArrayList(filePos);
+        } else {
+          v.add(filePos);
+          return v;

Review comment:
       Also, now that I'm thinking about it, we should _never_ throw an exception for this data because that would break processing. Instead, we should probably send the record to some callback (default no-op), log a warning, and either delete the previous copy or ignore the duplication.
   
   The next question is: if we do expect duplicate inserts, then what is the right behavior? Should we make the second insert replace the first? Or just ignore the duplication?
   
   I'm leaning toward adding a delete to replace the record, but that would only work if the two inserts were in the same checkpoint. If they arrive a few minutes apart, then the data would be duplicated in the table. But, since we consider the records identical by the insert key, I think it is correct to add a position delete for the first record.
   
   If we choose to ignore the duplication, we would need to keep track of both insert positions in case we get INSERT(1), INSERT(1), DELETE(1): the delete would need to drop both rows, not just the second. That's why I would say that a duplicate insert should replace the previously inserted row. That way we only have to track one FilePos per key, which keeps the tracking logic simple.
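   To make the suggestion concrete, here is a rough sketch of what replace semantics could look like in `writeRow`. This is only an illustration, not this PR's implementation: it assumes `insertedRowMap` is narrowed to `StructLikeMap<FilePos>`, that `FilePos` exposes hypothetical `path()` and `rowOffset()` accessors, that `RollingContentFileWriter` has a `write(...)` method, and that `duplicateCallback` is a hypothetical `Consumer<T>` defaulting to a no-op.

```java
@Override
public void writeRow(T row) throws IOException {
  if (enableEqualityDelete()) {
    FilePos filePos = FilePos.create(dataWriter.currentPath(), dataWriter.currentPos());

    // Map#put returns the previous value, so one FilePos per key is enough.
    FilePos previous = insertedRowMap.put(structLikeFun.apply(row), filePos);
    if (previous != null) {
      // Never throw for duplicate keys: warn, notify the no-op-by-default
      // callback, and add a position delete so this insert replaces the
      // earlier row instead of duplicating it.
      LOG.warn("Duplicate insert key; replacing row at {} with {}", previous, filePos);
      duplicateCallback.accept(row);
      posDeleteWriter.write(positionDelete.set(previous.path(), previous.rowOffset(), row));
    }
  }

  dataWriter.write(row);
}
```

   With replace semantics the map stays bounded at one entry per key, and a later equality delete for that key only has to remove the single surviving position.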




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
