Re: [PR] Handle map and array types binary record data [flink-cdc]

via GitHub Fri, 05 Jul 2024 11:48:03 -0700


umeshdangat commented on code in PR #3434:
URL: https://github.com/apache/flink-cdc/pull/3434#discussion_r1667059138



##########
flink-cdc-common/src/main/java/org/apache/flink/cdc/common/data/binary/BinaryArrayData.java:
##########
@@ -0,0 +1,574 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.cdc.common.data.binary;
+
+import org.apache.flink.cdc.common.data.ArrayData;
+import org.apache.flink.cdc.common.data.DecimalData;
+import org.apache.flink.cdc.common.data.LocalZonedTimestampData;
+import org.apache.flink.cdc.common.data.MapData;
+import org.apache.flink.cdc.common.data.RecordData;
+import org.apache.flink.cdc.common.data.StringData;
+import org.apache.flink.cdc.common.data.TimestampData;
+import org.apache.flink.cdc.common.data.ZonedTimestampData;
+import org.apache.flink.cdc.common.types.DataType;
+import org.apache.flink.cdc.common.types.utils.DataTypeUtils;
+import org.apache.flink.core.memory.MemorySegment;
+import org.apache.flink.core.memory.MemorySegmentFactory;
+
+import java.lang.reflect.Array;
+
+import static org.apache.flink.core.memory.MemoryUtils.UNSAFE;
+
+/**
+ * A binary implementation of {@link ArrayData} which is backed by {@link MemorySegment}s.
+ *
+ * <p>For fields that hold fixed-length primitive types, such as long, double or int, they are
+ * stored compacted in bytes, just like the original java array.
+ *
+ * <p>The binary layout of {@link BinaryArrayData}:
+ *
+ * <pre>
+ * [size(int)] + [null bits(4-byte word boundaries)] + [values or offset&length] + [variable length part].
+ * </pre>
+ */
+public final class BinaryArrayData extends BinarySection implements ArrayData {
+
+    /** Offset for Arrays. */
+    private static final int BYTE_ARRAY_BASE_OFFSET = UNSAFE.arrayBaseOffset(byte[].class);
+
+    private static final int BOOLEAN_ARRAY_OFFSET = UNSAFE.arrayBaseOffset(boolean[].class);
+    private static final int SHORT_ARRAY_OFFSET = UNSAFE.arrayBaseOffset(short[].class);
+    private static final int INT_ARRAY_OFFSET = UNSAFE.arrayBaseOffset(int[].class);
+    private static final int LONG_ARRAY_OFFSET = UNSAFE.arrayBaseOffset(long[].class);
+    private static final int FLOAT_ARRAY_OFFSET = UNSAFE.arrayBaseOffset(float[].class);
+    private static final int DOUBLE_ARRAY_OFFSET = UNSAFE.arrayBaseOffset(double[].class);
+
+    public static int calculateHeaderInBytes(int numFields) {
+        return 4 + ((numFields + 31) / 32) * 4;

Review Comment:
   As per my understanding:
   The `calculateHeaderInBytes` method determines the size, in bytes, of the header of a binary array. The header consists of the size of the array AND a bitmap indicating which elements are null.
   
   - Fixed-size part: The 4 at the beginning is the number of bytes used to store the size of the array. This is a fixed overhead for any binary array, regardless of the number of elements.
   - Null bitmap: The second part of the calculation is `((numFields + 31) / 32) * 4`. This determines the number of bytes required to store the null bitmap.
   
   
   **Why ((numFields + 31) / 32) * 4?**
   
   - numFields: the number of elements in the array. Each element needs 1 bit in the null bitmap.
   - "+ 31" and "/ 32": together these form a ceiling division. Java integer division truncates, so adding 31 first rounds the bit count up to the next multiple of 32 (a word boundary), and dividing by 32 then gives the number of 32-bit words needed.
   - "* 4": converts the number of words to bytes.
   
   Example:
   numFields = 64
   1) the fixed part is 4 bytes holding the integer value 64
   2) (64 + 31) / 32 = 95 / 32 = 2 (integer division truncates 2.96875)
        then convert words to bytes: 2 * 4 = 8 bytes for the bitmap
   
   So for 64 entries we have
   4 bytes fixed size plus
   8 bytes for the bitmap (that is 8 * 8 = 64 bits, exactly enough for 64 entries),
   giving a 12-byte header. A 65th entry would add one more 4-byte word.
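   
   To check the arithmetic, here is a minimal sketch that copies the formula from the diff into a standalone class and prints the header size for a few element counts. The class name `HeaderSizeDemo` is made up for illustration and is not part of the PR.
   
   ```java
   // Hypothetical standalone class; only the formula itself comes from the PR.
   final class HeaderSizeDemo {
   
       // Same expression as BinaryArrayData.calculateHeaderInBytes in the diff:
       // 4 bytes for the size, plus one 4-byte word per started group of 32 elements.
       static int calculateHeaderInBytes(int numFields) {
           return 4 + ((numFields + 31) / 32) * 4;
       }
   
       public static void main(String[] args) {
           int[] counts = {0, 1, 32, 33, 64, 65};
           for (int n : counts) {
               int header = calculateHeaderInBytes(n);
               int bitmapBytes = header - 4; // everything after the size field
               System.out.println(
                       n + " elements -> " + bitmapBytes + " bitmap bytes, " + header + " header bytes");
           }
       }
   }
   ```
   
   Because Java `int` division truncates, 32 elements still fit in one word (8-byte header), while 33 elements need a second word (12-byte header).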
   
   Overall, the formula at the top of this file
   
   `[size(int)] + [null bits(4-byte word boundaries)] + [values or offset&length] + [variable length part].`
   
   represents how the data is stored.
   Example 1: array of 64 integers
   [4 bytes (size)] + [8 bytes (null bits, two 32-bit words)] + [256 bytes (values)] = 268 bytes total, not counting any alignment padding a writer may add
   
   Example 2: array of 64 variable-length elements
   [4 bytes (size)] + [8 bytes (null bits, two 32-bit words)] + [64 * 8 bytes (offset & length)] + [variable length data]
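   
   The two layout examples can be sketched the same way. The helper `fixedPartBytes` and the class `ArrayLayoutDemo` are hypothetical names; the sketch only adds up the header and the per-element slots, and it assumes 4 bytes per int and 8 bytes per offset & length slot, as in the examples above. It ignores the variable-length part and any alignment padding.
   
   ```java
   // Hypothetical helper class for illustration; not part of the PR.
   final class ArrayLayoutDemo {
   
       // Formula copied from the diff.
       static int calculateHeaderInBytes(int numFields) {
           return 4 + ((numFields + 31) / 32) * 4;
       }
   
       // Header plus one fixed-width slot per element (value, or offset & length).
       static int fixedPartBytes(int numElements, int slotSizeInBytes) {
           return calculateHeaderInBytes(numElements) + numElements * slotSizeInBytes;
       }
   
       public static void main(String[] args) {
           // Example 1: 64 ints, 4 bytes each.
           System.out.println("64 ints: " + fixedPartBytes(64, 4) + " bytes");
           // Example 2: 64 variable-length elements, 8 bytes of offset & length each.
           System.out.println("64 var-length slots: " + fixedPartBytes(64, 8) + " bytes + data");
       }
   }
   ```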
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

