yihua commented on a change in pull request #3952:
URL: https://github.com/apache/hudi/pull/3952#discussion_r757629889



##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/optimize/HilbertCurve.java
##########
@@ -0,0 +1,290 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.optimize;
+
+import java.math.BigInteger;
+import java.util.Arrays;
+
+/**
+ * Converts between a Hilbert index ({@code BigInteger}) and an N-dimensional point.
+ *
+ * Note: adapted from
+ * <a href="https://github.com/davidmoten/hilbert-curve/blob/master/src/main/java/org/davidmoten/hilbert/HilbertCurve.java">davidmoten/hilbert-curve</a>,
+ * which is also licensed under http://www.apache.org/licenses/LICENSE-2.0.
+ */
+public final class HilbertCurve {

Review comment:
       Is this class copied from https://github.com/davidmoten/hilbert-curve/blob/master/src/main/java/org/davidmoten/hilbert/HilbertCurve.java? Could we just add that library as a dependency and have a wrapper class around it if needed?
   ```
   <dependency>
       <groupId>com.github.davidmoten</groupId>
       <artifactId>hilbert-curve</artifactId>
       <version>VERSION_HERE</version>
   </dependency>
   ```

##########
File path: hudi-client/hudi-client-common/src/test/java/org/apache/hudi/optimize/TestZOrderingUtil.java
##########
@@ -126,4 +126,21 @@ public OrginValueWrapper(T index, T originValue) {
       this.originValue = originValue;
     }
   }
+
+  @Test
+  public void testConvertBytesToLong() {

Review comment:
       Could you add another test for the case when the length of the byte array passed to `convertBytesToLong()` is not 8, so that the padding logic is exercised?
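For illustration, a self-contained sketch of what such a test could exercise. The `paddingTo8Byte()` stand-in below is an assumption, not Hudi's actual implementation: it assumes short inputs are left-padded with zero bytes (which preserves the big-endian numeric value) and longer inputs are truncated to 8 bytes.

```java
import java.nio.ByteBuffer;

// Illustrative only: paddingTo8Byte() here is a stand-in that left-pads short
// inputs with zero bytes; the real Hudi implementation may pad differently.
public class ConvertBytesToLongSketch {

  static byte[] paddingTo8Byte(byte[] bytes) {
    if (bytes.length == 8) {
      return bytes; // already the target length, return as-is
    }
    byte[] padded = new byte[8]; // zero-filled by default
    int copyLen = Math.min(bytes.length, 8);
    // copy the input into the low-order (rightmost) positions
    System.arraycopy(bytes, 0, padded, 8 - copyLen, copyLen);
    return padded;
  }

  static long convertBytesToLong(byte[] bytes) {
    return ByteBuffer.wrap(paddingTo8Byte(bytes)).getLong();
  }

  public static void main(String[] args) {
    // 2-byte input {0x01, 0x02} is padded to {0,0,0,0,0,0,0x01,0x02} -> 258
    System.out.println(convertBytesToLong(new byte[] {0x01, 0x02}));
    // 8-byte input passes through without padding
    System.out.println(convertBytesToLong(new byte[] {0, 0, 0, 0, 0, 0, 0, 7}));
  }
}
```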

##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/optimize/ZOrderingUtil.java
##########
@@ -176,9 +176,17 @@ public static byte updatePos(byte a, int apos, byte b, int bpos) {
 
   public static Long convertStringToLong(String a) {
     byte[] bytes = utf8To8Byte(a);
+    return convertBytesToLong(bytes);
+  }
+
+  public static long convertBytesToLong(byte[] bytes) {
+    byte[] padBytes = bytes;

Review comment:
       nit: could be named `paddedBytes`

##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/spark/SpaceCurveOptimizeHelper.java
##########
@@ -67,40 +69,62 @@
 import java.util.ArrayList;
 import java.util.Arrays;
 import java.util.Collection;
+import java.util.Iterator;
 import java.util.List;
 import java.util.Map;
 import java.util.stream.Collectors;
 
-public class ZCurveOptimizeHelper {
+public class SpaceCurveOptimizeHelper {
 
   private static final String SPARK_JOB_DESCRIPTION = "spark.job.description";
 
   /**
-   * Create z-order DataFrame directly
-   * first, map all base type data to byte[8], then create z-order DataFrame
+   * Create optimized DataFrame directly
   * only support base type data. long,int,short,double,float,string,timestamp,decimal,date,byte
-   * this method is more effective than createZIndexDataFrameBySample
+   * this method is more effective than createOptimizeDataFrameBySample
    *
    * @param df a spark DataFrame holds parquet files to be read.
-   * @param zCols z-sort cols
+   * @param sortCols z-sort/hilbert-sort cols
    * @param fileNum spark partition num
-   * @return a dataFrame sorted by z-order.
+   * @param sortMode layout optimization strategy
+   * @return a dataFrame sorted by z-order/hilbert.

Review comment:
       similar here.

##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/spark/SpaceCurveOptimizeHelper.java
##########
@@ -67,40 +69,62 @@
 import java.util.ArrayList;
 import java.util.Arrays;
 import java.util.Collection;
+import java.util.Iterator;
 import java.util.List;
 import java.util.Map;
 import java.util.stream.Collectors;
 
-public class ZCurveOptimizeHelper {
+public class SpaceCurveOptimizeHelper {
 
   private static final String SPARK_JOB_DESCRIPTION = "spark.job.description";
 
   /**
-   * Create z-order DataFrame directly
-   * first, map all base type data to byte[8], then create z-order DataFrame
+   * Create optimized DataFrame directly
   * only support base type data. long,int,short,double,float,string,timestamp,decimal,date,byte
-   * this method is more effective than createZIndexDataFrameBySample
+   * this method is more effective than createOptimizeDataFrameBySample
    *
    * @param df a spark DataFrame holds parquet files to be read.
-   * @param zCols z-sort cols
+   * @param sortCols z-sort/hilbert-sort cols

Review comment:
       nit: `z-sort/hilbert-sort cols` -> `sorting columns`? (no need to mention the sorting mechanism here, to keep it general)

##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/optimize/ZOrderingUtil.java
##########
@@ -176,9 +176,17 @@ public static byte updatePos(byte a, int apos, byte b, int bpos) {
 
   public static Long convertStringToLong(String a) {
     byte[] bytes = utf8To8Byte(a);
+    return convertBytesToLong(bytes);
+  }
+
+  public static long convertBytesToLong(byte[] bytes) {
+    byte[] padBytes = bytes;
+    if (bytes.length != 8) {
+      padBytes = paddingTo8Byte(bytes);
+    }

Review comment:
       You can simply have `byte[] paddedBytes = paddingTo8Byte(bytes);` since `paddingTo8Byte()` already checks the length internally.
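Concretely, the simplification suggested above could look like the sketch below. The body of `paddingTo8Byte()` here is a hypothetical stand-in (left-padding with zero bytes) used only to show why the caller-side length branch becomes redundant once the check lives inside the helper.

```java
import java.nio.ByteBuffer;

// Sketch of the suggested simplification: the length check lives inside
// paddingTo8Byte(), so convertBytesToLong() no longer branches on length.
// The padding body is a stand-in, not Hudi's actual implementation.
public class SimplifiedConvert {

  static byte[] paddingTo8Byte(byte[] bytes) {
    if (bytes.length == 8) {
      return bytes; // internal check makes the caller-side check redundant
    }
    byte[] padded = new byte[8];
    int copyLen = Math.min(bytes.length, 8);
    System.arraycopy(bytes, 0, padded, 8 - copyLen, copyLen);
    return padded;
  }

  static long convertBytesToLong(byte[] bytes) {
    // single call, no `if (bytes.length != 8)` needed here
    return ByteBuffer.wrap(paddingTo8Byte(bytes)).getLong();
  }

  public static void main(String[] args) {
    // works for both short and full-length inputs
    System.out.println(convertBytesToLong(new byte[] {0, 1}));
    System.out.println(convertBytesToLong(new byte[] {0, 0, 0, 0, 0, 0, 0, 1}));
  }
}
```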




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
