rahil-c commented on code in PR #17632:
URL: https://github.com/apache/hudi/pull/17632#discussion_r2644362952


##########
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/io/storage/TestHoodieSparkLanceReader.java:
##########
@@ -0,0 +1,513 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import org.apache.hudi.client.SparkTaskContextSupplier;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.testutils.HoodieTestUtils;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.storage.HoodieStorage;
+import org.apache.hudi.storage.StoragePath;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.unsafe.types.UTF8String;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.io.TempDir;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+
+import static org.apache.hudi.io.storage.LanceTestUtils.createRow;
+import static 
org.apache.hudi.io.storage.LanceTestUtils.createRowWithMetaFields;
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertNotNull;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+/**
+ * Tests for {@link HoodieSparkLanceReader}.
+ */
+public class TestHoodieSparkLanceReader {
+
+  @TempDir
+  File tempDir;
+
+  private HoodieStorage storage;
+  private SparkTaskContextSupplier taskContextSupplier;
+  private String instantTime;
+
+  @BeforeEach
+  public void setUp() throws IOException {
+    storage = HoodieTestUtils.getStorage(tempDir.getAbsolutePath());
+    taskContextSupplier = new SparkTaskContextSupplier();
+    instantTime = "20251201120000000";
+  }
+
+  @AfterEach
+  public void tearDown() throws Exception {
+    if (storage != null) {
+      storage.close();
+    }
+  }
+
+  @Test
+  public void testReadPrimitiveTypes() throws Exception {
+    // Create schema with primitive types
+    StructType schema = new StructType()
+        .add("id", DataTypes.IntegerType, false)
+        .add("name", DataTypes.StringType, true)
+        .add("age", DataTypes.LongType, true)
+        .add("score", DataTypes.DoubleType, true)
+        .add("active", DataTypes.BooleanType, true);
+
+    // Create test data
+    List<InternalRow> expectedRows = new ArrayList<>();
+    expectedRows.add(createRow(1, "Alice", 30L, 95.5, true));
+    expectedRows.add(createRow(2, "Bob", 25L, 87.3, false));
+    expectedRows.add(createRow(3, "Charlie", 35L, 92.1, true));
+
+    // Write and read back
+    StoragePath path = new StoragePath(tempDir.getAbsolutePath() + 
"/test_primitives.lance");
+    try (HoodieSparkLanceReader reader = writeAndCreateReader(path, schema, 
expectedRows)) {
+      // Verify record count
+      assertEquals(expectedRows.size(), reader.getTotalRecords(), "Record 
count should match");
+
+      // Verify schema
+      assertNotNull(reader.getSchema(), "Schema should not be null");
+
+      // Read all records
+      List<InternalRow> actualRows = readAllRows(reader);
+
+      // Verify record count
+      assertEquals(expectedRows.size(), actualRows.size(), "Should read same 
number of records");
+
+      // Verify each record
+      for (int i = 0; i < expectedRows.size(); i++) {
+        InternalRow expected = expectedRows.get(i);
+        InternalRow actual = actualRows.get(i);
+
+        assertEquals(expected.getInt(0), actual.getInt(0), "id field should 
match");

Review Comment:
   @the-other-tim-brown 
   
   When doing an assertion on just the rows, such as `assertEquals(expected, actual);`, this fails with
   ```
   org.opentest4j.AssertionFailedError: 
   Expected :[1,Alice,30,95.5,true]
   Actual   :[0,1,3000000005,1e,4057e00000000000,1,6563696c41]
   ```
   
   The reason is that the actual rows that come back are in the form of serialized bytes (`UnsafeRow`) instead of the original row values (`GenericInternalRow`).
    
   <img width="700" height="513" alt="Screenshot 2025-12-23 at 3 44 00 PM" 
src="https://github.com/user-attachments/assets/2364466e-3d38-4343-9e9b-1982c39a0949";
 />
   
   The reason we get back unsafe rows is that our current implementation provides back an iterator of `UnsafeRow`
https://github.com/apache/hudi/pull/17632/files#diff-c10a66239458bfdf557797999ca1c2616ddeb70e97898a40c5b10dbcf0dd172eR116,
 following the pattern used in the Spark Parquet reader as well
https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkParquetReader.java#L118.
 
   
   It's only when we perform a get that the row value is materialized to its original value.
   <img width="454" height="46" alt="Screenshot 2025-12-23 at 3 51 25 PM" 
src="https://github.com/user-attachments/assets/dade9807-baee-48a3-b2c6-6082601d5636";
 />
   
   In terms of whether this will cause issues in the future: in one of the original PRs back in October for the full end-to-end Spark support,
https://github.com/apache/hudi/pull/14135/files#diff-259e9caf62b3fa46a53bb5044007d37a2f20f23eb18e071287efd1dfdadbd406R83
   I believe we had similar code, and when writing Spark data source tests and validating the data round trip to ensure correctness, we did not see issues.
   
   Let me know though how you want to proceed on this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to