luoyuxia commented on PR #904:
URL: https://github.com/apache/arrow-java/pull/904#issuecomment-3550532985

   > > > > > > @V-Fenil Hi, thanks for your interest. I already built the .so (for Linux) and .dylib (for macOS), but I don't have a Windows env, so I can't provide a .dll for you. To verify this PR, you'll need to build from source; see https://github.com/apache/arrow-java?tab=readme-ov-file#building-from-source. That's also what I did to verify my PR.
   > > > > > 
   > > > > > 
   > > > > > Hi @luoyuxia, I'm testing your PR on Linux. Could you share the built libarrow_dataset_jni.so file? I can build the Java side but need the native library. (More specifically, my build succeeded but I can't find the .so file.)
   > > > > > The total build time was 49 mins, and Arrow Java C Data Interface & Arrow Java Dataset took only 45 sec each, so I guess there was no C++ compilation. It would be better if you could share the file directly.
   > > > > 
   > > > > 
   > > > > Of course I can share it. I can share the `libarrow_dataset_jni.so` as well as the jar built with it. How should I share it: send it to your email, or some other way?
   > > > 
   > > > 
   > > > I'm testing PR #904 (native Parquet writer via JNI) on Ubuntu 
22.04/WSL2 with Java 11 and hitting a consistent failure during ParquetWriter 
initialization.
   > > > Setup:
   > > > 
   > > > * Downloaded jni-linux-x86_64 artifacts from CI build (run 
#19222857860)
   > > > * Using Arrow Java 19.0.0-SNAPSHOT with both libarrow_dataset_jni.so 
and
   > > >   libarrow_cdata_jni.so loaded
   > > > * All library dependencies resolved (ldd shows no missing libraries)
   > > > 
   > > > Error: The ParquetWriter constructor fails at line 71 with a memory leak error during cleanup:
   > > > 
   > > >     java.lang.IllegalStateException: Memory was leaked by query. Memory leaked: (128) Allocator(ROOT) 0/128/4998/9223372036854775807 (res/actual/peak/limit)
   > > >         at org.apache.arrow.dataset.file.ParquetWriter.close(ParquetWriter.java:158)
   > > >         at org.apache.arrow.dataset.file.ParquetWriter.<init>(ParquetWriter.java:71)
   > > > Analysis: Looking at the bytecode, the constructor creates a 
RootAllocator, then calls either `ArrowSchema.allocateNew()` (line 24) or 
`Data.exportSchema()` (line 37), which throws an exception. The constructor's 
cleanup calls close(), which detects the 128-byte leak from the allocator 
created at line 14.
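   > > > For context, here's a minimal sketch of how that cleanup path produces the leak error, independent of ParquetWriter internals (the 128-byte buffer stands in for the outstanding allocation; class and variable names are illustrative):
   > > > 
   > > > ```java
   > > > import org.apache.arrow.memory.ArrowBuf;
   > > > import org.apache.arrow.memory.BufferAllocator;
   > > > import org.apache.arrow.memory.RootAllocator;
   > > > 
   > > > public class LeakCheckSketch {
   > > >     public static void main(String[] args) {
   > > >         // Stand-in for the allocator the constructor creates
   > > >         BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
   > > >         // Stand-in for the 128-byte buffer that is never released
   > > >         ArrowBuf buf = allocator.buffer(128);
   > > >         // If an exception fires before buf is released, cleanup ends up here:
   > > >         allocator.close(); // IllegalStateException: Memory was leaked by query. Memory leaked: (128) ...
   > > >     }
   > > > }
   > > > ```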
   > > > Questions:
   > > > 
   > > > 1. Are there additional native libraries or system dependencies 
required beyond
   > > >    libarrow_dataset_jni.so and libarrow_cdata_jni.so?
   > > > 2. Is the CI build fully functional, or does it require Arrow C++ 
runtime libraries
   > > >    to be installed separately?
   > > > 3. What's the expected initialization sequence for ParquetWriter with 
these JNI
   > > >    libraries?
   > > > 
   > > > The Java code is simply:
   > > > 
   > > > ```java
   > > > FileOutputStream fos = new FileOutputStream(outputPath);
   > > > ParquetWriter writer = new ParquetWriter(fos, schema);
   > > > ```
   > > > Any guidance would be appreciated. Thanks for this PR - looking 
forward to using the native performance!
   > > 
   > > 
   > > Hi, what's your code? I used the following on my local macOS but couldn't reproduce it:
   > > ```java
   > > import java.io.FileOutputStream;
   > > import java.nio.file.Path;
   > > import java.nio.file.Paths;
   > > import java.util.Arrays;
   > > import java.util.List;
   > > 
   > > import org.apache.arrow.dataset.file.ParquetWriter;
   > > import org.apache.arrow.memory.BufferAllocator;
   > > import org.apache.arrow.memory.RootAllocator;
   > > import org.apache.arrow.vector.IntVector;
   > > import org.apache.arrow.vector.VarCharVector;
   > > import org.apache.arrow.vector.VectorSchemaRoot;
   > > import org.apache.arrow.vector.types.pojo.ArrowType;
   > > import org.apache.arrow.vector.types.pojo.Field;
   > > import org.apache.arrow.vector.types.pojo.Schema;
   > > import org.junit.jupiter.api.Test;
   > > import org.junit.jupiter.api.io.TempDir;
   > > 
   > > public class ParquetWriteTest {
   > > 
   > >     @TempDir public Path tempDir;
   > > 
   > >     @Test
   > >     void test() throws Exception {
   > >         String parquetFilePath =
   > >                 Paths.get("testParquetWriter.parquet").toString();
   > >         List<Field> fields =
   > >                 Arrays.asList(
   > >                         Field.nullable("id", new ArrowType.Int(32, true)),
   > >                         Field.nullable("name", new ArrowType.Utf8()));
   > >         Schema arrowSchema = new Schema(fields);
   > > 
   > >         int[] ids = new int[] {1, 2, 3, 4, 5};
   > >         String[] names = new String[] {"Alice", "Bob", "Charlie", "Diana", "Errrrve"};
   > > 
   > >         // Write Parquet file
   > >         try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
   > >              FileOutputStream fos = new FileOutputStream(parquetFilePath);
   > >              ParquetWriter writer = new ParquetWriter(fos, arrowSchema);
   > >              VectorSchemaRoot vectorSchemaRoot = createData(allocator, arrowSchema, ids, names)) {
   > >             writer.write(vectorSchemaRoot);
   > >         }
   > >     }
   > > 
   > >     private static VectorSchemaRoot createData(
   > >             BufferAllocator allocator, Schema schema, int[] ids, String[] names) {
   > >         // Create VectorSchemaRoot from schema
   > >         VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator);
   > >         // Allocate space for vectors (we'll write 5 rows)
   > >         root.allocateNew();
   > > 
   > >         // Get field vectors
   > >         IntVector idVector = (IntVector) root.getVector("id");
   > >         VarCharVector nameVector = (VarCharVector) root.getVector("name");
   > > 
   > >         // Write data to vectors
   > >         for (int i = 0; i < ids.length; i++) {
   > >             idVector.setSafe(i, ids[i]);
   > >             nameVector.setSafe(i, names[i].getBytes());
   > >         }
   > > 
   > >         // Set the row count
   > >         root.setRowCount(ids.length);
   > > 
   > >         return root;
   > >     }
   > > }
   > > ```
   > > 
   > > I'll try to find time to reproduce it on Linux.
   > 
   > @luoyuxia
   > 
   > Thanks for the quick response! Here's a minimal test case that reproduces 
the issue on Ubuntu 22.04/WSL2:
   > 
   > Test Code:
   > 
   > ```java
   > import org.apache.arrow.dataset.file.ParquetWriter;
   > import org.apache.arrow.memory.BufferAllocator;
   > import org.apache.arrow.memory.RootAllocator;
   > import org.apache.arrow.vector.*;
   > import org.apache.arrow.vector.types.pojo.ArrowType;
   > import org.apache.arrow.vector.types.pojo.Field;
   > import org.apache.arrow.vector.types.pojo.Schema;
   > 
   > import java.io.File;
   > import java.io.FileOutputStream;
   > import java.util.Arrays;
   > import java.util.List;
   > 
   > public class MinimalParquetTest {
   >     public static void main(String[] args) throws Exception {
   >         // Load native libraries (required on Linux)
   >         System.load(new File("src/main/resources/arrow_cdata_jni/x86_64/libarrow_cdata_jni.so").getAbsolutePath());
   >         System.load(new File("src/main/resources/arrow_dataset_jni/x86_64/libarrow_dataset_jni.so").getAbsolutePath());
   > 
   >         String parquetFilePath = "/tmp/test.parquet";
   > 
   >         List<Field> fields = Arrays.asList(
   >             Field.nullable("id", new ArrowType.Int(32, true)),
   >             Field.nullable("name", new ArrowType.Utf8())
   >         );
   >         Schema arrowSchema = new Schema(fields);
   > 
   >         int[] ids = new int[] {1, 2, 3, 4, 5};
   >         String[] names = new String[] {"Alice", "Bob", "Charlie", "Diana", "Eve"};
   > 
   >         // THIS LINE FAILS - ParquetWriter constructor throws during initialization
   >         try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
   >              FileOutputStream fos = new FileOutputStream(parquetFilePath);
   >              ParquetWriter writer = new ParquetWriter(fos, arrowSchema)) {  // ← Fails here
   > 
   >             // Never gets here - constructor fails
   >             VectorSchemaRoot root = VectorSchemaRoot.create(arrowSchema, allocator);
   >             root.allocateNew();
   > 
   >             IntVector idVector = (IntVector) root.getVector("id");
   >             VarCharVector nameVector = (VarCharVector) root.getVector("name");
   > 
   >             for (int i = 0; i < ids.length; i++) {
   >                 idVector.setSafe(i, ids[i]);
   >                 nameVector.setSafe(i, names[i].getBytes());
   >             }
   > 
   >             root.setRowCount(ids.length);
   >             writer.write(root);
   >         }
   >     }
   > }
   > ```
   > 
   > Environment:
   > 
   > * OS: Ubuntu 22.04 LTS (WSL2 on Windows 11)
   > * Java: OpenJDK 11.0.25
   > * Arrow Version: 19.0.0-SNAPSHOT from PR [GH-735: Support write arrow 
record batch #904](https://github.com/apache/arrow-java/pull/904)
   > * Native libs: Downloaded from CI run #19222857860 (jni-linux-x86_64.zip)
   > 
   > Maven Dependencies:
   > 
   > ```xml
   > <dependency>
   >     <groupId>org.apache.arrow</groupId>
   >     <artifactId>arrow-dataset</artifactId>
   >     <version>19.0.0-SNAPSHOT</version>
   > </dependency>
   > <dependency>
   >     <groupId>org.apache.arrow</groupId>
   >     <artifactId>arrow-c-data</artifactId>
   >     <version>19.0.0-SNAPSHOT</version>
   > </dependency>
   > <dependency>
   >     <groupId>org.apache.arrow</groupId>
   >     <artifactId>arrow-vector</artifactId>
   >     <version>19.0.0-SNAPSHOT</version>
   > </dependency>
   > <dependency>
   >     <groupId>org.apache.arrow</groupId>
   >     <artifactId>arrow-memory-netty</artifactId>
   >     <version>19.0.0-SNAPSHOT</version>
   > </dependency>
   > ```
   > 
   > Run Command:
   > 
   > ```bash
   > java --add-opens=java.base/java.nio=ALL-UNNAMED -Xmx4G -cp "target/classes:target/lib/*" MinimalParquetTest
   > ```
   > 
   > Stack Trace:
   > 
   >     java.lang.IllegalStateException: Memory was leaked by query. Memory leaked: (128) Allocator(ROOT) 0/128/4998/9223372036854775807 (res/actual/peak/limit)
   >         at org.apache.arrow.memory.BaseAllocator.close(BaseAllocator.java:504)
   >         at org.apache.arrow.memory.RootAllocator.close(RootAllocator.java:27)
   >         at org.apache.arrow.dataset.file.ParquetWriter.close(ParquetWriter.java:158)
   >         at org.apache.arrow.dataset.file.ParquetWriter.<init>(ParquetWriter.java:71)
   > 
   > Key Differences from macOS: On macOS, you likely have the Arrow C++ libraries installed via Homebrew. On Linux/WSL2, I'm using only the JNI libraries from the CI build. Could the JNI libraries require the Arrow C++ runtime libraries to be installed separately on Linux?
   > 
   > Things I've verified:
   > 
   > * Both libarrow_dataset_jni.so and libarrow_cdata_jni.so load successfully
   > * No missing library dependencies (ldd shows all resolved)
   > * File permissions are correct (755 on .so files)
   > 
   > Would appreciate any Linux-specific setup steps I might be missing. Thanks!
   
   @V-Fenil Hi, I spent some time debugging in a Linux env and did find a problem.
   
   The `.so` downloaded from CI doesn't include the JNI method I introduced in this PR, although I don't know why. You can run `nm -D libarrow_dataset_jni.so | grep nativeCreateParquetWriter` to check.
   So I built it with `mvn generate-resources -Pgenerate-libs-jni-macos-linux -N`, following the guide at https://arrow.apache.org/docs/developers/java/building.html#id3. Then it works. If you need them, I can send you the `.so` and the arrow-dataset jar I built.
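   
   As a side note for anyone debugging this class of failure: per the analysis above, the leak error raised during the constructor's cleanup can hide the original exception (for example an UnsatisfiedLinkError from the missing JNI symbol). A minimal sketch for surfacing whatever triggered the cleanup; it assumes `fos` and `arrowSchema` are set up as in the MinimalParquetTest snippet above:
   
   ```java
   // Debugging sketch: print the full exception chain so the root cause
   // (e.g. a missing native method) isn't lost behind the leak check.
   try (ParquetWriter writer = new ParquetWriter(fos, arrowSchema)) {
       // construction succeeded; writes would go here
   } catch (Throwable t) {
       t.printStackTrace(); // e.g. the IllegalStateException from the leak check
       for (Throwable s : t.getSuppressed()) {
           s.printStackTrace(); // suppressed cleanup errors, if any
       }
       if (t.getCause() != null) {
           t.getCause().printStackTrace(); // chained root cause, if present
       }
   }
   ```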
   
   
   

