10183974 opened a new issue, #49472:
URL: https://github.com/apache/arrow/issues/49472

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
We are encountering a critical issue when using the Apache Arrow Java Dataset API to read a large local Parquet file (15 GB). The JVM process is killed by the OS OOM killer despite the Docker container having 15 GB of memory and conservative JVM heap/direct-memory settings.
   
Additionally, attempting to mitigate this with a listenable NativeMemoryPool results in a fatal JNI error ("JNIEnv was not attached to current thread").
   
Environment
Apache Arrow Version: 17.0.0
OS: Linux
JDK: JDK 8
File Format: Parquet (15 GB, generated with default Pandas/PyArrow settings)
Hardware: Docker container limited to 15 GB RAM
   
   Reproduction Steps
   1. Configuration
   Docker Memory Limit: 15GB
JVM Arguments:

```shell
java \
  -Xms1g -Xmx3g \
  -XX:MaxDirectMemorySize=3g \
  -XX:MaxMetaspaceSize=256m \
  -XX:+UseG1GC \
  -XX:+PrintGCDetails \
  -XX:+PrintGCDateStamps \
  -Xloggc:gc.log \
  -XX:+HeapDumpOnOutOfMemoryError \
  -XX:HeapDumpPath=./heap_dump.hprof \
  -Darrow.memory.debug.allocator=true \
  -jar arrow-test-1.0-SNAPSHOT.jar
```
   
Code Logic: Using FileSystemDatasetFactory to read a single 15 GB Parquet file with a batch size of 65,536.
2. Code Snippet

```java
Runtime runtime = Runtime.getRuntime();

// Unit-conversion helper
long mb = 1024 * 1024;
System.out.println("===== JVM Memory Overview =====");
System.out.println("Max Memory: " + runtime.maxMemory() / mb + " MB");
System.out.println("Total Memory: " + runtime.totalMemory() / mb + " MB");
System.out.println("Free Memory: " + runtime.freeMemory() / mb + " MB");
System.out.println("Used Memory: " + (runtime.totalMemory() - runtime.freeMemory()) / mb + " MB");

// fileSize: 15G
String hdfsUri = "hdfs://node01.cdh5:8020/user/hive/warehouse/perf_test_200col_500w_nopk_parquet/ds=1/perf_test_200col_500w_nopk_parquet";

ScanOptions options = new ScanOptions(100000);
try (RootAllocator allocator = new RootAllocator(512 * 1024 * 1024);
     NativeMemoryPool nativePool = NativeMemoryPool.createListenable(DirectReservationListener.instance());
     FileSystemDatasetFactory datasetFactory = new FileSystemDatasetFactory(allocator, nativePool, FileFormat.PARQUET, hdfsUri);
     Dataset dataset = datasetFactory.finish();
     Scanner scanner = dataset.newScan(options);
     ArrowReader reader = scanner.scanBatches()) {

    long totalRows = 0;
    int batchCount = 0;

    while (reader.loadNextBatch()) {
        try (VectorSchemaRoot root = reader.getVectorSchemaRoot()) {
            int rowCount = root.getRowCount();
            totalRows += rowCount;
            batchCount++;

            System.out.println("Batch " + batchCount + " - rows: " + rowCount + ", total: " + totalRows);
        }
    }

    System.out.println("\n===== Processing Complete =====");
    System.out.println("Total batches: " + batchCount);
    System.out.println("Total rows: " + totalRows);
} catch (Exception e) {
    logger.error("Error processing parquet file", e);
}
```
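To check whether the growth happens outside the JVM-managed areas (heap, direct buffers, metaspace), one option is to compare the JVM's own accounting with the resident set size the kernel reports for the process. This is a standalone diagnostic sketch (Linux-only, plain JDK, no Arrow dependency); the class name `RssVsJvm` and the `/proc/self/status` approach are illustrative assumptions, not part of the original report:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class RssVsJvm {
    // Parse VmRSS (resident set size, in kB) from /proc/self/status.
    // Returns -1 if procfs is unavailable (non-Linux platforms).
    static long rssKb() {
        try {
            for (String line : Files.readAllLines(Paths.get("/proc/self/status"))) {
                if (line.startsWith("VmRSS:")) {
                    return Long.parseLong(line.replaceAll("\\D+", ""));
                }
            }
        } catch (IOException e) {
            // procfs not available
        }
        return -1;
    }

    public static void main(String[] args) {
        long mb = 1024 * 1024;
        Runtime rt = Runtime.getRuntime();
        long jvmUsedMb = (rt.totalMemory() - rt.freeMemory()) / mb;
        long rssMb = rssKb() / 1024;
        // If native allocations are the culprit, RSS keeps climbing while
        // JVM heap usage stays flat.
        System.out.println("heap used: " + jvmUsedMb + " MB, process RSS: " + rssMb + " MB");
    }
}
```

Calling `rssKb()` once per batch inside the scan loop would show whether RSS diverges from heap usage as batches are read.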
   
Observed Behavior

```
Batch 1 - rows: 41547, total: 41547
/mnt/executor/sandbox/shell-0000001022165.sh: line 21:    92 Killed               java -Xms1g -Xmx3g -XX:MaxDirectMemorySize=3g -XX:MaxMetaspaceSize=256m -XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:gc.log -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=./heap_dump.hprof -Darrow.memory.debug.allocator=true -jar /mnt/executor/sandbox/resources/0000001022165/resource.arrow-test-1.0-SNAPSHOT.jar
```
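For context, the configured JVM limits alone do not explain the kill: heap (3 GB) + direct memory (3 GB) + metaspace (256 MB) still leaves roughly 8.7 GB of headroom in a 15 GB container. What those flags do not bound is the Arrow C++ side — allocations made by the native dataset scanner through its memory pool fall outside `-Xmx` and `-XX:MaxDirectMemorySize`. A back-of-the-envelope check (plain Java; `MemoryBudget` is an illustrative name, numbers are taken from the flags above):

```java
public class MemoryBudget {
    public static void main(String[] args) {
        long gb = 1024L * 1024 * 1024;
        long container = 15 * gb;            // Docker memory limit
        long heap = 3 * gb;                  // -Xmx3g
        long direct = 3 * gb;                // -XX:MaxDirectMemorySize=3g
        long metaspace = 256L * 1024 * 1024; // -XX:MaxMetaspaceSize=256m

        // Headroom left for everything the JVM flags do NOT bound:
        // thread stacks, code cache, and crucially the Arrow C++ memory
        // pool used by the native dataset scanner.
        long headroom = container - heap - direct - metaspace;
        System.out.println("unbounded headroom: " + headroom / gb + " GB");
        // → prints "unbounded headroom: 8 GB" (8.75 GB, truncated)
    }
}
```

If the native scanner buffers more than that headroom (e.g. wide rows across 200 columns with large row groups), the container's cgroup limit is breached and the OOM killer fires regardless of the JVM settings.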
   
   ### Component(s)
   
   Java


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
