10183974 opened a new issue, #1057:
URL: https://github.com/apache/arrow-java/issues/1057
### Describe the bug, including details regarding any error messages, version, and platform.
We are encountering a critical issue when using the Apache Arrow Java Dataset API to read a large local Parquet file (15 GB). The JVM process is killed by the OS OOM killer despite sufficient Docker container memory (15 GB) and conservative JVM heap/direct-memory settings.

Additionally, attempting to mitigate this by using a listenable `NativeMemoryPool` results in a fatal JNI error (`JNIEnv was not attached to current thread`).
**Environment**
- Apache Arrow version: 17.0.0
- OS: Linux
- JDK: 8
- File format: Parquet (15 GB, generated by Pandas/PyArrow default config)
- Hardware: Docker container limited to 15 GB RAM
**Reproduction Steps**

1. Configuration

Docker memory limit: 15 GB

JVM arguments:

```
java \
  -Xms1g -Xmx3g \
  -XX:MaxDirectMemorySize=3g \
  -XX:MaxMetaspaceSize=256m \
  -XX:+UseG1GC \
  -XX:+PrintGCDetails \
  -XX:+PrintGCDateStamps \
  -Xloggc:gc.log \
  -XX:+HeapDumpOnOutOfMemoryError \
  -XX:HeapDumpPath=./heap_dump.hprof \
  -Darrow.memory.debug.allocator=true \
  -jar arrow-test-1.0-SNAPSHOT.jar
```
Code logic: using `FileSystemDatasetFactory` to read a single 15 GB Parquet file with a batch size of 65,536.
2. Code snippet

```java
Runtime runtime = Runtime.getRuntime();
// Unit-conversion helper
long mb = 1024 * 1024;
System.out.println("===== JVM Memory Overview =====");
System.out.println("Max memory: " + runtime.maxMemory() / mb + " MB");
System.out.println("Total memory: " + runtime.totalMemory() / mb + " MB");
System.out.println("Free memory: " + runtime.freeMemory() / mb + " MB");
System.out.println("Used memory: " + (runtime.totalMemory() - runtime.freeMemory()) / mb + " MB");

// fileSize: 15 GB
String hdfsUri =
    "hdfs://node01.cdh5:8020/user/hive/warehouse/perf_test_200col_500w_nopk_parquet/ds=1/perf_test_200col_500w_nopk_parquet";
ScanOptions options = new ScanOptions(100000);
try (RootAllocator allocator = new RootAllocator(512 * 1024 * 1024);
     NativeMemoryPool nativePool =
         NativeMemoryPool.createListenable(DirectReservationListener.instance());
     FileSystemDatasetFactory datasetFactory =
         new FileSystemDatasetFactory(allocator, nativePool, FileFormat.PARQUET, hdfsUri);
     Dataset dataset = datasetFactory.finish();
     Scanner scanner = dataset.newScan(options);
     ArrowReader reader = scanner.scanBatches()) {
    long totalRows = 0;
    int batchCount = 0;
    while (reader.loadNextBatch()) {
        try (VectorSchemaRoot root = reader.getVectorSchemaRoot()) {
            int rowCount = root.getRowCount();
            totalRows += rowCount;
            batchCount++;
            System.out.println("Batch " + batchCount + " - rows: " + rowCount + ", total: " + totalRows);
        }
    }
    System.out.println("\n===== Processing Complete =====");
    System.out.println("Total batches: " + batchCount);
    System.out.println("Total rows: " + totalRows);
} catch (Exception e) {
    logger.error("Error processing parquet file", e);
    e.printStackTrace();
}
```
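One diagnostic worth pairing with the `Runtime` printouts above: the JVM's own counters only cover heap and non-heap (metaspace, code cache, etc.), so a widening gap between those and the process RSS points at native allocations the JVM does not track, such as Arrow's C++ memory pool. A minimal, stdlib-only sketch of that comparison (Linux-specific; assumes procfs is mounted, and the class name `RssVsHeap` is just for illustration):

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class RssVsHeap {

    /** Resident set size in KB, read from /proc/self/status; -1 if unavailable. */
    public static long rssKb() {
        try {
            List<String> lines =
                    Files.readAllLines(Paths.get("/proc/self/status"), StandardCharsets.UTF_8);
            for (String line : lines) {
                if (line.startsWith("VmRSS:")) {
                    return Long.parseLong(line.replaceAll("[^0-9]", ""));
                }
            }
        } catch (IOException ignored) {
            // Not Linux, or procfs unavailable.
        }
        return -1;
    }

    public static void main(String[] args) {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        long mb = 1024 * 1024;
        long heapUsed = mem.getHeapMemoryUsage().getUsed() / mb;
        long nonHeapUsed = mem.getNonHeapMemoryUsage().getUsed() / mb;
        long rss = rssKb() / 1024;
        System.out.println("heap used:     " + heapUsed + " MB");
        System.out.println("non-heap used: " + nonHeapUsed + " MB");
        System.out.println("process RSS:   " + rss + " MB");
        // If RSS keeps climbing while heap + non-heap stay flat, the growth
        // is in memory outside the JVM's accounting (e.g. the native pool).
    }
}
```

Printing these three numbers once per batch inside the read loop would show which side of the boundary is growing before the kill.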
Observed behavior:

```
Batch 1 - rows: 41547, total: 41547
/mnt/executor/sandbox/shell-0000001022165.sh: line 21: 92 Killed java -Xms1g -Xmx3g -XX:MaxDirectMemorySize=3g -XX:MaxMetaspaceSize=256m -XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:gc.log -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=./heap_dump.hprof -Darrow.memory.debug.allocator=true -jar /mnt/executor/sandbox/resources/0000001022165/resource.arrow-test-1.0-SNAPSHOT.jar
```
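For what it's worth, summing the caps the JVM actually enforces shows how far they fall short of explaining the kill; a quick tally using the flag values and the 15 GB container limit from the configuration above (class name `MemoryBudget` is just for illustration):

```java
public class MemoryBudget {
    public static void main(String[] args) {
        long gb = 1024L * 1024 * 1024;
        long heap = 3 * gb;                   // -Xmx3g
        long direct = 3 * gb;                 // -XX:MaxDirectMemorySize=3g
        long metaspace = 256L * 1024 * 1024;  // -XX:MaxMetaspaceSize=256m
        long governed = heap + direct + metaspace;
        System.out.println("JVM-governed ceiling: ~" + governed / gb + " GB of a 15 GB container");
        // ~6.25 GB in total, so for the container's OOM killer to fire, the
        // remaining headroom must be consumed by allocations these flags do
        // not govern, e.g. the native memory pool used by the Dataset scanner.
    }
}
```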
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]