10183974 opened a new issue, #49464:
URL: https://github.com/apache/arrow/issues/49464
When using the Apache Arrow Java Dataset API (FileSystemDatasetFactory) to
read ORC files directly from HDFS, the JVM fails to exit after reading
completes. The application hangs indefinitely because two non-daemon native
threads, started by the underlying libhdfs JNI layer, remain active.
This issue does not occur when reading local files; the JVM exits normally
in that scenario. It specifically affects HDFS interactions where the C++
libhdfs client is loaded via JNI.
I have tested this on Apache Arrow version 9.0.0 (the last version with
built-in HDFS support) and confirmed the behavior. I also tried version
17.0.0 with the separate arrow-hdfs module, but the thread leak remains
once the dependency is added.
Currently, the only workaround is to force-terminate the JVM with
System.exit(0), which is not ideal for applications that rely on shutdown
hooks or run inside complex container environments.
Here is the Java code that reproduces the issue:
import org.apache.arrow.dataset.file.FileFormat;
import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
import org.apache.arrow.dataset.jni.NativeMemoryPool;
import org.apache.arrow.dataset.scanner.ScanOptions;
import org.apache.arrow.dataset.scanner.Scanner;
import org.apache.arrow.dataset.source.Dataset;
import org.apache.arrow.dataset.source.DatasetFactory;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowReader;

String hdfsUri =
    "hdfs://node01.cdh5:8020/user/hive/warehouse/perf_test_200col_500w_nopk/ds=2/perf_test_200col_500w_nopk";
ScanOptions options = new ScanOptions(/*batchSize*/ 32768);
try (BufferAllocator allocator = new RootAllocator();
     NativeMemoryPool pool = NativeMemoryPool.getDefault();
     DatasetFactory datasetFactory =
         new FileSystemDatasetFactory(allocator, pool, FileFormat.ORC, hdfsUri);
     Dataset dataset = datasetFactory.finish();
     Scanner scanner = dataset.newScan(options);
     ArrowReader reader = scanner.scanBatches()) {
    int totalRowCount = 0;
    while (reader.loadNextBatch()) {
        // The reader owns the root; this pattern follows the Arrow Java examples.
        try (VectorSchemaRoot root = reader.getVectorSchemaRoot()) {
            totalRowCount += root.getRowCount();
        }
    }
    System.out.println("Total row count: " + totalRowCount);
} catch (Exception e) {
    e.printStackTrace();
}
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]