10183974 opened a new issue, #49464:
URL: https://github.com/apache/arrow/issues/49464

   When using the Apache Arrow Java Dataset API (FileSystemDatasetFactory) to 
read ORC files directly from HDFS, the JVM fails to exit gracefully after the 
reading process is complete. The application hangs indefinitely because two 
non-daemon native threads (started by the underlying libhdfs JNI layer) remain 
active.
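
A quick way to confirm which threads are keeping the JVM alive is to list the live non-daemon threads once the scan has finished. The sketch below is only a diagnostic aid and assumes the leaked native threads are attached to the JVM, so that they show up in Thread.getAllStackTraces(). The class and method names (NonDaemonThreadReport, liveNonDaemonThreads) are hypothetical and not part of Arrow or libhdfs.

```java
import java.util.ArrayList;
import java.util.List;

public class NonDaemonThreadReport {

    /** Returns the live non-daemon threads, excluding the calling thread. */
    static List<Thread> liveNonDaemonThreads() {
        List<Thread> result = new ArrayList<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            // Only non-daemon threads prevent the JVM from exiting.
            if (t.isAlive() && !t.isDaemon() && t != Thread.currentThread()) {
                result.add(t);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumption: call this after the dataset scan's try-with-resources
        // block has closed, just before main() returns.
        for (Thread t : liveNonDaemonThreads()) {
            System.out.println(t.getName() + " state=" + t.getState());
        }
    }
}
```

Running this at the end of main() in both the local-file and HDFS cases should make the difference between the two visible in the output.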
   
   This issue does not occur when reading local files; the JVM exits normally 
in that scenario. It specifically affects HDFS interactions where the C++ 
libhdfs client is loaded via JNI.
   
   I have tested this on Apache Arrow version 9.0.0 (the last version with 
built-in HDFS support) and confirmed the behavior persists. I also attempted 
version 17.0.0 with the separate arrow-hdfs module, but the thread leakage 
behavior remains the same once the dependency is added.
   
   Currently, the only workaround is to forcibly terminate the JVM with 
System.exit(0), which is not ideal for applications that rely on shutdown hooks 
or run inside complex containers.
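
One way to narrow the workaround is to call System.exit only when non-daemon threads are actually left over, so that the local-file case still exits normally. This is a minimal sketch of that idea, not a fix for the underlying leak. GuardedExit and exitIfThreadsLinger are hypothetical names, and the exit action is passed in as a parameter so the logic can be exercised without terminating the JVM.

```java
import java.util.function.IntConsumer;

public class GuardedExit {

    /** Counts live non-daemon threads other than the calling thread. */
    static long countLingeringThreads() {
        Thread current = Thread.currentThread();
        return Thread.getAllStackTraces().keySet().stream()
                .filter(t -> t.isAlive() && !t.isDaemon() && t != current)
                .count();
    }

    /**
     * Invokes exitAction with the given status only if other non-daemon
     * threads are still alive. Returns true if the action was invoked.
     */
    static boolean exitIfThreadsLinger(int status, IntConsumer exitAction) {
        if (countLingeringThreads() > 0) {
            exitAction.accept(status);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumption: the dataset scan has already run and all Arrow
        // resources have been closed before this point.
        exitIfThreadsLinger(0, System::exit);
    }
}
```

Note that System.exit starts the JVM's normal shutdown sequence, so registered shutdown hooks still run. Runtime.halt is the call that skips them.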
   
   Here is the Java code:
   String hdfsUri = "hdfs://node01.cdh5:8020/user/hive/warehouse/perf_test_200col_500w_nopk/ds=2/perf_test_200col_500w_nopk";
   ScanOptions options = new ScanOptions(/*batchSize*/ 32768);
   try (
       BufferAllocator allocator = new RootAllocator();
       NativeMemoryPool pool = NativeMemoryPool.getDefault();
       DatasetFactory datasetFactory = new FileSystemDatasetFactory(allocator, pool, FileFormat.ORC, hdfsUri);
       Dataset dataset = datasetFactory.finish();
       Scanner scanner = dataset.newScan(options);
       ArrowReader reader = scanner.scanBatches()
   ) {
       int totalBatchSize = 0;
       while (reader.loadNextBatch()) {
           try (VectorSchemaRoot root = reader.getVectorSchemaRoot()) {
               totalBatchSize += root.getRowCount();
           }
       }
       System.out.println("Total batch size: " + totalBatchSize);
   } catch (Exception e) {
       e.printStackTrace();
   }


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
