davisusanibar commented on issue #36069:
URL: https://github.com/apache/arrow/issues/36069#issuecomment-1592076111
> CC @davisusanibar @lidavidm (note that this warning was newly added to the S3 filesystem in the previous release so it is very possible the Java implementation has never been calling finalize)
I was just able to reproduce this warning with:
```java
import org.apache.arrow.dataset.file.FileFormat;
import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
import org.apache.arrow.dataset.jni.NativeMemoryPool;
import org.apache.arrow.dataset.scanner.ScanOptions;
import org.apache.arrow.dataset.scanner.Scanner;
import org.apache.arrow.dataset.source.Dataset;
import org.apache.arrow.dataset.source.DatasetFactory;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.ipc.ArrowReader;
import org.apache.arrow.vector.types.pojo.Schema;

public class DatasetModule {
  public static void main(String[] args) {
    String uri = "s3://voltrondata-labs-datasets/nyc-taxi-tiny/year=2022/month=2/part-0.parquet"; // AWS S3
    // String uri = "hdfs://{hdfs_host}:{port}/nyc-taxi-tiny/year=2022/month=2/part-0.parquet"; // HDFS
    // String uri = "gs://voltrondata-labs-datasets/nyc-taxi-tiny/year=2022/month=2/part-0.parquet"; // Google Cloud Storage
    ScanOptions options = new ScanOptions(/*batchSize*/ 32768);
    try (
        BufferAllocator allocator = new RootAllocator();
        DatasetFactory datasetFactory = new FileSystemDatasetFactory(
            allocator, NativeMemoryPool.getDefault(), FileFormat.PARQUET, uri);
        Dataset dataset = datasetFactory.finish();
        Scanner scanner = dataset.newScan(options);
        ArrowReader reader = scanner.scanBatches()
    ) {
      Schema schema = scanner.schema();
      System.out.println(schema);
      while (reader.loadNextBatch()) {
        System.out.println("RowCount: " + reader.getVectorSchemaRoot().getRowCount());
      }
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}
```
Output messages:
```
RowCount: 2979
/Users/runner/work/crossbow/crossbow/arrow/cpp/src/arrow/filesystem/s3fs.cc:2598: arrow::fs::FinalizeS3 was not called even though S3 was initialized. This could lead to a segmentation fault at exit
```
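For context, the warning is emitted by the C++ S3 filesystem, which expects `arrow::fs::FinalizeS3` to run before the process exits; the Java Dataset JNI layer initializes S3 implicitly when an `s3://` URI is used but apparently never triggers that finalize step. Below is a minimal C++ sketch of the init/finalize pairing the warning refers to, using the public API in `arrow/filesystem/s3fs.h`; it is only an illustration of the native lifecycle, not the Java-side fix.

```cpp
// Minimal sketch of the S3 init/finalize pairing that s3fs.cc warns about
// when the finalize half is missing.
#include <iostream>

#include "arrow/filesystem/s3fs.h"

int main() {
  // Creating an S3FileSystem initializes the AWS SDK implicitly;
  // here we do it explicitly for clarity.
  auto init_status = arrow::fs::EnsureS3Initialized();
  if (!init_status.ok()) {
    std::cerr << init_status.ToString() << std::endl;
    return 1;
  }

  // ... create filesystems / datasets and read data ...

  // Without this call before exit, s3fs.cc logs the warning shown above and
  // the AWS SDK may tear down in an unsafe order (possible segfault at exit).
  auto finalize_status = arrow::fs::FinalizeS3();
  if (!finalize_status.ok()) {
    std::cerr << finalize_status.ToString() << std::endl;
    return 1;
  }
  return 0;
}
```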
Next steps:
1. Review the reason for the warning message
2. Add an Arrow Java cookbook recipe to cover S3 integration