davisusanibar commented on issue #37323:
URL: https://github.com/apache/arrow/issues/37323#issuecomment-1690360880

   Hi @zinking,
   
   You could try with this:
   
   1. Go to arrow/java/dataset
   2. Get the absolute path of your new dataset native library, e.g. `$PWD/lib/x86_64/libarrow_dataset_jni.so`
   3. Build the new Java Dataset module:
   `mvn clean install -Darrow.cpp.build.dir=$PWD/lib`
   4. Test your new dataset library with:
   ```java
   import org.apache.arrow.dataset.file.FileFormat;
   import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
   import org.apache.arrow.dataset.jni.NativeMemoryPool;
   import org.apache.arrow.dataset.scanner.ScanOptions;
   import org.apache.arrow.dataset.scanner.Scanner;
   import org.apache.arrow.dataset.source.Dataset;
   import org.apache.arrow.dataset.source.DatasetFactory;
   import org.apache.arrow.memory.BufferAllocator;
   import org.apache.arrow.memory.RootAllocator;
   import org.apache.arrow.vector.ipc.ArrowReader;
   import org.apache.arrow.vector.types.pojo.Schema;
   
   public class ReadHdfsParquet {
       public static void main(String[] args) {
            // Declare the JVM environment variable HADOOP_HOME, e.g. /Users/dsusanibar/hadoop-3.3.2;
            // the native HDFS library is searched for under lib/native/libhdfs.dylib.
            String uri = "hdfs://localhost:9000/Users/dsusanibar/data4_2rg_gzip.parquet";
           ScanOptions options = new ScanOptions(/*batchSize*/ 32768);
           try (
               BufferAllocator allocator = new RootAllocator();
               DatasetFactory datasetFactory = new 
FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(), 
FileFormat.PARQUET, uri);
               Dataset dataset = datasetFactory.finish();
               Scanner scanner = dataset.newScan(options);
               ArrowReader reader = scanner.scanBatches()
           ) {
               Schema schema = scanner.schema();
               System.out.println(schema);
                while (reader.loadNextBatch()) {
                    System.out.println(reader.getVectorSchemaRoot().contentToTSVString());
                    System.out.println("RowCount: " + reader.getVectorSchemaRoot().getRowCount());
               }
           } catch (Exception e) {
               e.printStackTrace();
           }
       }
   }
   ```
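   Steps 1–3 above can be sketched as a short shell snippet. This is only a sketch: the checkout location and the architecture subdirectory (`x86_64` vs. `aarch64`) are assumptions you should adjust to your machine, and the `mvn` invocation is left commented out so the snippet is safe to run as-is.

   ```shell
   # Sketch of the build steps above (paths are examples; adjust to your checkout).
   ARROW_JAVA_DATASET=arrow/java/dataset   # hypothetical checkout location
   ARCH=$(uname -m)                        # e.g. x86_64 or aarch64
   LIB="$PWD/lib/$ARCH/libarrow_dataset_jni.so"

   # The Dataset build expects the JNI library under <arrow.cpp.build.dir>/<arch>/.
   echo "Expecting native library at: $LIB"

   # Then build the Java Dataset module against it:
   # cd "$ARROW_JAVA_DATASET"
   # mvn clean install -Darrow.cpp.build.dir="$PWD/lib"
   ```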
   
   I was able to read HDFS Parquet files with this, but for some reason the program does not shut down afterwards.
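   
   One possible explanation (an assumption, not confirmed for this case) is that a non-daemon thread, e.g. from the HDFS client, is still alive when `main` returns, which keeps the JVM running. The stdlib-only sketch below reproduces that behavior and shows `System.exit(0)` as a blunt workaround; `HangDemo` and its sleeping thread are hypothetical stand-ins, not Arrow or Hadoop code.
   
   ```java
   // Minimal stdlib-only sketch: a non-daemon thread keeps the JVM alive
   // after main() returns, mimicking what a lingering client thread may do.
   public class HangDemo {
       public static void main(String[] args) {
           Thread t = new Thread(() -> {
               try {
                   Thread.sleep(Long.MAX_VALUE); // stand-in for a lingering worker
               } catch (InterruptedException ignored) {
               }
           });
           t.setDaemon(false); // non-daemon: the JVM waits for this thread
           t.start();
           System.out.println("main done");
           System.exit(0); // forces shutdown despite the live non-daemon thread
       }
   }
   ```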
   
   
![image](https://github.com/apache/arrow/assets/4554485/cb425179-ba79-4246-9fab-696b2902d8c2)
   
   **Please let me know if you obtain the same result on your side.**
   

