zhangyue19921010 opened a new issue, #13030:
URL: https://github.com/apache/arrow/issues/13030
Hi Team,

I am developing `HoodieArrowParquetFileFormat`, aiming to run Spark SQL and query Hudi data through Arrow.

Based on the Spark SQL abstraction, I need to read only part of a Parquet file, i.e. the byte range `[offset0, offset1]`. Is it possible to do this through the `newScan` API? For example, can we pass `start_offset`, `length`, and `file_format` directly, as in the JNI change sketched below?
```cpp
JNIEXPORT jlong JNICALL
Java_org_apache_arrow_dataset_file_JniWrapper_makeFileSystemDatasetFactory(
    JNIEnv* env, jobject, jstring uri, jlong file_format_id,
    jlong start_offset, jlong length) {
  JNI_METHOD_START
  std::shared_ptr<arrow::dataset::FileFormat> file_format =
      JniGetOrThrow(GetFileFormat(file_format_id));
  arrow::dataset::FileSystemFactoryOptions options;
  // Proposed: forward the byte range [start_offset, start_offset + length)
  // so the factory only exposes that slice of the file behind `uri`.
  std::shared_ptr<arrow::dataset::DatasetFactory> d =
      JniGetOrThrow(arrow::dataset::FileSystemDatasetFactory::Make(
          JStringToCString(env, uri), start_offset, length, file_format,
          options));
  return CreateNativeRef(d);
  JNI_METHOD_END(-1L)
}
```
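
On the Java side, `JniWrapper` would need a matching native declaration. A minimal sketch mirroring the C++ entry point above; the `startOffset`/`length` parameters are the proposed addition, not the current API:

```java
// org.apache.arrow.dataset.file.JniWrapper -- hypothetical overload that
// mirrors the C++ function above. startOffset/length delimit the byte
// range [startOffset, startOffset + length) of the file behind `uri`.
public native long makeFileSystemDatasetFactory(
    String uri, long fileFormatId, long startOffset, long length);
```

On the Spark side, the read path currently looks like this: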
```scala
import scala.collection.JavaConverters._ // for .asScala

(file: PartitionedFile) => {
  val allocator = HoodieArrowUtils.getAllocator()
  val factory = HoodieArrowUtils.getDatasetFactory(allocator, file.filePath)
  val dataset = factory.finish(
    HoodieArrowUtils.toArrowSchema(requiredSchema, HoodieArrowUtils.getLocalTimezoneID()))
  val scanOptions = new ScanOptions(batchSize)
  val scanner = dataset.newScan(scanOptions)
  // Release the native resources once the Spark task completes.
  Option(TaskContext.get()).foreach(_.addTaskCompletionListener[Unit](_ => {
    scanner.close()
    dataset.close()
    factory.close()
  }))
  val itr = scanner.scan().iterator().asScala.toList
    .flatMap(task => task.execute().asScala.toList)
    .map(batch => HoodieArrowUtils.loadBatch(
      batch, file.partitionValues, partitionSchema, requiredSchema, allocator))
  itr.iterator // iterator over the loaded batches for this split
}
```
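
With such an overload in place, the reader above could pass Spark's split boundaries (`file.start` and `file.length` on `PartitionedFile`) straight down to the factory. A rough sketch of what `HoodieArrowUtils.getDatasetFactory` could do internally; the two trailing `long` arguments are hypothetical, the rest follows the existing Arrow Java Dataset constructor:

```java
import org.apache.arrow.dataset.file.FileFormat;
import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
import org.apache.arrow.dataset.jni.NativeMemoryPool;
import org.apache.arrow.memory.BufferAllocator;

// Hypothetical helper: build a factory restricted to one Spark split,
// i.e. the byte range [startOffset, startOffset + length) of a single
// Parquet file. The two trailing long arguments are the proposed
// addition; the other parameters follow the existing constructor.
static FileSystemDatasetFactory factoryForSplit(
    BufferAllocator allocator, String uri, long startOffset, long length) {
  return new FileSystemDatasetFactory(
      allocator, NativeMemoryPool.getDefault(), FileFormat.PARQUET,
      uri, startOffset, length);
}
```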