zhangyue19921010 opened a new issue, #13030:
URL: https://github.com/apache/arrow/issues/13030
Hi Team,

I am developing `HoodieArrowParquetFileFormat`, aiming to run Spark SQL and query Hudi data through Arrow.

Based on the Spark SQL abstraction, I need to read only part of a Parquet file, i.e. the byte range `[offset0, offset1]`. Is it possible to do this through the `newScan` API? For example, can we pass `start_offset`, `length`, and `file_format` directly, as in the JNI change sketched below?
```cpp
JNIEXPORT jlong JNICALL
Java_org_apache_arrow_dataset_file_JniWrapper_makeFileSystemDatasetFactory(
    JNIEnv* env, jobject, jstring uri, jlong file_format_id,
    jlong start_offset, jlong length) {
  JNI_METHOD_START
  std::shared_ptr<arrow::dataset::FileFormat> file_format =
      JniGetOrThrow(GetFileFormat(file_format_id));
  arrow::dataset::FileSystemFactoryOptions options;
  // Proposed: forward the byte range [start_offset, start_offset + length)
  // so the factory only exposes that slice of the file behind `uri`.
  std::shared_ptr<arrow::dataset::DatasetFactory> d =
      JniGetOrThrow(arrow::dataset::FileSystemDatasetFactory::Make(
          JStringToCString(env, uri), start_offset, length, file_format,
          options));
  return CreateNativeRef(d);
  JNI_METHOD_END(-1L)
}
```
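
On the Java side, `JniWrapper` would need a matching native declaration. A minimal sketch mirroring the C++ entry point above; the `startOffset`/`length` parameters are the proposed addition, not the current API:

```java
// org.apache.arrow.dataset.file.JniWrapper -- hypothetical overload that
// mirrors the C++ function above. startOffset/length delimit the byte
// range [startOffset, startOffset + length) of the file behind `uri`.
public native long makeFileSystemDatasetFactory(
    String uri, long fileFormatId, long startOffset, long length);
```

On the Spark side, the read path currently looks like this: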
```scala
import scala.collection.JavaConverters._ // for .asScala

(file: PartitionedFile) => {
  val allocator = HoodieArrowUtils.getAllocator()
  val factory = HoodieArrowUtils.getDatasetFactory(allocator, file.filePath)
  val dataset = factory.finish(
    HoodieArrowUtils.toArrowSchema(requiredSchema, HoodieArrowUtils.getLocalTimezoneID()))
  val scanOptions = new ScanOptions(batchSize)
  val scanner = dataset.newScan(scanOptions)
  // Release the native resources once the Spark task completes.
  Option(TaskContext.get()).foreach(_.addTaskCompletionListener[Unit](_ => {
    scanner.close()
    dataset.close()
    factory.close()
  }))
  val itr = scanner.scan().iterator().asScala.toList
    .flatMap(task => task.execute().asScala.toList)
    .map(batch => HoodieArrowUtils.loadBatch(
      batch, file.partitionValues, partitionSchema, requiredSchema, allocator))
  itr.iterator // iterator over the loaded batches for this split
}
```
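
With such an overload in place, the reader above could pass Spark's split boundaries (`file.start` and `file.length` on `PartitionedFile`) straight down to the factory. A rough sketch of what `HoodieArrowUtils.getDatasetFactory` could do internally; the two trailing `long` arguments are hypothetical, the rest follows the existing Arrow Java Dataset constructor:

```java
import org.apache.arrow.dataset.file.FileFormat;
import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
import org.apache.arrow.dataset.jni.NativeMemoryPool;
import org.apache.arrow.memory.BufferAllocator;

// Hypothetical helper: build a factory restricted to one Spark split,
// i.e. the byte range [startOffset, startOffset + length) of a single
// Parquet file. The two trailing long arguments are the proposed
// addition; the other parameters follow the existing constructor.
static FileSystemDatasetFactory factoryForSplit(
    BufferAllocator allocator, String uri, long startOffset, long length) {
  return new FileSystemDatasetFactory(
      allocator, NativeMemoryPool.getDefault(), FileFormat.PARQUET,
      uri, startOffset, length);
}
```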