PhantomHunt opened a new issue, #8572:
URL: https://github.com/apache/hudi/issues/8572
We have created a Hudi data lake with version 0.13.0.
We need to read data from a few tables in an incremental fashion.
To fetch the active timeline for a MoR table, we use the following
piece of code, where basePath is the S3 bucket path where the data lives:
```
metaClient = (
    spark._jvm.org.apache.hudi.common.table.HoodieTableMetaClient.builder()
    .setConf(spark._jsc.hadoopConfiguration())
    .setBasePath(basePath)
    .setLoadActiveTimelineOnLoad(True)
    .build()
)
timeline = metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants()
instants = timeline.getInstants().collect(
    spark._jvm.java.util.stream.Collectors.toList()
).toArray()
list_timestamps = [instant.getTimestamp() for instant in instants]
```
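For chunked incremental reads, consecutive completed instants can be paired into (begin, end) windows. A minimal pure-Python sketch of that pairing; `commit_windows` is a hypothetical helper, not a Hudi API, and assumes the timestamps come from the timeline code above:

```python
def commit_windows(timestamps):
    """Pair consecutive completed instants into (begin, end) windows.

    Hudi instant times are fixed-width digit strings, so plain string
    sort orders them chronologically.
    """
    ordered = sorted(timestamps)
    return list(zip(ordered, ordered[1:]))

# Each window can feed hoodie.datasource.read.begin.instanttime /
# hoodie.datasource.read.end.instanttime for one incremental pull.
windows = commit_windows(["20230410135802426",
                          "20230410110310171",
                          "20230410111802858"])
```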
The output looks like this:
`["20230410110310171", "20230410111802858", "20230410135802426",
"20230410233706724", "20230411070325158", "20230412075305123",
"20230412104308890", "20230412112440414", "20230412123348380",
"20230412123408573", "20230412123426951", "20230412143444989",
"20230412143503391", "20230413104721504", "20230413120831774",
"20230413122750909", "20230413153023354", "20230414045300420",
"20230414105813727", "20230414110336441", "20230414111346898",
"20230414142833034", "20230414145746900", "20230414145806366",
"20230418070525211", "20230418095219696", "20230419055721930",
"20230419065820905", "20230419100813940", "20230419111328181",
"20230419160833254", "20230420055335537", "20230420055920454",
"20230420061332289", "20230420080546225", "20230420080606133",
"20230420081457928", "20230420090733990", "20230420092736555",
"20230420093835449", "20230420135133130"]`
When we tried to read the MoR table incrementally:
```
mor = {
    'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
    'hoodie.datasource.query.type': 'incremental',
    'hoodie.datasource.read.begin.instanttime': '20230331130832572',
    'hoodie.datasource.read.end.instanttime': '20230410110310171'
}
try:
    df = spark.read.format("org.apache.hudi").options(**mor).load(path_to_table)
    return df
except Exception as e:
    log.msg(e, "e")
    return None
```
We got the following error:
```
23/04/20 14:25:33 ERROR Executor: Exception in task 0.0 in stage 11.0 (TID 36)
java.io.FileNotFoundException: No such file or directory:
s3a://****/2df00b9b-9fae-45a4-8492-e11ef16740b3-0_0-298-497_20230410110310171.parquet
```
The writer configuration is:
```
writer_config = {
    'fs.s3a.impl': 'org.apache.hadoop.fs.s3a.S3AFileSystem',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'cdc_timestamp',
    'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
    'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS',
    'hoodie.schema.on.read.enable': 'true',
    'hoodie.datasource.write.reconcile.schema': 'true',
    'hoodie.datasource.write.keygenerator.class':
        'org.apache.hudi.keygen.NonpartitionedKeyGenerator',
    'hoodie.table.name': table_name,
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.table.name': table_name,
    'hoodie.upsert.shuffle.parallelism': 200,
    'hoodie.keep.max.commits': 50,
    'hoodie.keep.min.commits': 40,
    'hoodie.cleaner.commits.retained': 30
}
```
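For context on the retention settings above: with `hoodie.cleaner.commits.retained` at 30 but archival keeping 40–50 instants on the timeline (`hoodie.keep.min.commits` / `hoodie.keep.max.commits`), an instant can still appear on the active timeline after its data files have been cleaned. A small illustrative sketch of that gap, assuming one commit per write; `readable_window` is a hypothetical helper, not part of Hudi:

```python
# Values taken from the writer_config in the report.
CLEANER_COMMITS_RETAINED = 30  # data files kept for the newest 30 commits
KEEP_MIN_COMMITS = 40          # timeline keeps at least 40 instants

def readable_window(active_instants, retained=CLEANER_COMMITS_RETAINED):
    """Instants whose data files the cleaner should still retain.

    Instants older than this window may remain on the timeline
    (until archival) while their parquet files are already gone,
    which is exactly when an incremental read can hit
    FileNotFoundException.
    """
    ordered = sorted(active_instants)
    return ordered[-retained:]
```

So up to roughly `KEEP_MIN_COMMITS - CLEANER_COMMITS_RETAINED` of the oldest timeline instants can be unsafe to use as a begin time.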
Language - Python
Hudi Version - 0.13.0
Job Type - Python script on EC2
Table Type - Non Partitioned MOR