ziudu opened a new issue, #11438:
URL: https://github.com/apache/hudi/issues/11438
Dear maintainers, I think the record index is an amazing feature. However, I
ran into a problem when selecting a large number of records:
Background:
hudi_table_path = ("/ilw/test/...")
updated_sub_df = spark.read.format("hudi") \
.option("hoodie.metadata.enable", "true") \
.option("hoodie.metadata.record.index.enable", "true") \
.option("hoodie.enable.data.skipping", "true") \
.load(hudi_table_path)
Method 1: The record index and file pruning work, but if the list is huge
(1 million values, for example), Spark hangs. I guess this happens at the
Catalyst Optimizer step, since the list is converted into a SQL IN predicate
before the physical plan is executed.
values = list(range(1, 10001))
query_df = updated_sub_df.filter(updated_sub_df.new_id.isin(values))
print(query_df.count())
Method 2: we searched on Google and found a way to broadcast the list and test
membership with a UDF, as follows. The Catalyst Optimizer worked fine, but
neither the record index nor file pruning was triggered.
from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

values = list(range(1, 20001))
broadcast_values = spark.sparkContext.broadcast(values)
isin_values_udf = F.udf(lambda x: x in broadcast_values.value, BooleanType())
query_df = updated_sub_df.where(isin_values_udf(updated_sub_df['new_id']))
print("Count values: ", query_df.count())
Is there any way to select a large number of values via the record index?
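For reference, one workaround we have been considering (only a sketch; we have
not confirmed it keeps the record index engaged) is to stay with the plain
.isin() filter that the record index understands, but split the key list into
smaller chunks and OR the resulting predicates, so that no single IN expression
overwhelms the optimizer. The chunk helper below is hypothetical, and the chunk
size of 10,000 is an arbitrary guess:

```python
def chunk(values, size):
    """Yield successive slices of `values` with at most `size` elements each."""
    for i in range(0, len(values), size):
        yield values[i:i + size]

# Hypothetical usage against the DataFrame from above (needs a live SparkSession):
#   from functools import reduce
#   preds = [updated_sub_df.new_id.isin(c) for c in chunk(values, 10000)]
#   query_df = updated_sub_df.filter(reduce(lambda a, b: a | b, preds))

# Sanity check of the helper itself on the 20,000-key list from Method 2:
print(len(list(chunk(list(range(1, 20001)), 10000))))  # 2 chunks of 10,000 keys
```

Whether Catalyst plans a chain of ORed IN predicates any faster than one giant
IN, and whether Hudi still prunes files for such a filter, is exactly what we
would like the maintainers to confirm.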
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]