ziudu opened a new issue, #11438:
URL: https://github.com/apache/hudi/issues/11438
Dear maintainers, I think the record index is an amazing feature. However, I
ran into a problem when selecting a large number of records:
Background:
hudi_table_path = ("/ilw/test/...")
updated_sub_df = spark.read.format("hudi") \
.option("hoodie.metadata.enable", "true") \
.option("hoodie.metadata.record.index.enable", "true") \
.option("hoodie.enable.data.skipping", "true") \
.load(hudi_table_path)
Method 1: The record index and file pruning work, but if the list is huge
(1 million values, for example), Spark hangs. I guess this happens at the
Catalyst Optimizer step, since the list is converted into a SQL IN predicate
before the physical plan is executed.
values = list(range(1, 10001))
query_df = updated_sub_df.filter(updated_sub_df.new_id.isin(values))
print(query_df.count())
Method 2: we searched on Google and found a way to broadcast the list and test
membership with a UDF, as follows. The Catalyst Optimizer worked fine, but
neither the record index nor file pruning was triggered.
from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

values = list(range(1, 20001))
broadcast_values = spark.sparkContext.broadcast(values)
isin_values_udf = F.udf(lambda x: x in broadcast_values.value, BooleanType())
query_df = updated_sub_df.where(isin_values_udf(updated_sub_df['new_id']))
print("Count values: ", query_df.count())
Is there any way to select a large number of values via the record index?
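For reference, one workaround we have been considering (only a sketch; we have
not confirmed it keeps the record index engaged) is to stay with the plain
.isin() filter that the record index understands, but split the key list into
smaller chunks and OR the resulting predicates, so that no single IN expression
overwhelms the optimizer. The chunk helper below is hypothetical, and the chunk
size of 10,000 is an arbitrary guess:

```python
def chunk(values, size):
    """Yield successive slices of `values` with at most `size` elements each."""
    for i in range(0, len(values), size):
        yield values[i:i + size]

# Hypothetical usage against the DataFrame from above (needs a live SparkSession):
#   from functools import reduce
#   preds = [updated_sub_df.new_id.isin(c) for c in chunk(values, 10000)]
#   query_df = updated_sub_df.filter(reduce(lambda a, b: a | b, preds))

# Sanity check of the helper itself on the 20,000-key list from Method 2:
print(len(list(chunk(list(range(1, 20001)), 10000))))  # 2 chunks of 10,000 keys
```

Whether Catalyst plans a chain of ORed IN predicates any faster than one giant
IN, and whether Hudi still prunes files for such a filter, is exactly what we
would like the maintainers to confirm.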
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]