calleo opened a new issue #2975: URL: https://github.com/apache/hudi/issues/2975
**Describe the problem you faced**

I would like to read records from a Hudi table by record key, to avoid scanning the entire table. I've read through the examples on how to query a Hudi table, and the Spark Datasource docs mention `read(keys)`, but it is unclear how to apply this from PySpark.

What I am doing is reading data from a source table (non-Hudi), transforming it, and writing it to a target Hudi table. Sometimes this involves updating existing records in the target, and the merge logic is non-trivial. So the approach I am taking is:

1. Read new rows from the source => df1
2. Read the rows to be updated from the target (this is where reading by record key would help) => df2
3. Union df1 and df2, then transform the data => transformed_df
4. Upsert the target using transformed_df

**Expected behavior**

Read from a Hudi table using record keys.

**Environment Description**

* Hudi version : 0.5.3
* Spark version : 2.4.3 (using AWS Glue 2.0, PySpark)
* Hive version : AWS Glue Catalog
* Hadoop version :
* Storage (HDFS/S3/GCS..) :
* Running on Docker? (yes/no) :
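One workaround for step 2, as a minimal sketch: Hudi exposes the record key in the `_hoodie_record_key` meta column of every snapshot read, so "reading by key" can be approximated by a normal Datasource read plus an `IN` filter on that column, which Spark can push down. The table path `s3://bucket/target_table` and the key values below are hypothetical; this is not the `read(keys)` API from the docs, just a filter-based substitute.

```python
def key_filter(keys):
    """Build a Spark SQL IN-predicate over Hudi's `_hoodie_record_key` meta column."""
    quoted = ", ".join("'{}'".format(k.replace("'", "\\'")) for k in keys)
    return "_hoodie_record_key IN ({})".format(quoted)

# Usage inside a Spark job (not executed here; path and keys are hypothetical):
# df2 = (spark.read.format("org.apache.hudi")
#        .load("s3://bucket/target_table/*/*")
#        .where(key_filter(["id-1", "id-2"])))
```

This still reads the table's metadata and any files whose column statistics overlap the requested keys, so it is cheaper than a full scan but not a true point lookup.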
