calleo opened a new issue #2975: URL: https://github.com/apache/hudi/issues/2975
**Describe the problem you faced**

I would like to read records from a Hudi table by record key, to avoid scanning the entire table. I've read through the examples on how to query a Hudi table, and the Spark Datasource docs mention `read(keys)`, but it is unclear how to apply this from PySpark.

What I am doing is reading data from a source table (non-Hudi), transforming it, and writing it to a target Hudi table. Sometimes this involves updating existing records in the target, and the merge logic is non-trivial. So the approach I am taking is:

1. Read new rows from the source => df1
2. Read the rows to be updated from the target (this is where reading by record key would help) => df2
3. Union df1 and df2, then transform the data => transformed_df
4. Upsert the target using transformed_df

**Expected behavior**

Read from a Hudi table using record keys.

**Environment Description**

* Hudi version : 0.5.3
* Spark version : 2.4.3 (using AWS Glue 2.0, PySpark)
* Hive version : AWS Glue Catalog
* Hadoop version :
* Storage (HDFS/S3/GCS..) :
* Running on Docker? (yes/no) :
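One workaround for step 2, as a minimal sketch: Hudi exposes the record key in the `_hoodie_record_key` meta column of every snapshot read, so "reading by key" can be approximated by a normal Datasource read plus an `IN` filter on that column, which Spark can push down. The table path `s3://bucket/target_table` and the key values below are hypothetical; this is not the `read(keys)` API from the docs, just a filter-based substitute.

```python
def key_filter(keys):
    """Build a Spark SQL IN-predicate over Hudi's `_hoodie_record_key` meta column."""
    quoted = ", ".join("'{}'".format(k.replace("'", "\\'")) for k in keys)
    return "_hoodie_record_key IN ({})".format(quoted)

# Usage inside a Spark job (not executed here; path and keys are hypothetical):
# df2 = (spark.read.format("org.apache.hudi")
#        .load("s3://bucket/target_table/*/*")
#        .where(key_filter(["id-1", "id-2"])))
```

This still reads the table's metadata and any files whose column statistics overlap the requested keys, so it is cheaper than a full scan but not a true point lookup.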
