[GitHub] [hudi] ankur1603 opened a new issue #1986: [SUPPORT]: Possibility to disable precombine logic

GitBox Wed, 19 Aug 2020 06:38:54 -0700


ankur1603 opened a new issue #1986:
URL: https://github.com/apache/hudi/issues/1986



   **Describe the problem you faced**
   I have a specific scenario for which I am trying to use Apache Hudi.
   
   Here is an example to explain the requirement:
   **Input:**
   ```
   col1,col2,col3,time
   1,a,b,0
   1,a,c,1
   1,a,d,2
   ```
   
   **Expected output:** 
   The goal is to keep the latest row while capturing every change to `col3` together with its `time`:
   `1,a,d,Seq((0,b),(1,c),(2,d))`
   
   Here is the list of properties that I have set:
   ```
   1. recordkey.field -> "col1" 
   2. precombine.field -> "time"
   3. table.type -> "COPY_ON_WRITE"
   4. payload.class -> "MyCustomPayloadClass"
   ```
   
   Is it possible to disable the `preCombine` functionality? As it stands, it keeps only the latest record, and the older records are ignored.
   
   I am also struggling with the merge logic:
   - How do I alter the schema? For example, the input is `col1,col2,col3,time` and the output should be `col1,col2,col3,Seq((time,col3))`. I assume this should be possible in `getInsertValue()`.
   - How do I merge the two records in `combineAndGetUpdateValue()` as described above?
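
To make the intended merge concrete, here is a minimal sketch in plain Java. It does not use or extend any Hudi class: `HistoryPayload` and its `merge` method are hypothetical names, and the sketch only assumes that a custom payload could carry the accumulated `(time, col3)` history and return the union of two histories where the default behaviour would keep only the latest record. How this maps onto `preCombine`, `getInsertValue()` and `combineAndGetUpdateValue()` in a given Hudi version is left open.

```java
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

// Hypothetical stand-in for a custom payload; it does not use or extend any Hudi class.
final class HistoryPayload {
    final String col1;
    final String col2;
    // time -> col3, kept sorted by time so the last entry is the latest value.
    final TreeMap<Long, String> history = new TreeMap<>();

    HistoryPayload(String col1, String col2, long time, String col3) {
        this.col1 = col1;
        this.col2 = col2;
        this.history.put(time, col3);
    }

    // Returns a new payload holding the union of both histories instead of
    // keeping only the record with the larger time.
    HistoryPayload merge(HistoryPayload other) {
        Map.Entry<Long, String> first = history.firstEntry();
        HistoryPayload merged = new HistoryPayload(col1, col2, first.getKey(), first.getValue());
        merged.history.putAll(this.history);
        merged.history.putAll(other.history); // on equal times, other's value wins
        return merged;
    }

    // Renders the row in the shape of the expected output: col1,col2,latest col3,Seq(...)
    @Override
    public String toString() {
        StringJoiner seq = new StringJoiner(",", "Seq(", ")");
        for (Map.Entry<Long, String> e : history.entrySet()) {
            seq.add("(" + e.getKey() + "," + e.getValue() + ")");
        }
        return col1 + "," + col2 + "," + history.lastEntry().getValue() + "," + seq;
    }

    public static void main(String[] args) {
        HistoryPayload merged = new HistoryPayload("1", "a", 0L, "b")
            .merge(new HistoryPayload("1", "a", 1L, "c"))
            .merge(new HistoryPayload("1", "a", 2L, "d"));
        System.out.println(merged); // 1,a,d,Seq((0,b),(1,c),(2,d))
    }
}
```

Because the history is keyed by `time` in a sorted map, the result does not depend on the order in which the three input rows are merged; only rows with the same `time` would need an explicit tie-breaking rule.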
   
   Thanks in advance.
   
   **Environment Description**
   
   * Hudi version : 0.5.0-incubating 
   
   * Spark version : 2.2.0
   
   * Hive version : NA
   
   * Hadoop version : 2.7.1
   
   * Storage (HDFS/S3/GCS..) : HDFS
   
   * Running on Docker? (yes/no) : No
   
   


