ygordefraga opened a new issue #2280:
URL: https://github.com/apache/hudi/issues/2280


   Hi there,
   
   I have a problem with Hudi partitioning when there is a single record key and multiple partition fields. I tried several different approaches, but none of them worked.
   
   The DataFrame I'm working with is pretty simple: it contains all the messages sent by the organizations. Example below:
   
   message_id | timestamp | status | organization_id | year | month | day
   ------------ | ------------- | ------------- | ------------- | ------------- | ------------- | -------------
   bdabfa6f-2a3e-4c17-acd7-350227473ae4 | 2020-11-25T10:00:00Z | SENT | 0b38bec3-15ac-4e57-9bb9-48d7de412ffa | 2020 | 11 | 25
   203d5495-9b5d-4003-b7f3-ab312a70db40 | 2020-11-25T11:00:00Z | SENT | 75e498d4-c979-4a12-b8df-1051c7976d34 | 2020 | 11 | 25
   09fa0543-cf5a-4e6b-9d16-ad14a8a7058a | 2020-10-22T09:00:00Z | NOT_SENT | 0b38bec3-15ac-4e57-9bb9-48d7de412ffa | 2020 | 10 | 22
   
   This DataFrame is written as a COW table, with the configs set as follows:
   
   ```
   "hoodie.datasource.write.insert.drop.duplicates" -> "true"
   "hoodie.insert.shuffle.parallelism" -> "32"
   "hoodie.finalize.write.parallelism" -> "32"
   "hoodie.datasource.write.recordkey.field" -> "message_id"
   "hoodie.datasource.write.precombine.field" -> timestampColumn
   "hoodie.datasource.write.partitionpath.field" -> "organization_id:SIMPLE,year:SIMPLE,month:SIMPLE,day:SIMPLE"
   "hoodie.datasource.write.keygenerator.class" -> classOf[ComplexKeyGenerator].getName
   ```
   I followed the [official docs](https://hudi.apache.org/docs/writing_data.html#key-generation) to set `"hoodie.datasource.write.partitionpath.field"`. I decided to extract `year`, `month` and `day` from `timestamp` to make partitioning easier.
   
   The **problem** is that when I write the table the way I just showed, the partitions end up looking like `messages/default/default/default/default`, but I want my table **to look like this**: `messages/organization_id=<organization_id>/year=<year>/month=<month>/day=<day>`.
   
   As a workaround, it did work when I added one more column containing the value `organization_id=<organization_id>/year=<year>/month=<month>/day=<day>` for each row, and set that column as the value of `"hoodie.datasource.write.partitionpath.field"`. However, with that approach the Spark job took half an hour to write 300k rows (1000 organizations).
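   For reference, the workaround above can be sketched like this (the `partition_path` column name is my own choice, not a Hudi requirement):

   ```scala
   // Sketch of the workaround: precompute a single column holding the full
   // partition path, then point partitionpath.field at that one column
   // (with a single-field key generator instead of ComplexKeyGenerator).
   import org.apache.spark.sql.functions.{col, concat_ws, lit}

   val withPartitionPath = df.withColumn(
     "partition_path",
     concat_ws("/",
       concat_ws("=", lit("organization_id"), col("organization_id")),
       concat_ws("=", lit("year"), col("year")),
       concat_ws("=", lit("month"), col("month")),
       concat_ws("=", lit("day"), col("day"))))

   // Then: "hoodie.datasource.write.partitionpath.field" -> "partition_path"
   ```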
   
   How can I make it work correctly?
   
   > Hudi Version = 0.5.2
   > Spark Version = 2.4.5
   
   
   
   

