[ https://issues.apache.org/jira/browse/HUDI-15?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970732#comment-16970732 ]

sivabalan narayanan edited comment on HUDI-15 at 11/9/19 4:15 PM:
------------------------------------------------------------------

I have a question from a usability standpoint. It looks like a DataFrame is the 
input (via DefaultSource), which contains HoodieRecords (with the HoodieRecord 
schema) in the case of inserts or updates. In the case of deletes, we have two 
options:

a) The DataFrame has an arbitrary schema. We derive and fetch values for only 
the required fields to generate the HoodieKey (the record key field and 
partition path field can be fetched from configs).

b) The DataFrame contains HoodieRecords. We strip out the HoodieKeys and pass 
them to HoodieWriteClient.delete().

Do we support one of these, or both?
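To make option (a) concrete, here is a rough sketch of the key-extraction idea in plain Scala. The `HoodieKey` case class, the rows-as-maps representation, and the field names are simplified stand-ins for the real Spark/Hudi types, not the actual Hudi API:

```scala
// Simplified stand-in for org.apache.hudi.common.model.HoodieKey.
case class HoodieKey(recordKey: String, partitionPath: String)

// Option (a): each input row has an arbitrary schema (modeled here as a
// Map), and only the two configured fields are read to build the
// HoodieKey; everything else in the row is ignored.
def extractKeys(rows: Seq[Map[String, String]],
                recordKeyField: String,
                partitionPathField: String): Seq[HoodieKey] =
  rows.map { row =>
    HoodieKey(row(recordKeyField), row(partitionPathField))
  }
```

The point of the sketch is that only the configured record key and partition path fields need to exist in the incoming schema; the rest of the row is irrelevant for a delete.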

Here is the issue with option (a):

// I am not very conversant yet, just finding my way around, so I am not sure 
if this is a real issue.

In HoodieSparkSqlWriter, these are the steps performed before creating a client:
 * Register Kryo classes
 * Generate the schema from the df (AvroConversionUtils.convertStructTypeToAvroSchema)
 * Register the schema
 * Create the client

My understanding is that in step 2 above, the code expects certain fields that 
may not be present if we go with option (a). And I am not sure whether we can 
create a client with some other schema (i.e. a schema pertaining only to the 
HoodieKey) rather than the HoodieRecord schema.
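To illustrate the concern: if the incoming df carries only the key fields, the schema generated in step 2 would be a narrow record like the one sketched below, rather than the full HoodieRecord schema. This is plain Scala hand-building an Avro-style JSON schema string; the record name and helper are hypothetical, not Hudi code:

```scala
// Sketch of the narrow schema step 2 would produce if the df contains
// only the key fields: an Avro-style record holding just the record key
// and partition path. Whether client creation accepts such a schema
// (instead of the full HoodieRecord schema) is exactly the open question.
def keyOnlySchema(recordKeyField: String, partitionPathField: String): String = {
  val fields = Seq(recordKeyField, partitionPathField)
    .map(f => s"""{"name": "$f", "type": "string"}""")
    .mkString(", ")
  s"""{"type": "record", "name": "KeyOnlyRecord", "fields": [$fields]}"""
}
```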

Please advise me on this.

> Add a delete() API to HoodieWriteClient as well as Spark datasource #531
> ------------------------------------------------------------------------
>
>                 Key: HUDI-15
>                 URL: https://issues.apache.org/jira/browse/HUDI-15
>             Project: Apache Hudi (incubating)
>          Issue Type: New Feature
>          Components: Spark datasource, Write Client
>            Reporter: Vinoth Chandar
>            Assignee: sivabalan narayanan
>            Priority: Major
>             Fix For: 0.5.1
>
>
> The delete API needs to be supported as a first-class citizen via DeltaStreamer, 
> WriteClient and datasources. Currently there are two ways to delete: soft 
> deletes and hard deletes - https://hudi.apache.org/writing_data.html#deletes. 
> We need to ensure that for hard deletes, we are able to leverage 
> EmptyHoodieRecordPayload with just the HoodieKey and an empty record value for 
> deleting.
> [https://github.com/uber/hudi/issues/531]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)