afeldman1 commented on a change in pull request #1761:
URL: https://github.com/apache/hudi/pull/1761#discussion_r446695942
##########
File path: docs/_docs/2_2_writing_data.md
##########
@@ -176,15 +176,49 @@ In some cases, you may want to migrate your existing table into Hudi beforehand.
## Datasource Writer
-The `hudi-spark` module offers the DataSource API to write (and also read) any data frame into a Hudi table.
-Following is how we can upsert a dataframe, while specifying the field names that need to be used
-for `recordKey => _row_key`, `partitionPath => partition` and `precombineKey => timestamp`
+The `hudi-spark` module offers the DataSource API to write (and read) a Spark DataFrame into a Hudi table. There are a number of options available:
+**`HoodieWriteConfig`**:
+
+**TABLE_NAME** (Required)<br>
+
+
+**`DataSourceWriteOptions`**:
+
+**RECORDKEY_FIELD_OPT_KEY** (Required): Primary key field(s). Nested fields can be specified using dot notation, e.g. `a.b.c`. When using multiple columns as the primary key, use comma-separated notation, e.g. `"col1,col2,col3,etc"`. Whether a single column or multiple columns form the primary key is specified by the `KEYGENERATOR_CLASS_OPT_KEY` property.<br>
+Default value: `"uuid"`<br>
+
+**PARTITIONPATH_FIELD_OPT_KEY** (Required): Columns to be used for partitioning the table. To prevent partitioning, provide an empty string as the value, e.g. `""`. Specify partitioning/no partitioning using `HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY`.<br>
+Default value: `"partitionpath"`<br>
+
+**PRECOMBINE_FIELD_OPT_KEY** (Required): When two records have the same key value, the record with the largest value in the field specified here will be chosen.<br>
+Default value: `"ts"`<br>
+
+**OPERATION_OPT_KEY**: The [write operation](#write-operations) to use. Note: this cannot change across writes.<br>
Review comment:
Regarding deletion: I believe method 2 that you mentioned, using the delete API,
is the same thing as setting OPERATION_OPT_KEY to DELETE_OPERATION_OPT_VAL
(DELETE_OPERATION_OPT_VAL has the string value "delete"). As for method 3, I
believe the flag method is only applicable when using the DeltaStreamer, based
on this blog post: https://hudi.apache.org/blog/delete-support-in-hudi/ . Is
this correct? And for method 1, setting PAYLOAD_CLASS_OPT_KEY to
"org.apache.hudi.EmptyHoodieRecordPayload", could you please clarify the
difference between method 1 and method 2? The steps seem to be the same: first
a DataFrame must be prepared containing the full records that need to be
deleted, and then the write must be performed using one of the two methods.
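
For concreteness, here is how I currently picture methods 1 and 2 side by side.
This is just an untested, spark-shell style sketch, not code from the docs
page; the table name, field names, filter, and `basePath` are made-up
placeholders:

```scala
import org.apache.spark.sql.SaveMode

val basePath = "/tmp/hudi/my_table"  // hypothetical table location

// Hypothetical: read back the full records that should be deleted
// (the glob depth depends on how the table is partitioned).
val toDeleteDf = spark.read.format("org.apache.hudi")
  .load(basePath + "/*/*")
  .filter("uuid in ('id1', 'id2')")

// Method 2: an explicit delete write, i.e. OPERATION_OPT_KEY set to
// DELETE_OPERATION_OPT_VAL ("delete").
toDeleteDf.write.format("org.apache.hudi")
  .option("hoodie.table.name", "my_table")                    // HoodieWriteConfig.TABLE_NAME
  .option("hoodie.datasource.write.recordkey.field", "uuid")  // RECORDKEY_FIELD_OPT_KEY
  .option("hoodie.datasource.write.partitionpath.field", "partitionpath")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.operation", "delete")      // DELETE_OPERATION_OPT_VAL
  .mode(SaveMode.Append)
  .save(basePath)

// Method 1: a plain upsert, but with PAYLOAD_CLASS_OPT_KEY pointing at the
// empty record payload, so each matched key resolves to a delete.
toDeleteDf.write.format("org.apache.hudi")
  .option("hoodie.table.name", "my_table")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.partitionpath.field", "partitionpath")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.payload.class",
    "org.apache.hudi.EmptyHoodieRecordPayload")               // PAYLOAD_CLASS_OPT_KEY
  .mode(SaveMode.Append)
  .save(basePath)
```

Assuming both paths really produce the same deletes, it would help if the docs
noted when one is preferred over the other.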
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]