[GitHub] [hudi] afeldman1 commented on a change in pull request #1761: [MINOR] Add documentation for using multi-column table keys and for n…

GitBox Mon, 29 Jun 2020 11:40:59 -0700


afeldman1 commented on a change in pull request #1761:
URL: https://github.com/apache/hudi/pull/1761#discussion_r447175985




##########
File path: docs/_docs/2_2_writing_data.md
##########
@@ -176,15 +176,49 @@ In some cases, you may want to migrate your existing 
table into Hudi beforehand.
 
 ## Datasource Writer
 
-The `hudi-spark` module offers the DataSource API to write (and also read) any 
data frame into a Hudi table.
-Following is how we can upsert a dataframe, while specifying the field names 
that need to be used
-for `recordKey => _row_key`, `partitionPath => partition` and `precombineKey 
=> timestamp`
+The `hudi-spark` module offers the DataSource API to write (and read) a Spark 
DataFrame into a Hudi table. There are a number of options available:
 
+**`HoodieWriteConfig`**:
+
+**TABLE_NAME** (Required)<br>
+
+
+**`DataSourceWriteOptions`**:
+
+**RECORDKEY_FIELD_OPT_KEY** (Required): Primary key field(s). Nested fields 
can be specified using the dot notation eg: `a.b.c`. When using multiple 
columns as primary key use comma separated notation, eg: 
`"col1,col2,col3,etc"`. Single or multiple columns as primary key specified by 
`KEYGENERATOR_CLASS_OPT_KEY` property.<br>
+Default value: `"uuid"`<br>
+
+**PARTITIONPATH_FIELD_OPT_KEY** (Required): Columns to be used for 
partitioning the table. To prevent partitioning, provide empty string as value 
eg: `""`. Specify partitioning/no partitioning using 
`KEYGENERATOR_CLASS_OPT_KEY` and if using hive 
`HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY`<br>
+Default value: `"partitionpath"`<br>
+
+**PRECOMBINE_FIELD_OPT_KEY** (Required): When two records have the same key 
value, the record with the largest value from the field specified will be 
choosen.<br>
+Default value: `"ts"`<br>
+
+**OPERATION_OPT_KEY**: The [write operations](#write-operations) to use.<br>
+Available values:<br>
+`UPSERT_OPERATION_OPT_VAL` (default), `BULK_INSERT_OPERATION_OPT_VAL`, 
`INSERT_OPERATION_OPT_VAL`, `DELETE_OPERATION_OPT_VAL`
+
+**TABLE_TYPE_OPT_KEY**: The [type of table](/docs/concepts.html#table-types) 
to write to. Note: After the initial creation of a table, this value must stay 
consistent when writing to (updating) the table using the Spark 
`SaveMode.Append` mode.<br>
+Available values:<br>
+[`COW_TABLE_TYPE_OPT_VAL`](/docs/concepts.html#copy-on-write-table) (default), 
[`MOR_TABLE_TYPE_OPT_VAL`](/docs/concepts.html#merge-on-read-table)
+
+**KEYGENERATOR_CLASS_OPT_KEY**: Key generator class, that will extract the key 
out of incoming record. If single column key use `SimpleKeyGenerator`. For 
multiple column keys use `ComplexKeyGenerator`. Note: A custom key generator 
class can be written/provided here as well. Primary key columns should be 
provided via `RECORDKEY_FIELD_OPT_KEY` option.<br>
+Available values:<br>
+`classOf[SimpleKeyGenerator].getName` (default), 
`classOf[NonpartitionedKeyGenerator].getName` (Non-partitioned tables can 
currently only have a single key column, 
[HUDI-1053](https://issues.apache.org/jira/browse/HUDI-1053)), 
`classOf[ComplexKeyGenerator].getName`
+
+
+**HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY**: Specify if the table should or 
should not be partitioned.<br>

Review comment:
       Also adding clarity that this is only necessary when using hive.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [hudi] afeldman1 commented on a change in pull request #1761: [MINOR] Add documentation for using multi-column table keys and for n…

Reply via email to