[
https://issues.apache.org/jira/browse/HUDI-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17390829#comment-17390829
]
ASF GitHub Bot commented on HUDI-2208:
--------------------------------------
nsivabalan commented on a change in pull request #3328:
URL: https://github.com/apache/hudi/pull/3328#discussion_r680212711
##########
File path:
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/InsertIntoHoodieTableCommand.scala
##########
@@ -209,19 +209,32 @@ object InsertIntoHoodieTableCommand {
.getOrElse(INSERT_DROP_DUPS_OPT_KEY.defaultValue)
.toBoolean
- val operation = if (isOverwrite) {
- if (table.partitionColumnNames.nonEmpty) {
- INSERT_OVERWRITE_OPERATION_OPT_VAL // overwrite partition
- } else {
- INSERT_OPERATION_OPT_VAL
+ val enableBulkInsert = parameters.getOrElse(DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.key,
+ DataSourceWriteOptions.SQL_ENABLE_BULK_INSERT.defaultValue()).toBoolean
+ val isPartitionedTable = table.partitionColumnNames.nonEmpty
+ val isPrimaryKeyTable = primaryColumns.nonEmpty
+ val operation =
+ (isPrimaryKeyTable, enableBulkInsert, isOverwrite, dropDuplicate) match {
+ case (true, true, _, _) =>
+ throw new IllegalArgumentException(s"Table with primaryKey can not use bulk insert.")
+ case (_, true, true, _) if isPartitionedTable =>
+ throw new IllegalArgumentException(s"Insert Overwrite Partition can not use bulk insert.")
+ case (_, true, _, true) =>
+ throw new IllegalArgumentException(s"Bulk insert cannot support drop duplication." +
+ s" Please disable $INSERT_DROP_DUPS_OPT_KEY and try again.")
+ // if enableBulkInsert is true, use bulk insert for the insert overwrite non-partitioned table.
+ case (_, true, true, _) if !isPartitionedTable => BULK_INSERT_OPERATION_OPT_VAL
+ // insert overwrite partition
+ case (_, _, true, _) if isPartitionedTable => INSERT_OVERWRITE_OPERATION_OPT_VAL
+ // insert overwrite table
+ case (_, _, true, _) if !isPartitionedTable => INSERT_OVERWRITE_TABLE_OPERATION_OPT_VAL
+ // if the table has primaryKey and the dropDuplicate has disable, use the upsert operation
+ case (true, false, false, false) => UPSERT_OPERATION_OPT_VAL
+ // if enableBulkInsert is true and the table is non-primaryKeyed, use the bulk insert operation
+ case (false, true, _, _) => BULK_INSERT_OPERATION_OPT_VAL
+ // for the rest case, use the insert operation
+ case (_, _, _, _) => INSERT_OPERATION_OPT_VAL
Review comment:
Here are my thoughts on choosing the right operation. Having too many case
statements complicates things and is error-prone. As I mentioned earlier, we
should do any valid conversions in HoodieSparkSqlWriter and keep here only
those that apply just to SQL DML.
Anyway, here is one simplified approach. I am ignoring primary-key vs.
non-primary-key tables for now; we can come back to that once we have
consensus on this.
We need just two configs:
hoodie.sql.enable.bulk_insert (default false)
hoodie.sql.overwrite.entire.table (default true)
SQL syntax allows two commands, "INSERT INTO" and "INSERT OVERWRITE", and
these need to map to 4 operations on the Hudi end (insert, bulk_insert,
insert_overwrite and insert_overwrite_table):
"INSERT" with no other configs set -> insert operation.
"INSERT" with enable bulk_insert set -> bulk_insert.
"INSERT OVERWRITE" with no other configs set -> insert_overwrite_table operation.
"INSERT OVERWRITE" with hoodie.sql.overwrite.entire.table = false -> insert_overwrite operation.
"INSERT OVERWRITE" with enable bulk_insert set -> bulk_insert. Pass the right save mode to HoodieSparkSqlWriter.
"INSERT OVERWRITE" with enable bulk_insert set and hoodie.sql.overwrite.entire.table = false -> bulk_insert. Pass the right save mode to HoodieSparkSqlWriter.
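For illustration only, the mapping above could be sketched roughly as follows. This is not code from the PR: the object name SqlWriteOperationSketch, the function chooseOperation and its parameter names are hypothetical, and the operation names are plain strings rather than the DataSourceWriteOptions constants. The save mode to pass to HoodieSparkSqlWriter is left out because the comment does not say which one applies. It also assumes that hoodie.sql.overwrite.entire.table has no effect on a plain "INSERT".

```scala
// Hypothetical sketch of the simplified mapping described in the review comment.
// None of these names come from the Hudi code base.
object SqlWriteOperationSketch {
  // Operation names as written in the review comment.
  val Insert = "insert"
  val BulkInsert = "bulk_insert"
  val InsertOverwrite = "insert_overwrite"
  val InsertOverwriteTable = "insert_overwrite_table"

  /**
   * isOverwrite          true for "INSERT OVERWRITE", false for "INSERT INTO"
   * enableBulkInsert     value of hoodie.sql.enable.bulk_insert (default false)
   * overwriteEntireTable value of hoodie.sql.overwrite.entire.table (default true)
   */
  def chooseOperation(
      isOverwrite: Boolean,
      enableBulkInsert: Boolean = false,
      overwriteEntireTable: Boolean = true): String =
    (isOverwrite, enableBulkInsert, overwriteEntireTable) match {
      // Bulk insert wins for both commands; the save mode is chosen elsewhere.
      case (_, true, _)         => BulkInsert
      // Plain INSERT (assumption: the overwrite config is ignored here).
      case (false, false, _)    => Insert
      // INSERT OVERWRITE replaces the whole table by default.
      case (true, false, true)  => InsertOverwriteTable
      // INSERT OVERWRITE with overwrite.entire.table = false.
      case (true, false, false) => InsertOverwrite
    }
}
```

With only three inputs and four cases, every combination of command and config is covered without guard clauses, which is the simplification being argued for.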
> [SQL] Support Bulk Insert For Spark Sql
> ---------------------------------------
>
> Key: HUDI-2208
> URL: https://issues.apache.org/jira/browse/HUDI-2208
> Project: Apache Hudi
> Issue Type: Sub-task
> Reporter: pengzhiwei
> Assignee: pengzhiwei
> Priority: Blocker
> Labels: pull-request-available, release-blocker
>
> Support the bulk insert for spark sql