bithw1 opened a new issue, #12168:
URL: https://github.com/apache/hudi/issues/12168
I looked into the source code of Hudi 0.15,
```
@Deprecated
val SQL_INSERT_MODE: ConfigProperty[String] = ConfigProperty
  .key("hoodie.sql.insert.mode")
  .defaultValue("upsert")
  .markAdvanced()
  .deprecatedAfter("0.14.0")
  .withDocumentation("Insert mode when insert data to pk-table. The optional modes are: upsert, strict and non-strict." +
    "For upsert mode, insert statement do the upsert operation for the pk-table which will update the duplicate record." +
    "For strict mode, insert statement will keep the primary key uniqueness constraint which do not allow duplicate record." +
    "While for non-strict mode, hudi just do the insert operation for the pk-table. This config is deprecated as of 0.14.0. Please use " +
    "hoodie.spark.sql.insert.into.operation and hoodie.datasource.insert.dup.policy as you see fit.")
```
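To illustrate what I mean (a sketch; the table name `t` and its data are hypothetical), with this pre-0.14.0 default an `INSERT INTO` on a pk-table behaves as an upsert:

```sql
-- Hudi < 0.14.0: with the default hoodie.sql.insert.mode = upsert,
-- re-inserting a row with an existing primary key updates the stored
-- record instead of creating a duplicate. (Table `t` is hypothetical.)
set hoodie.sql.insert.mode = upsert;  -- the default, shown for clarity
insert into t values (1, 'a');  -- writes the record with key 1
insert into t values (1, 'b');  -- updates the record with key 1, no duplicate
```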
With this option, prior to Hudi 0.14.0, the default behavior of a Spark SQL `INSERT INTO` statement was to do an `upsert`. The documentation says this option has been replaced by two options: `hoodie.spark.sql.insert.into.operation` and `hoodie.datasource.insert.dup.policy`.
The definition of `hoodie.spark.sql.insert.into.operation` is as follows; note that its default value has been changed to `insert`:
```
val SPARK_SQL_INSERT_INTO_OPERATION: ConfigProperty[String] = ConfigProperty
  .key("hoodie.spark.sql.insert.into.operation")
  .defaultValue(WriteOperationType.INSERT.value())
  .withValidValues(WriteOperationType.BULK_INSERT.value(), WriteOperationType.INSERT.value(), WriteOperationType.UPSERT.value())
  .markAdvanced()
  .sinceVersion("0.14.0")
  .withDocumentation("Sql write operation to use with INSERT_INTO spark sql command. This comes with 3 possible values, bulk_insert, " +
    "insert and upsert. bulk_insert is generally meant for initial loads and is known to be performant compared to insert. But bulk_insert may not " +
    "do small file management. If you prefer hudi to automatically manage small files, then you can go with \"insert\". There is no precombine " +
    "(if there are duplicates within the same batch being ingested, same dups will be ingested) with bulk_insert and insert and there is no index " +
    "look up as well. If you may use INSERT_INTO for mutable dataset, then you may have to set this config value to \"upsert\". With upsert, you will " +
    "get both precombine and updates to existing records on storage is also honored. If not, you may see duplicates. ")
```
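If my reading is right, a user on 0.14.0+ who wants the old upsert semantics back has to set this option explicitly (a sketch, same hypothetical table `t` as before):

```sql
-- Hudi >= 0.14.0: restore the pre-0.14 behavior of INSERT INTO on a
-- pk-table by overriding the new default (insert) with upsert.
set hoodie.spark.sql.insert.into.operation = upsert;
insert into t values (1, 'b');  -- updates the existing record with key 1
```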
The definition of `hoodie.datasource.insert.dup.policy` is as follows; its default value is `none`:
```
val INSERT_DUP_POLICY: ConfigProperty[String] = ConfigProperty
  .key("hoodie.datasource.insert.dup.policy")
  .defaultValue(NONE_INSERT_DUP_POLICY)
  .withValidValues(NONE_INSERT_DUP_POLICY, DROP_INSERT_DUP_POLICY, FAIL_INSERT_DUP_POLICY)
  .markAdvanced()
  .sinceVersion("0.14.0")
  .withDocumentation("**Note** This is only applicable to Spark SQL writing.<br />When operation type is set to \"insert\", users can optionally enforce a dedup policy. This policy will be employed "
    + " when records being ingested already exists in storage. Default policy is none and no action will be taken. Another option is to choose " +
    " \"drop\", on which matching records from incoming will be dropped and the rest will be ingested. Third option is \"fail\" which will " +
    "fail the write operation when same records are re-ingested. In other words, a given record as deduced by the key generation policy " +
    "can be ingested only once to the target table of interest.")
```
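Alternatively, this policy could be combined with the default `insert` operation to at least avoid silent duplicates (again a sketch with a hypothetical table `t`; per the documentation this only applies to Spark SQL writes):

```sql
-- Hudi >= 0.14.0: keep the default operation (insert) but drop incoming
-- records whose keys already exist in storage, rather than duplicating them.
set hoodie.spark.sql.insert.into.operation = insert;
set hoodie.datasource.insert.dup.policy = drop;
insert into t values (1, 'c');  -- dropped if a record with key 1 already exists
```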
With the above options, the default behavior of a Spark SQL `INSERT INTO` statement has been changed from `upsert` to `insert` (which, with default settings, may introduce duplicates).
I am not sure whether I have understood this correctly. If I have, then this change is breaking: users on older versions rely on Spark SQL `INSERT INTO` to do an upsert, which does not introduce duplicates, but after upgrading to 0.14.0+ the default behavior is `insert`, which may introduce duplicates.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]