[ https://issues.apache.org/jira/browse/HUDI-311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16982449#comment-16982449 ]
Vinoth Chandar commented on HUDI-311:
-------------------------------------
This is what we get as parquet files on S3 for a bulk load followed by an
insert, update, delete sequence off a MySQL CDC stream:
{code}
scala> spark.read.parquet("file:///home/vinoth/Downloads/LOAD00000001.parquet").show(10, false)
+-------+----------+-------------+-------------+
|user_id|first_name|last_name    |company      |
+-------+----------+-------------+-------------+
|1      |vinoth    |chandar      |confluent inc|
|2      |balaji    |varadarajan  |uber         |
|3      |sudha     |saktheeswaran|uber         |
+-------+----------+-------------+-------------+

scala> spark.read.parquet("file:///home/vinoth/Downloads/20191126-124151666.parquet").show(10, false)
+---+-------+----------+-----------+---------+
|Op |user_id|first_name|last_name  |company  |
+---+-------+----------+-----------+---------+
|I  |4      |prasanna  |rajaperumal|snowflake|
+---+-------+----------+-----------+---------+

scala> spark.read.parquet("file:///home/vinoth/Downloads/20191126-124528981.parquet").show(10, false)
+---+-------+----------+---------+-------+
|Op |user_id|first_name|last_name|company|
+---+-------+----------+---------+-------+
|U  |1      |vinoth    |chandar  |????   |
+---+-------+----------+---------+-------+

scala> spark.read.parquet("file:///home/vinoth/Downloads/20191126-125001909.parquet").show(10, false)
+---+-------+----------+-----------+---------+
|Op |user_id|first_name|last_name  |company  |
+---+-------+----------+-----------+---------+
|D  |4      |prasanna  |rajaperumal|snowflake|
+---+-------+----------+-----------+---------+
scala>
{code}
We need:
- a special payload implementation that looks at the Op type and issues deletes
- a custom SQL transformer that adds the Op column if it is not present (it
seems to be missing from the bulk load schema)
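
To make the two pieces concrete, here is a rough sketch in plain Scala of the logic they would need. It is not the Hudi payload or transformer API: ChangeRow, DmsSketch, withOp, applyChange and replay are hypothetical names made up for illustration. Defaulting a missing Op to "I" for bulk load rows is also an assumption; all that is known so far is that the column is absent there.

```scala
// Hypothetical model of one DMS change row; not a Hudi or DMS class.
final case class ChangeRow(
    op: Option[String], // "I", "U", "D", or None when the file has no Op column
    userId: Int,
    firstName: String,
    lastName: String,
    company: String)

object DmsSketch {
  // "Transformer" step: bulk load rows carry no Op, so give them one.
  // Using "I" as the default is an assumption.
  def withOp(row: ChangeRow): ChangeRow =
    if (row.op.isDefined) row else row.copy(op = Some("I"))

  // "Payload" step: a "D" removes the key, anything else upserts it.
  def applyChange(table: Map[Int, ChangeRow], row: ChangeRow): Map[Int, ChangeRow] =
    row.op match {
      case Some("D") => table - row.userId
      case _         => table.updated(row.userId, row)
    }

  // Replay changes in arrival order against an empty table keyed by user_id.
  def replay(rows: Seq[ChangeRow]): Map[Int, ChangeRow] =
    rows.map(withOp).foldLeft(Map.empty[Int, ChangeRow])(applyChange)

  // The rows from the four parquet files above, in the order they were written.
  val sampleRows: Seq[ChangeRow] = Seq(
    ChangeRow(None, 1, "vinoth", "chandar", "confluent inc"),
    ChangeRow(None, 2, "balaji", "varadarajan", "uber"),
    ChangeRow(None, 3, "sudha", "saktheeswaran", "uber"),
    ChangeRow(Some("I"), 4, "prasanna", "rajaperumal", "snowflake"),
    ChangeRow(Some("U"), 1, "vinoth", "chandar", "????"),
    ChangeRow(Some("D"), 4, "prasanna", "rajaperumal", "snowflake"))
}
```

Replaying sampleRows should leave user_id 1, 2 and 3 in the table, with user 1 carrying the updated company value and user 4 gone after the delete.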
cc [~uditme] [~rbhartia] Seems doable. Just FYI.
> Support AWS DMS source on DeltaStreamer
> ---------------------------------------
>
> Key: HUDI-311
> URL: https://issues.apache.org/jira/browse/HUDI-311
> Project: Apache Hudi (incubating)
> Issue Type: New Feature
> Components: deltastreamer
> Reporter: Vinoth Chandar
> Assignee: Vinoth Chandar
> Priority: Major
> Fix For: 0.5.1
>
>
> https://aws.amazon.com/dms/ seems like a one-stop shop for database change
> logs.