bvaradar commented on issue #2149:
URL: https://github.com/apache/hudi/issues/2149#issuecomment-706764303
@ashishmgofficial : It looks like the JSON data and the Avro schema do not
match. When I read the file through Spark directly (please see below), I get
a different schema than the one you provided. This is because Debezium is
configured to write in "JSON_SCHEMA" mode, which I think is the default. This
mode inlines both data and schema, which is inefficient in space.
Since you are already managing Avro schemas, can you configure Debezium to
write Avro records directly rather than JSON? In my experiments (with a custom
schema), I saw an 8x speedup in Debezium by changing the format from
json_schema to avro. If you still want to write JSON, disable the inline
schema by setting the Debezium configs below to false:
key.converter.schemas.enable
value.converter.schemas.enable
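
For illustration, a minimal sketch of how these settings might appear in a Kafka Connect worker or connector configuration. It assumes the stock Kafka Connect JsonConverter is in use; the commented-out Avro alternative assumes Confluent's AvroConverter and a schema registry, and its URL is a hypothetical placeholder, not something taken from this issue.

```properties
# Option 1: keep JSON output but stop inlining the schema in each record.
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=false

# Option 2 (assumption: Confluent AvroConverter and a schema registry are available).
# The registry URL below is a placeholder; replace it with your own.
# key.converter=io.confluent.connect.avro.AvroConverter
# value.converter=io.confluent.connect.avro.AvroConverter
# key.converter.schema.registry.url=http://schema-registry.example:8081
# value.converter.schema.registry.url=http://schema-registry.example:8081
```

Whether these belong at the worker level or are overridden per connector depends on your deployment, so treat the placement as something to verify.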
==========
scala> val df = spark.read.json("file:///var/hoodie/ws/docker/inp.json")
df: org.apache.spark.sql.DataFrame = [after: struct<Value:
struct<case_individual_id: struct<int: bigint>, flag: struct<string: string>
... 5 more fields>>, before: string ... 4 more fields]
scala> df.printSchema()
root
 |-- after: struct (nullable = true)
 |    |-- Value: struct (nullable = true)
 |    |    |-- case_individual_id: struct (nullable = true)
 |    |    |    |-- int: long (nullable = true)
 |    |    |-- flag: struct (nullable = true)
 |    |    |    |-- string: string (nullable = true)
 |    |    |-- inc_id: long (nullable = true)
 |    |    |-- last_modified_ts: long (nullable = true)
 |    |    |-- violation_code: struct (nullable = true)
 |    |    |    |-- string: string (nullable = true)
 |    |    |-- violation_desc: struct (nullable = true)
 |    |    |    |-- string: string (nullable = true)
 |    |    |-- year: struct (nullable = true)
 |    |    |    |-- int: long (nullable = true)
 |-- before: string (nullable = true)
 |-- op: string (nullable = true)
 |-- source: struct (nullable = true)
 |    |-- connector: string (nullable = true)
 |    |-- db: string (nullable = true)
 |    |-- lsn: struct (nullable = true)
 |    |    |-- long: long (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- schema: string (nullable = true)
 |    |-- snapshot: struct (nullable = true)
 |    |    |-- string: string (nullable = true)
 |    |-- table: string (nullable = true)
 |    |-- ts_ms: long (nullable = true)
 |    |-- txId: struct (nullable = true)
 |    |    |-- long: long (nullable = true)
 |    |-- version: string (nullable = true)
 |    |-- xmin: string (nullable = true)
 |-- transaction: string (nullable = true)
 |-- ts_ms: struct (nullable = true)
 |    |-- long: long (nullable = true)