[GitHub] [hudi] bvaradar commented on issue #2149: Help with Reading Kafka topic written using Debezium Connector - Deltastreamer

GitBox Sun, 11 Oct 2020 13:32:36 -0700


bvaradar commented on issue #2149:
URL: https://github.com/apache/hudi/issues/2149#issuecomment-706764303



   @ashishmgofficial : It looks like the JSON data and the Avro schema do not
match. When I read the file directly through Spark (please see below), I get
a different schema from the one you provided. This is because Debezium is
configured to write in "JSON_SCHEMA" mode, which I think is the default. That
mode inlines both the data and the schema in every record, which is
inefficient in space.
   
   Since you are actually managing Avro schemas, can you configure Debezium to
write Avro records directly rather than JSON? In my experiments (with a custom
schema), I saw an 8x speedup in Debezium after changing the format from
json_schema to avro. If you still want to write JSON, disable the inline
schema by setting the following converter configs in your Debezium setup to
false:
       key.converter.schemas.enable
       value.converter.schemas.enable
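
   For illustration, a minimal sketch of how those two settings might look in
a Kafka Connect worker or connector configuration. The converter class names
are the standard Kafka Connect and Confluent ones; the Schema Registry URL in
the commented-out Avro alternative is a placeholder and assumes a registry is
available, which this thread does not confirm.

```properties
# Option 1: keep JSON, but drop the inlined schema from every record
key.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=false

# Option 2 (alternative): write Avro instead of JSON.
# Assumes a Confluent Schema Registry; the URL below is a placeholder.
# key.converter=io.confluent.connect.avro.AvroConverter
# key.converter.schema.registry.url=http://localhost:8081
# value.converter=io.confluent.connect.avro.AvroConverter
# value.converter.schema.registry.url=http://localhost:8081
```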
   
   
   ==========
   
   scala> val df = spark.read.json("file:///var/hoodie/ws/docker/inp.json")
   df: org.apache.spark.sql.DataFrame = [after: struct<Value: 
struct<case_individual_id: struct<int: bigint>, flag: struct<string: string> 
... 5 more fields>>, before: string ... 4 more fields]
   
   scala> df.printSchema()
   root
    |-- after: struct (nullable = true)
    |    |-- Value: struct (nullable = true)
    |    |    |-- case_individual_id: struct (nullable = true)
    |    |    |    |-- int: long (nullable = true)
    |    |    |-- flag: struct (nullable = true)
    |    |    |    |-- string: string (nullable = true)
    |    |    |-- inc_id: long (nullable = true)
    |    |    |-- last_modified_ts: long (nullable = true)
    |    |    |-- violation_code: struct (nullable = true)
    |    |    |    |-- string: string (nullable = true)
    |    |    |-- violation_desc: struct (nullable = true)
    |    |    |    |-- string: string (nullable = true)
    |    |    |-- year: struct (nullable = true)
    |    |    |    |-- int: long (nullable = true)
    |-- before: string (nullable = true)
    |-- op: string (nullable = true)
    |-- source: struct (nullable = true)
    |    |-- connector: string (nullable = true)
    |    |-- db: string (nullable = true)
    |    |-- lsn: struct (nullable = true)
    |    |    |-- long: long (nullable = true)
    |    |-- name: string (nullable = true)
    |    |-- schema: string (nullable = true)
    |    |-- snapshot: struct (nullable = true)
    |    |    |-- string: string (nullable = true)
    |    |-- table: string (nullable = true)
    |    |-- ts_ms: long (nullable = true)
    |    |-- txId: struct (nullable = true)
    |    |    |-- long: long (nullable = true)
    |    |-- version: string (nullable = true)
    |    |-- xmin: string (nullable = true)
    |-- transaction: string (nullable = true)
    |-- ts_ms: struct (nullable = true)
    |    |-- long: long (nullable = true)
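
   To make the mismatch concrete: in the schema above, several fields are
wrapped in a single-key struct named after their type (`int`, `string`,
`long`). The sketch below shows one way to flatten those wrappers with plain
column selection. It is only an illustration, not a Hudi or Debezium API; the
sample record is hypothetical (values are made up to fit the printed schema)
and only a few of the fields are selected.

```scala
import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.sql.functions.col

// Reuses the running session inside spark-shell, or starts a local one.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("flatten-type-wrappers")
  .getOrCreate()

// Hypothetical record shaped like the schema printed above; values are made up.
val sample = Seq(
  """{"before":null,"after":{"Value":{"inc_id":1,"year":{"int":2020},""" +
  """"case_individual_id":{"int":7},"flag":{"string":"Y"},""" +
  """"last_modified_ts":1602448356000}},"op":"c","ts_ms":{"long":1602448356123}}"""
)

val df = spark.read.json(spark.createDataset(sample)(Encoders.STRING))

// Reach through each type wrapper and give the column its plain name back.
val flat = df.select(
  col("op"),
  col("after.Value.inc_id").as("inc_id"),
  col("after.Value.case_individual_id.int").as("case_individual_id"),
  col("after.Value.year.int").as("year"),
  col("after.Value.flag.string").as("flag")
)

flat.printSchema()
```

   This only works around the wrappers on the reading side; writing Avro or
schema-less JSON from Debezium, as suggested above, avoids the problem at the
source.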


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
