sbernauer opened a new issue #2224:
URL: https://github.com/apache/hudi/issues/2224


   **Describe the problem you faced**
   
   Hi hudi team,
   
   we have a HoodieDeltaStreamer streaming Avro records from Kafka into a COW table on S3. It uses time-based partitioning and holds about two months of data (roughly 2 billion records).
   
   ```
   HUDI_SOURCE_RECORD_KEY: "sourceEventHeader.happenedTimestamp,sourceEventHeader.eventId"
   HUDI_SOURCE_PARTITION_PATH: "sourceEventHeader.happenedTimestamp:TIMESTAMP"
   HUDI_SOURCE_ORDERING_FIELD: "sourceEventHeader.happenedTimestamp"
   HUDI_OP: "INSERT"
   HUDI_DROP_DUPLICATES: "true"
   HUDI_TABLE_TYPE: "COPY_ON_WRITE"
   ```
   
   ```
   hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
   hoodie.deltastreamer.keygen.timebased.timestamp.type=EPOCHMILLISECONDS
   hoodie.deltastreamer.keygen.timebased.input.timezone=UTC
   hoodie.deltastreamer.keygen.timebased.output.timezone=Europe/Berlin
   hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd
   ```
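
   For reference, the key generator settings above turn the epoch-millisecond `happenedTimestamp` into a `yyyy/MM/dd` partition path in the Europe/Berlin timezone. A minimal plain-Python sketch of that conversion (an illustration of the configured behavior, not Hudi's actual key generator code):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

def partition_path(epoch_millis: int) -> str:
    """Mimic the config above: EPOCHMILLISECONDS input in UTC,
    formatted as yyyy/MM/dd in the Europe/Berlin output timezone."""
    ts = datetime.fromtimestamp(epoch_millis / 1000, tz=timezone.utc)
    return ts.astimezone(ZoneInfo("Europe/Berlin")).strftime("%Y/%m/%d")

# 2020-11-02 09:21:52 UTC lands in partition 2020/11/02
print(partition_path(1604308912000))
```

   Note that because the output timezone differs from UTC, a record with a late-evening UTC timestamp can land in the next day's partition.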
   
   When using a snapshot query everything works fine; however, when trying an incremental query, things get strange.
   
   ```
   from pyspark.sql import functions as f

   incremental_read_options = {
       'hoodie.datasource.query.type':             'incremental',
       'hoodie.datasource.read.begin.instanttime': '20100000000000',
       'hoodie.datasource.read.end.instanttime':   '20300000000000',
       'hoodie.datasource.read.incr.path.glob':    ''  # Default value
   }

   hudiIncQueryDF = spark.read.format("hudi") \
       .options(**incremental_read_options) \
       .load("s3a://<myevent>-v1.live.bs/events/*/*/*/*")

   print(hudiIncQueryDF.count())

   hudiIncQueryDF.groupBy("_hoodie_partition_path").count() \
       .orderBy(f.desc("_hoodie_partition_path")).show(10)

   hudiIncQueryDF.groupBy("_hoodie_file_name").count() \
       .orderBy(f.desc("count")).show(10, False)
   
   2289893
   +----------------------+-------+
   |_hoodie_partition_path|  count|
   +----------------------+-------+
   |            2020/11/02|2289893|
   +----------------------+-------+

   +-----------------------------------------------------------------------------+------+
   |_hoodie_file_name                                                            |count |
   +-----------------------------------------------------------------------------+------+
   |4c76332b-bebd-4d81-aefc-67c16476c27c-0_0-111193-819648_20201102092152.parquet|106409|
   |363ae91d-5b96-41e4-92ac-0ca4f230bac3-0_0-111929-825141_20201102095755.parquet|100267|
   |363ae91d-5b96-41e4-92ac-0ca4f230bac3-0_0-111806-824223_20201102095150.parquet|94803 |
   |4c76332b-bebd-4d81-aefc-67c16476c27c-0_0-111314-820561_20201102092749.parquet|91442 |
   |4c76332b-bebd-4d81-aefc-67c16476c27c-0_0-111396-821171_20201102093151.parquet|90420 |
   |4c76332b-bebd-4d81-aefc-67c16476c27c-0_0-111560-822391_20201102093950.parquet|87260 |
   |4c76332b-bebd-4d81-aefc-67c16476c27c-0_0-111273-820256_20201102092549.parquet|86747 |
   |4c76332b-bebd-4d81-aefc-67c16476c27c-0_0-111029-818428_20201102091350.parquet|86260 |
   |363ae91d-5b96-41e4-92ac-0ca4f230bac3-0_0-112009-825751_20201102100152.parquet|85827 |
   |4c76332b-bebd-4d81-aefc-67c16476c27c-0_0-111111-819038_20201102091749.parquet|85435 |
   +-----------------------------------------------------------------------------+------+
   ```
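
   For clarity on what the query should select: an incremental query walks the commit timeline and returns data written by commits whose instant time lies in the range (begin, end] — the instant is also embedded in the file names above (e.g. `..._20201102092152.parquet`). A sketch of that filtering logic in plain Python (an illustration of the expected semantics, not Hudi's implementation):

```python
def incremental_instants(timeline, begin, end):
    """Keep commit instants strictly after `begin` and up to `end`.
    Instants are yyyyMMddHHmmss strings, so lexicographic comparison
    matches chronological order."""
    return [t for t in timeline if begin < t <= end]

# With the very wide 2010..2030 range used in the query above, every
# commit on the timeline should qualify.
timeline = ["20200929171013", "20201011184143", "20201102092152"]
print(incremental_instants(timeline, "20100000000000", "20300000000000"))
```

   Given that every commit falls inside the requested range, all partitions would be expected in the result, not just 2020/11/02.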
   
   **To Reproduce**
   
   See the code above.
   
   **Expected behavior**
   
   I would expect all ~60 days (with their corresponding partitions) to be shown, i.e. all ~2 billion records in total.
   The exact same code, only replacing 'incremental' with 'snapshot', produces the expected result.
   Thanks a lot in advance!
   
   ```
   2046927254
   +----------------------+--------+
   |_hoodie_partition_path|   count|
   +----------------------+--------+
   |            2020/11/02|10777825|
   |            2020/11/01|28509594|
   |            2020/10/31|26127712|
   |            2020/10/30|35416532|
   |            2020/10/29|35729284|
   |            2020/10/28|32331441|
   |            2020/10/27|31593641|
   |            2020/10/26|40874884|
   |            2020/10/25|44914370|
   |            2020/10/24|41641045|
   +----------------------+--------+
   
   
   +----------------------------------------------------------------------------+-------+
   |_hoodie_file_name                                                           |count  |
   +----------------------------------------------------------------------------+-------+
   |1d52ae83-b044-4024-b699-c97c7cfdeb3d-0_0-4812-39321_20200929171013.parquet  |2000000|
   |4e383b84-fe4d-468a-9ae9-79a86e15dca2-0_0-1627-12826_20201011184143.parquet  |2000000|
   |4c96129e-5136-4d65-97f7-0c0374a64d63-0_0-12625-107190_20200930060544.parquet|1999999|
   |6be0ac67-2aa6-4a73-a002-9633e6b1fdcd-0_0-8870-72067_20200929235515.parquet  |1999999|
   |04066538-0fbb-48b3-a234-8136b9b8baea-0_0-11334-96754_20200930035756.parquet |1999999|
   |ae309bfb-5e4b-4ff5-9552-375830514c1c-0_0-1536-12474_20200929113828.parquet  |1999999|
   |79d7f72a-42ac-4654-9645-bbe6a3e4bc9e-0_0-8441-68624_20200929231226.parquet  |1999999|
   |bef99bc3-51c1-47d1-b7fa-c8e7bc64cdac-0_0-2294-18053_20201011194252.parquet  |1999999|
   |2fd343a1-25b9-4cc4-ac68-e382e42a7496-0_0-11100-94868_20200930033558.parquet |1999999|
   |d09aac51-f2fa-4917-a556-7118b86c507a-0_0-4695-38310_20200929165815.parquet  |1999999|
   +----------------------------------------------------------------------------+-------+
   ```
   
   **Environment Description**
   
   * Hudi version : 0.6.0
   
   * Spark version : 3.0.0
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : yes
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

