ehurheap commented on issue #6194:
URL: https://github.com/apache/hudi/issues/6194#issuecomment-1212230365

   @KnightChess, yes, querying the table returns duplicate rows:
   
   ```
     val path="s3://<bucketpath>/tables/events"
     val events = spark.read.format("hudi").option("hoodie.datasource.query.type", "read_optimized").load(path)
   
     events.createOrReplaceTempView("events")
     val dupeQuery =
       """select env_id, event_id, user_id, count(*) from events
         | where env_id = 123 and week = '20220711'
         | group by env_id, event_id, user_id
         | having count(*) > 1
         |""".stripMargin
   
   val res = spark.sql(dupeQuery)
   res: org.apache.spark.sql.DataFrame = [env_id: bigint, event_id: bigint ... 2 more fields]
   
     scala> res.show
     +---------+----------------+----------------+--------+
     |   env_id|        event_id|         user_id|count(1)|
     +---------+----------------+----------------+--------+
     |      123|4401289435098557|3813718218593807|       2|
     |      123|7627782625576713|4299498150167280|       2|
     |      123|7972131523381176|4461192992664821|       2|
   ...
     +---------+----------------+----------------+--------+
     only showing top 20 rows
   ```
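
   As a possible follow-up, here is a sketch that joins the duplicated keys back to the table and pulls Hudi's metadata columns, to see which commits and files hold each copy. It assumes the `events` view from the session above and that the table was written with metadata fields populated (the Hudi default); `dupeDetails` is just an illustrative name.

   ```
     // Sketch only: list every physical copy of each duplicated key,
     // along with the Hudi commit and file it was read from.
     val dupeDetails = spark.sql(
       """select e._hoodie_record_key, e._hoodie_commit_time, e._hoodie_file_name,
         |       e.env_id, e.event_id, e.user_id
         |  from events e
         |  join (select env_id, event_id, user_id from events
         |         where env_id = 123 and week = '20220711'
         |         group by env_id, event_id, user_id
         |        having count(*) > 1) d
         |    on e.env_id = d.env_id
         |   and e.event_id = d.event_id
         |   and e.user_id = d.user_id
         | where e.week = '20220711'
         | order by e.event_id, e._hoodie_commit_time
         |""".stripMargin)

     // Print full column values so file names are not truncated.
     dupeDetails.show(20, false)
   ```

   If both copies of a key share the same `_hoodie_record_key` but sit in different files, that would point at the write path rather than at the source data; if the record keys differ, the duplicates may come from upstream.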
   

