[jira] [Commented] (HUDI-1608) MOR fetches all records for read optimized query w/ spark sql

sivabalan narayanan (Jira) Thu, 11 Feb 2021 07:47:23 -0800


    [ 
https://issues.apache.org/jira/browse/HUDI-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17283099#comment-17283099
 ]


sivabalan narayanan commented on HUDI-1608:
-------------------------------------------

I did try that as well, Balaji; it didn't work. I got a chance to sync up with 
Vinoth on this. He mentioned that if the incoming batch contains inserts, read 
optimized queries may not work as expected on MOR, but if the batch contains 
only updates, they work as expected. COW works as expected in all cases. So I 
will keep this open for now. Based on customer needs, we can revisit how to fix 
this. 

> MOR fetches all records for read optimized query w/ spark sql
> -------------------------------------------------------------
>
>                 Key: HUDI-1608
>                 URL: https://issues.apache.org/jira/browse/HUDI-1608
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: Spark Integration
>    Affects Versions: 0.7.0
>            Reporter: sivabalan narayanan
>            Priority: Major
>              Labels: pull-request-available, sev:critical, user-support-issues
>
> Script to reproduce in local spark:
>  
> [https://gist.github.com/nsivabalan/7250b794788516f1aec35650c2632364]
>  
> ```
> scala> spark.sql("select _hoodie_commit_time, _hoodie_record_key, 
> _hoodie_partition_path, id, __op from hudi_trips_snapshot order by 
> _hoodie_record_key").show(false)
> +-------------------+------------------+----------------------+---+----+
> |_hoodie_commit_time|_hoodie_record_key|_hoodie_partition_path|id |__op|
> +-------------------+------------------+----------------------+---+----+
> |20210210070347     |1                 |1970-01-01            |1  |null|
> |20210210070347     |2                 |1970-01-01            |2  |null|
> |20210210070347     |3                 |2020-01-04            |3  |D   |
> |20210210070347     |4                 |1998-04-13            |4  |I   |
> |20210210070347     |5                 |2020-01-01            |5  |I   |
> |20210210070445     |6                 |1998-04-13            |6  |I   |
> +-------------------+------------------+----------------------+---+----+
> ```
> After an upsert, the read optimized query returns records from both C1 and 
> C2. Also, I don't find any log files in the partitions. All of them are 
> parquet files. 
>  
> ls /tmp/hudi_trips_cow/1998-04-13/
> 0d1e6a84-d036-42e9-806e-a3075b6bc677-0_1-23-12025_20210210065058.parquet
> 0d1e6a84-d036-42e9-806e-a3075b6bc677-0_1-61-25595_20210210065127.parquet
> ls /tmp/hudi_trips_cow/1970-01-01/
> 7b836833-a656-485d-967a-871bdc653dc3-0_2-61-25596_20210210065127.parquet
> 7b836833-a656-485d-967a-871bdc653dc3-0_3-23-12027_20210210065058.parquet
>  
> Source of the issue: [https://github.com/apache/hudi/issues/2255]
>  
>  
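
For reference, each partition listed in the quoted description holds two parquet
files that share the same leading ID and differ in the trailing timestamp.
Assuming these names follow Hudi's usual base file layout,
<fileId>_<writeToken>_<instantTime>.parquet, the sketch below groups the listed
names by file ID and keeps the newest one per group, which is roughly the
selection a query limited to the latest base files would make. BaseFile, parse
and latestPerFileGroup are hypothetical helper names for illustration, not Hudi
APIs, and this is not a reproduction of the bug itself.

```scala
// Sketch only: assumes base file names look like
// <fileId>_<writeToken>_<instantTime>.parquet. Helper names are hypothetical.
object BaseFileNames {
  final case class BaseFile(fileId: String, writeToken: String, instantTime: String)

  // Returns None for anything that is not a three-part ".parquet" name.
  def parse(name: String): Option[BaseFile] =
    if (!name.endsWith(".parquet")) None
    else name.stripSuffix(".parquet").split("_") match {
      case Array(fileId, writeToken, instantTime) =>
        Some(BaseFile(fileId, writeToken, instantTime))
      case _ => None
    }

  // Keep the newest base file per file ID. Instant times are fixed-width
  // digit strings, so string comparison matches chronological order.
  def latestPerFileGroup(names: Seq[String]): Map[String, BaseFile] =
    names.flatMap(parse).groupBy(_.fileId).map {
      case (fileId, files) => fileId -> files.maxBy(_.instantTime)
    }

  // File names copied from the two partition listings in the description.
  val listing: Seq[String] = Seq(
    "0d1e6a84-d036-42e9-806e-a3075b6bc677-0_1-23-12025_20210210065058.parquet",
    "0d1e6a84-d036-42e9-806e-a3075b6bc677-0_1-61-25595_20210210065127.parquet",
    "7b836833-a656-485d-967a-871bdc653dc3-0_2-61-25596_20210210065127.parquet",
    "7b836833-a656-485d-967a-871bdc653dc3-0_3-23-12027_20210210065058.parquet"
  )

  def main(args: Array[String]): Unit =
    latestPerFileGroup(listing).values.foreach(println)
}
```

Under that naming assumption, the four listed files collapse to two file groups,
and the newer file in each group carries the later of the two timestamps.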



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
