[jira] [Updated] (HUDI-1608) MOR fetches all records for read optimized query w/ spark sql

sivabalan narayanan (Jira) Wed, 10 Feb 2021 05:21:06 -0800


     [ 
https://issues.apache.org/jira/browse/HUDI-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


sivabalan narayanan updated HUDI-1608:
--------------------------------------
    Description: 
Script to reproduce in local spark:

 

[https://gist.github.com/nsivabalan/7250b794788516f1aec35650c2632364]

 

```
scala> spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, id, __op from hudi_trips_snapshot order by _hoodie_record_key").show(false)
+-------------------+------------------+----------------------+---+----+
|_hoodie_commit_time|_hoodie_record_key|_hoodie_partition_path|id |__op|
+-------------------+------------------+----------------------+---+----+
|20210210070347     |1                 |1970-01-01            |1  |null|
|20210210070347     |2                 |1970-01-01            |2  |null|
|20210210070347     |3                 |2020-01-04            |3  |D   |
|20210210070347     |4                 |1998-04-13            |4  |I   |
|20210210070347     |5                 |2020-01-01            |5  |I   |
|20210210070445     |6                 |1998-04-13            |6  |I   |
+-------------------+------------------+----------------------+---+----+
```

After an upsert, the read optimized query returns records from both commits (C1 and C2).

Also, I don't find any log files in the partitions; all of them are parquet files.

 

ls /tmp/hudi_trips_cow/1998-04-13/
0d1e6a84-d036-42e9-806e-a3075b6bc677-0_1-23-12025_20210210065058.parquet
0d1e6a84-d036-42e9-806e-a3075b6bc677-0_1-61-25595_20210210065127.parquet

ls /tmp/hudi_trips_cow/1970-01-01/
7b836833-a656-485d-967a-871bdc653dc3-0_2-61-25596_20210210065127.parquet
7b836833-a656-485d-967a-871bdc653dc3-0_3-23-12027_20210210065058.parquet
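
In each partition listed above, the two parquet files share the same leading ID and differ only in the middle token and the trailing timestamp, which suggests two versions of one file group rather than separate data. The following is a minimal sketch of that reading, assuming base file names follow the layout `<fileId>_<writeToken>_<instantTime>.parquet`. `BaseFile` and `parseBaseFile` are hypothetical helpers written for illustration, not Hudi APIs. It groups the listed names by file ID and keeps the newest one per group, which is the set a read optimized query would be expected to read.

```scala
// Hypothetical helper type, not part of Hudi: the three parts of a base file
// name under the assumed layout <fileId>_<writeToken>_<instantTime>.parquet.
final case class BaseFile(fileId: String, writeToken: String, instantTime: String)

// Returns None for names that do not split into exactly three parts.
def parseBaseFile(name: String): Option[BaseFile] =
  name.stripSuffix(".parquet").split("_") match {
    case Array(fileId, token, instant) => Some(BaseFile(fileId, token, instant))
    case _                             => None
  }

// File names copied from the two partition listings in this report.
val listing = Seq(
  "0d1e6a84-d036-42e9-806e-a3075b6bc677-0_1-23-12025_20210210065058.parquet",
  "0d1e6a84-d036-42e9-806e-a3075b6bc677-0_1-61-25595_20210210065127.parquet",
  "7b836833-a656-485d-967a-871bdc653dc3-0_2-61-25596_20210210065127.parquet",
  "7b836833-a656-485d-967a-871bdc653dc3-0_3-23-12027_20210210065058.parquet"
)

// Keep only the newest file per file ID. Instant times are fixed-width
// timestamps, so plain string comparison orders them chronologically.
val latestPerGroup: Map[String, BaseFile] =
  listing.flatMap(parseBaseFile)
    .groupBy(_.fileId)
    .map { case (id, files) => id -> files.maxBy(_.instantTime) }

latestPerGroup.values.foreach(f => println(s"${f.fileId} -> ${f.instantTime}"))
```

For this listing, both file IDs resolve to instant 20210210065127, so under the stated assumption only the two files written at that instant hold the latest data for their groups.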

 

Source of the issue: [https://github.com/apache/hudi/issues/2255]

 

 

  was:
Script to reproduce in local spark:

[https://gist.github.com/nsivabalan/7250b794788516f1aec35650c2632364]

 

```
scala> spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, id, __op from hudi_trips_snapshot order by _hoodie_record_key").show(false)
+-------------------+------------------+----------------------+---+----+
|_hoodie_commit_time|_hoodie_record_key|_hoodie_partition_path|id |__op|
+-------------------+------------------+----------------------+---+----+
|20210210070347     |1                 |1970-01-01            |1  |null|
|20210210070347     |2                 |1970-01-01            |2  |null|
|20210210070347     |3                 |2020-01-04            |3  |D   |
|20210210070347     |4                 |1998-04-13            |4  |I   |
|20210210070347     |5                 |2020-01-01            |5  |I   |
|20210210070445     |6                 |1998-04-13            |6  |I   |
+-------------------+------------------+----------------------+---+----+
```

After an upsert, the read optimized query returns records from both commits (C1 and C2).

Also, I don't find any log files in the partitions; all of them are parquet files.

 

ls /tmp/hudi_trips_cow/1998-04-13/
0d1e6a84-d036-42e9-806e-a3075b6bc677-0_1-23-12025_20210210065058.parquet
0d1e6a84-d036-42e9-806e-a3075b6bc677-0_1-61-25595_20210210065127.parquet

ls /tmp/hudi_trips_cow/1970-01-01/
7b836833-a656-485d-967a-871bdc653dc3-0_2-61-25596_20210210065127.parquet
7b836833-a656-485d-967a-871bdc653dc3-0_3-23-12027_20210210065058.parquet

 

Source of the issue: [https://github.com/apache/hudi/issues/2255]

 

 


> MOR fetches all records for read optimized query w/ spark sql
> -------------------------------------------------------------
>
>                 Key: HUDI-1608
>                 URL: https://issues.apache.org/jira/browse/HUDI-1608
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: Spark Integration
>    Affects Versions: 0.7.0
>            Reporter: sivabalan narayanan
>            Priority: Major
>              Labels: sev:critical, user-support-issues
>
> Script to reproduce in local spark:
>  
> [https://gist.github.com/nsivabalan/7250b794788516f1aec35650c2632364]
>  
> ```
> scala> spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, id, __op from hudi_trips_snapshot order by _hoodie_record_key").show(false)
> +-------------------+------------------+----------------------+---+----+
> |_hoodie_commit_time|_hoodie_record_key|_hoodie_partition_path|id |__op|
> +-------------------+------------------+----------------------+---+----+
> |20210210070347     |1                 |1970-01-01            |1  |null|
> |20210210070347     |2                 |1970-01-01            |2  |null|
> |20210210070347     |3                 |2020-01-04            |3  |D   |
> |20210210070347     |4                 |1998-04-13            |4  |I   |
> |20210210070347     |5                 |2020-01-01            |5  |I   |
> |20210210070445     |6                 |1998-04-13            |6  |I   |
> +-------------------+------------------+----------------------+---+----+
> ```
> After an upsert, the read optimized query returns records from both commits (C1 and C2).
> Also, I don't find any log files in the partitions; all of them are parquet
> files.
>  
> ls /tmp/hudi_trips_cow/1998-04-13/
> 0d1e6a84-d036-42e9-806e-a3075b6bc677-0_1-23-12025_20210210065058.parquet
> 0d1e6a84-d036-42e9-806e-a3075b6bc677-0_1-61-25595_20210210065127.parquet
> ls /tmp/hudi_trips_cow/1970-01-01/
> 7b836833-a656-485d-967a-871bdc653dc3-0_2-61-25596_20210210065127.parquet
> 7b836833-a656-485d-967a-871bdc653dc3-0_3-23-12027_20210210065058.parquet
>  
> Source of the issue: [https://github.com/apache/hudi/issues/2255]
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
