[I] [Spark|Question] Partition limit query comparison of paimon primary key table and non-Paimon table [paimon]

via GitHub Thu, 04 Jul 2024 18:58:44 -0700


MOBIN-F opened a new issue, #3676:
URL: https://github.com/apache/paimon/issues/3676


   ### Search before asking
   
   - [X] I searched in the [issues](https://github.com/apache/paimon/issues) 
and found nothing similar.
   
   
   ### Paimon version
   
   paimon-spark-3.3-0.8
   
   ### Compute Engine
   
   Spark 3.3.2
   
   ### Minimal reproduce step
   
   none
   
   ### What doesn't meet your expectations?
   
   We have a Paimon primary key table and a non-Paimon table that hold the same data. For the query [where pt=20240530 limit 10], the Paimon primary key table is much slower than the non-Paimon table.
   **paimon-pk table:**
   Paimon TBLPROPERTIES:
   ```
     "options" : {
       "bucket" : "1",
       "num-sorted-run.stop-trigger" : "2147483647",
       "changelog-producer" : "none",
       "snapshot.num-retained.max" : "3",
       "snapshot.num-retained.min" : "1",
       "sink.parallelism" : "5",
       "deletion-vectors.enabled" : "true",
       "compaction.optimization-interval" : "10",
       "sort-spill-threshold" : "10"
     },
   ```
   [select * from paimon_catalog.rt_ods.paimon_xxxx_d where pt=20240530 limit 10]
   
![image](https://github.com/apache/paimon/assets/11769953/83b1e0f2-f0fe-4cd3-a76a-11ed3333c445)
   
![image](https://github.com/apache/paimon/assets/11769953/f0c42762-17cf-40d3-a908-d3adb35b6526)
   
![image](https://github.com/apache/paimon/assets/11769953/81a63f75-ddee-42e8-8ee8-22e0b4b30e02)
   
   ```
   $hadoop fs -ls -t -r /warehouse/rt_ods/paimon_uoc_order_main_d/pt=20240530/*
   Found 5 items
   -rwxrwx--x+  2 hive supergroup  134829818 2024-07-04 23:53 /warehouse/rt_ods/paimon_xxxx_d/pt=20240530/bucket-0/data-c7aa5c7d-0f77-4567-abc3-066bdbd677b1-2.orc
   -rwxrwx--x+  2 hive supergroup  135744147 2024-07-05 08:03 /warehouse/rt_ods/paimon_xxxx_d/pt=20240530/bucket-0/data-c7aa5c7d-0f77-4567-abc3-066bdbd677b1-11.orc
   -rwxrwx--x+  2 hive supergroup  134621349 2024-07-05 08:50 /warehouse/rt_ods/paimon_xxxx_d/pt=20240530/bucket-0/data-c7aa5c7d-0f77-4567-abc3-066bdbd677b1-12.orc
   -rwxrwx--x+  2 hive supergroup   49324398 2024-07-05 08:50 /warehouse/rt_ods/paimon_xxxx_d/pt=20240530/bucket-0/data-c7aa5c7d-0f77-4567-abc3-066bdbd677b1-13.orc
   -rwxrwx--x+  2 hive supergroup       8976 2024-07-05 09:26 /warehouse/rt_ods/paimon_xxxx_d/pt=20240530/bucket-0/data-39bcc9ec-bdac-4e1a-96bb-1e8a2ec96b3d-10.orc
   ```
   ```
   $hadoop fs -du -s -h /warehouse/rt_ods/paimon_xxxx_d/pt=20240530/
   433.5 M  866.9 M  /warehouse/rt_ods/paimon_xxxx_d/pt=20240530
   ```
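As a quick sanity check on the two listings above, the five file sizes can be summed and compared with the `-du` output. This is only an arithmetic sketch: the byte counts are copied from the `-ls` listing, the replication factor of 2 is read from its second column, and the variable names are made up for illustration.

```python
# Sizes in bytes, copied from the `hadoop fs -ls` listing above.
file_sizes = [134829818, 135744147, 134621349, 49324398, 8976]

# Replication factor shown in the second column of the listing.
replication = 2

total_bytes = sum(file_sizes)
total_mib = total_bytes / 1024**2          # `-du -h` reports binary units
replicated_mib = total_mib * replication   # raw disk usage across replicas

print(f"{total_bytes} bytes = {total_mib:.1f} M, {replicated_mib:.1f} M with replication")
# prints: 454528688 bytes = 433.5 M, 866.9 M with replication
```

The result matches the `433.5 M  866.9 M` line, so the five ORC files in `bucket-0` account for the whole partition.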
   Result of count(1) where pt=20240530:
   
![image](https://github.com/apache/paimon/assets/11769953/38b0435b-5913-4d28-9264-632dfe490531)
   
   **non-Paimon table (parquet format):**
   [select * from dw_ods.tdb_xxxx_d where pt=20240530 limit 10]
   
![image](https://github.com/apache/paimon/assets/11769953/03430d9b-2e1b-419d-9eda-738a8675af83)
   
![image](https://github.com/apache/paimon/assets/11769953/fc6d1e5d-72e7-4d8c-b454-2eb3627bd1f4)
   
   ```
   $hadoop fs -du -s -h /ods/tdb_xxxx_d/pt=20240530
   846.4 M  1.7 G  /ods/tdb_xxxx_d/pt=20240530
   ```
   
   Although the file size and number of rows are similar, the limit query on the Paimon table appears to be much slower than on the non-Paimon table, as if the limit were not taking effect. Is this expected?
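
One way to check whether the limit reaches the scan would be to compare the physical plans of the two queries. The following is a minimal PySpark sketch, not a confirmed diagnosis: it assumes an existing `SparkSession` in which both tables are visible, and the helper names `limit_query` and `show_plans` are hypothetical. The table names and the partition value are the ones used in this report.

```python
def limit_query(table: str, pt: int = 20240530, n: int = 10) -> str:
    """Build the partition-filtered limit query used in this report."""
    return f"select * from {table} where pt={pt} limit {n}"


# Table names as they appear in the queries above.
TABLES = [
    "paimon_catalog.rt_ods.paimon_xxxx_d",  # Paimon primary key table
    "dw_ods.tdb_xxxx_d",                    # non-Paimon parquet table
]


def show_plans(spark) -> None:
    """Print the formatted physical plan of the limit query for each table.

    `spark` is assumed to be an already configured SparkSession.
    """
    for table in TABLES:
        print(f"== {table} ==")
        spark.sql(limit_query(table)).explain(mode="formatted")
```

Calling `show_plans(spark)` in a `pyspark` shell prints both plans side by side, which makes it easier to see how each scan is planned and where the limit is applied.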
   
   
   ### Anything else?
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [ ] I'm willing to submit a PR!


