moranrr opened a new issue, #5773: URL: https://github.com/apache/paimon/issues/5773
### Search before asking

- [x] I searched in the [issues](https://github.com/apache/paimon/issues) and found nothing similar.

### Paimon version

paimon 1.1.1

### Compute Engine

Flink 1.18.1, Spark 3.5.1

### Minimal reproduce step

Flink writes data to table T_1 in real time. Table T_1 is configured with `'bucket' = '80'` and `'snapshot.time-retained' = '12 h'`. Each partition holds roughly 6 billion records per day, and the data is distributed fairly evenly. I then run the following Spark SQL to count the records in each partition (`dt` is the partition field):

```
select dt, count(1) from paimon.test.data_detail group by dt
```

1. With `spark.driver.memory` set to 6G, the query fails with:

   **Caused by: java.lang.OutOfMemoryError: Java heap space**

   Why does counting records grouped by the date partition field need so much memory?

2. If I increase the Spark driver memory, the query becomes extremely slow; after more than ten minutes, sometimes half an hour, it fails with:

   ```
   Caused by: java.lang.ClassCastException: org.apache.paimon.data.BinaryString cannot be cast to java.lang.Integer
   ```

### What doesn't meet your expectations?

1. Why does simply counting the rows per partition require so much memory?
2. Why does a type conversion error occur?

### Anything else?

_No response_

### Are you willing to submit a PR?

- [ ] I'm willing to submit a PR!
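For reference, the table setup described in the reproduce step roughly corresponds to a DDL like the sketch below. The column names, types, and bucket key are assumptions for illustration only; just the `bucket` and `snapshot.time-retained` values come from the report above.

```
-- Hypothetical Flink SQL DDL approximating the table described above;
-- column names, types, and the bucket key are assumptions, not taken from the real job.
CREATE TABLE paimon.test.data_detail (
    id      BIGINT,
    payload STRING,
    dt      STRING
) PARTITIONED BY (dt) WITH (
    'bucket' = '80',                     -- value from the report
    'bucket-key' = 'id',                 -- assumed; a fixed-bucket append table needs a bucket key
    'snapshot.time-retained' = '12 h'    -- value from the report
);
```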