loneylee opened a new issue, #7542:
URL: https://github.com/apache/incubator-gluten/issues/7542

   ### Backend
   
   CH (ClickHouse)
   
   ### Bug description
   
   The CUSTOMER table is created with the following statement, and a query is then run against it so that the table is fully cached by the gluten hdfs cache:
   
   ```
   create table IF NOT EXISTS SSB_STEP.customer_27010 as select * from SSB_STEP.CUSTOMER where C_REGION not in ('EUROPE');
   ```
   Drop the CUSTOMER table, then recreate it with the following statement:
   ```
   create table SSB_STEP.customer_27010 as select * from SSB_STEP.CUSTOMER;
   ```
   Querying CUSTOMER again then returns inconsistent results.
   
   For example, with this SQL:
   ```
   SELECT * FROM CUSTOMER_27010 where C_CUSTKEY=272
   ```
   
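   Root Cause
   The gluten hdfs cache generates its hash key from the file name (the CH approach). The data files written by the two table creations have the same names, so their hash keys are also the same. After the second insert, the cache mistakenly believes that part of the data is already cached and only caches the part it considers missing, so the cached data is wrong.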
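
To make the collision concrete, here is a minimal Python sketch of the behaviour described above. It is not Gluten or ClickHouse code: `SegmentCache`, `name_only_key`, `versioned_key`, the file name, and the 4-byte segment size are all hypothetical, and SHA-256 only stands in for whatever hash the real cache uses. Mixing file size and modification time into the key is shown as one possible way to separate the two files, not as the fix adopted by the project.

```python
import hashlib


def name_only_key(path: str) -> str:
    """Cache key derived from the file path alone (the behaviour described above)."""
    return hashlib.sha256(path.encode()).hexdigest()


def versioned_key(path: str, size: int, mtime_ms: int) -> str:
    """Hypothetical alternative: also mix file size and modification time into the key."""
    return hashlib.sha256(f"{path}:{size}:{mtime_ms}".encode()).hexdigest()


class SegmentCache:
    """Toy segment cache: a segment that is already cached is never refreshed."""

    def __init__(self) -> None:
        self.segments: dict[str, dict[int, bytes]] = {}

    def read(self, key: str, data: bytes, offset: int, length: int) -> bytes:
        cached = self.segments.setdefault(key, {})
        if offset not in cached:
            # Only segments missing from the cache are fetched from the file.
            cached[offset] = data[offset:offset + length]
        return cached[offset]


def read_file(cache: SegmentCache, key: str, data: bytes, seg: int = 4) -> bytes:
    """Read a whole file through the cache, one fixed-size segment at a time."""
    return b"".join(cache.read(key, data, off, seg) for off in range(0, len(data), seg))


path = "part-00000.parquet"  # hypothetical file name reused by both CTAS runs
old_data = b"old-rows"       # file written by the first (filtered) CTAS
new_data = b"NEW-ROWS-ALL"   # file written by the second CTAS, same name

# Name-only key: the first table is fully cached, then the recreated table is read.
cache = SegmentCache()
read_file(cache, name_only_key(path), old_data)
mixed = read_file(cache, name_only_key(path), new_data)
print(mixed)  # b'old-rows-ALL': two stale segments followed by one fresh segment

# Versioned key: different size and mtime give a different key, so nothing stale is reused.
cache = SegmentCache()
read_file(cache, versioned_key(path, len(old_data), 1), old_data)
fresh = read_file(cache, versioned_key(path, len(new_data), 2), new_data)
print(fresh)  # b'NEW-ROWS-ALL'
```

With the name-only key, the second read returns a mix of old and new bytes, which matches the inconsistent query result reported here.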
   
   ### Spark version
   
   None
   
   ### Spark configurations
   
   _No response_
   
   ### System information
   
   _No response_
   
   ### Relevant logs
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

