[I] [CH] Optimize table error with spark standalone cluster [incubator-gluten]

via GitHub Wed, 07 Aug 2024 19:45:21 -0700


lwz9103 opened a new issue, #6750:
URL: https://github.com/apache/incubator-gluten/issues/6750


   ### Backend
   
   CH (ClickHouse)
   
   ### Bug description
   
   ### SQL to reproduce:
   ```
   create database if not exists local;
   use local;
   CREATE EXTERNAL TABLE customer  (
    c_custkey    bigint not null,
    c_name       string not null,
    c_address    string not null,
    c_nationkey  bigint not null,
    c_phone      string not null,
    c_acctbal    double not null,
    c_mktsegment string not null,
    c_comment    string not null)
    USING PARQUET
   LOCATION 'file:///data/tpch100/customer';
   
   create database if not exists s3;
   use s3;
   CREATE EXTERNAL TABLE customer  (
   c_custkey    bigint not null ,
   c_name       string not null ,
   c_address    string not null ,
   c_nationkey  bigint not null ,
   c_phone      string not null ,
   c_acctbal    double not null ,
   c_mktsegment string not null ,
   c_comment    string not null )
   USING clickhouse
   CLUSTERED by (c_custkey) SORTED by  (c_custkey) INTO 45 BUCKETS
   TBLPROPERTIES (delta.checkpointInterval=5, storage_policy='__s3_main')
   LOCATION 's3a://gluten-cicd/dataset/tpch100-mergetree-bucket-compact/customer';
   
   insert into s3.customer select * from local.customer order by c_custkey;
   optimize s3.customer;
   ```
   
   ### Error message
   
![image](https://github.com/user-attachments/assets/3690d77c-472e-4753-97f4-0be2d0957ac8)
   
   
   ### Debug info:
   
![AlcypYto2Y](https://github.com/user-attachments/assets/faa8910b-cd4f-4672-babf-676f687df863)
   
![img_v3_02dg_e67f2c4c-7df7-4885-934b-d9ddabfb67bg](https://github.com/user-attachments/assets/c452f50f-0e8a-4602-a601-fc87f76651d4)
   
   ### Root Cause
   After data is inserted into s3.customer, each Spark executor node keeps only 
part of the file mapping metadata, not all of it. If a merge task contains a 
part whose file mapping does not exist on the current node, the error occurs.
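
The failure mode described above can be illustrated with a toy Python model. This is only a sketch under assumptions: the `Executor` class, its methods, and the part names are hypothetical and are not Gluten or ClickHouse APIs, and it assumes each node records mappings only for the parts it wrote itself.

```python
# Toy model only: all names are hypothetical, not Gluten or ClickHouse APIs.
from typing import Dict, List, Set


class Executor:
    """A node that knows file mappings only for the parts it wrote (assumption)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.file_mappings: Dict[str, str] = {}

    def write_part(self, part: str) -> None:
        # The insert records the mapping on the writing node only.
        self.file_mappings[part] = f"s3-object-for-{part}"

    def missing_parts(self, parts: List[str]) -> Set[str]:
        # Parts of the merge task that this node has no mapping for.
        return {p for p in parts if p not in self.file_mappings}

    def merge(self, parts: List[str]) -> str:
        missing = self.missing_parts(parts)
        if missing:
            raise KeyError(f"{self.name} has no file mapping for {sorted(missing)}")
        return "merged(" + ",".join(parts) + ")"


def run_scenario() -> str:
    # Two executors each write one part during the insert.
    a, b = Executor("executor-a"), Executor("executor-b")
    a.write_part("part_0")
    b.write_part("part_1")
    try:
        # The merge task lands on executor-a but also covers executor-b's part.
        return a.merge(["part_0", "part_1"])
    except KeyError as err:
        return f"error: {err.args[0]}"
```

In this model the merge succeeds only when every part in the task was written on the node that runs it, which is why a single-node setup would not show the problem while a standalone cluster with several executors can.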
   
   
   ### Spark version
   
   None
   
   ### Spark configurations
   
   _No response_
   
   ### System information
   
   _No response_
   
   ### Relevant logs
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

