GitHub user xuchuanyin opened a pull request:

    https://github.com/apache/carbondata/pull/1217

    [CARBONDATA-1345] Update tablemeta cache after table schema has been changed

    # Scenario
    
    ## Steps to reproduce
    
    Start two spark-beeline instances as two different sessions, then perform the following steps in the corresponding sessions:
    
    (SESSION1):
    
    1. create table T_Carbn01(Active_status String,Item_type_cd INT,Qty_day_avg INT,Qty_total INT,Sell_price BIGINT,Sell_pricep DOUBLE,Discount_price DOUBLE,Profit DECIMAL(3,2),Item_code String,Item_name String,Outlet_name String,Update_time TIMESTAMP,Create_date String) STORED BY 'org.apache.carbondata.format' TBLPROPERTIES('table_blocksize'='128');
    
    2. LOAD DATA INPATH 'hdfs://hacluster/user/Ram/T_Hive1.csv' INTO table T_Carbn01 options ('DELIMITER'=',', 'QUOTECHAR'='\','BAD_RECORDS_LOGGER_ENABLE'='true', 'BAD_RECORDS_ACTION'='REDIRECT', 'FILEHEADER'='Active_status,Item_type_cd,Qty_day_avg,Qty_total,Sell_price,Sell_pricep,Discount_price,Profit,Item_code,Item_name,Outlet_name,Update_time,Create_date');
    
    (SESSION2):
    
    1. update t_carbn01 set(Active_status) = ('TRUE') where Item_type_cd = 41;
    
    (SESSION1):
    
    1. Drop table t_carbn01;
    
    2. create table T_Carbn01(Active_status String,Item_type_cd INT,Qty_day_avg INT,Qty_total INT,Sell_price BIGINT,Sell_pricep DOUBLE,Discount_price DOUBLE,Profit DECIMAL(3,2),Item_code String,Item_name String,Outlet_name String,Update_time TIMESTAMP,Create_date String) STORED BY 'org.apache.carbondata.format' TBLPROPERTIES('table_blocksize'='128');
    
    3. LOAD DATA INPATH 'hdfs://hacluster/user/Ram/T_Hive1.csv' INTO table T_Carbn01 options ('DELIMITER'=',', 'QUOTECHAR'='\','BAD_RECORDS_LOGGER_ENABLE'='true', 'BAD_RECORDS_ACTION'='REDIRECT', 'FILEHEADER'='Active_status,Item_type_cd,Qty_day_avg,Qty_total,Sell_price,Sell_pricep,Discount_price,Profit,Item_code,Item_name,Outlet_name,Update_time,Create_date');
    
    (SESSION2):
    
    1. update t_carbn01 set(Active_status) = ('TRUE') where Item_type_cd = 41;
    
    ## Outputs
    
    The error message is as follows:
    
    ```
    Error: java.lang.RuntimeException: Update operation failed. Job aborted due to stage failure: Task 0 in stage 14.0 failed 4 times, most recent failure: Lost task 0.3 in stage 14.0 (TID 29, master, executor 2): java.io.IOException: java.io.IOException: Dictionary file does not exist: hdfs://user/hive/warehouse/carbon.store/default/t_carbn01/Metadata/ddfb3bc8-2fea-41fe-a4ff-18588df41aec.dictmeta
        at org.apache.carbondata.core.cache.dictionary.ForwardDictionaryCache.getAll(ForwardDictionaryCache.java:146)
        at org.apache.spark.sql.DictionaryLoader.loadDictionary(CarbonDictionaryDecoder.scala:686)
        at org.apache.spark.sql.DictionaryLoader.getDictionary(CarbonDictionaryDecoder.scala:703)
        at org.apache.spark.sql.ForwardDictionaryWrapper.getDictionaryValueForKeyInBytes(CarbonDictionaryDecoder.scala:654)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:378)
        at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:132)
        at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032)
        at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972)
        at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032)
        at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:715)
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    ```
    
    ## Input data
    Sample input data:
    ```
    TRUE,2,423,3046340,200000000003454300, 121.5,4.99,2.44,SE3423ee,asfdsffdfg,EtryTRWT,2012-01-12 03:14:05.123456729,2012-01-20
    TRUE,3,453,3003445,200000000000003450, 121.5,4.99,2.44,SE3423ee,asfdsffdfg,ERTEerWT,2012-01-13 03:24:05.123456739,2012-01-20
    TRUE,4,4350,3044364,200000000000000000, 121.5,4.99,2.44,SE3423ee,asfdsffdfg,ERTtryWT,2012-01-14 23:03:05.123456749,2012-01-20
    TRUE,114,4520,30000430,200000000004300000, 121.5,4.99,2.44,RE3423ee,asfdsffdfg,4RTETRWT,2012-01-01 23:02:05.123456819,2012-01-20
    FALSE,123,454,30000040,200000000000000000, 121.5,4.99,2.44,RE3423ee,asfrewerfg,6RTETRWT,2012-01-02 23:04:05.123456829,2012-01-20
    TRUE,11,4530,3000040,200000000000000000, 121.5,4.99,2.44,SE3423ee,asfdsffder,TRTETRWT,2012-01-03 05:04:05.123456839,2012-01-20
    TRUE,14,4590,3000400,200000000000000000, 121.5,4.99,2.44,ASD423ee,asfertfdfg,HRTETRWT,2012-01-04 05:06:05.123456849,2012-01-20
    FALSE,41,4250,00000,200000000000000000, 121.5,4.99,2.44,SAD423ee,asrtsffdfg,HRTETRWT,2012-01-05 05:07:05.123456859,2012-01-20
    TRUE,13,4510,30400,200000000000000000, 121.5,4.99,2.44,DE3423ee,asfrtffdfg,YHTETRWT,2012-01-06 06:08:05.123456869,2012-01-20
    ```
    
    
    # Analysis
    
    The error message says the dictmeta file does not exist.
    This file is generated by the first load operation in SESSION1, and the tablemeta referring to it is cached in SESSION2 when the update operation runs there. After the DROP-CREATE-LOAD sequence in SESSION1, the old dictionary files are deleted and new dictionary files are generated. But the next update operation in SESSION2 still uses the outdated tablemeta from its cache, which points to the deleted dictmeta file, causing the error.
    
    To solve this problem, we should refresh the tableMeta cache when the corresponding table schema has been updated.
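    To make the failure mode concrete, here is a minimal, self-contained sketch in plain Scala. The names (`SessionMetaCache`, `StaleCacheDemo`, the `dictmeta-v1`/`dictmeta-v2` paths) are illustrative only; it models the caching behaviour, not the actual CarbonData classes:
    
    ```
    import scala.collection.mutable
    
    // Hypothetical stand-in for a per-session tablemeta cache. The dictionary
    // files live in shared storage (HDFS), so both sessions see the same files
    // but NOT the same cache.
    class SessionMetaCache {
      private val cache = mutable.Map.empty[String, String] // table -> dictmeta path
      def tableMeta(table: String, lookup: () => String): String =
        cache.getOrElseUpdate(table, lookup()) // never refreshed: the bug
    }
    
    object StaleCacheDemo extends App {
      // SESSION1: CREATE + LOAD writes the first dictionary metadata file.
      var dictmetaOnDisk = "Metadata/dictmeta-v1"
    
      // SESSION2: the first UPDATE caches the tablemeta and its dictmeta path.
      val session2 = new SessionMetaCache
      session2.tableMeta("t_carbn01", () => dictmetaOnDisk)
    
      // SESSION1: DROP + CREATE + LOAD deletes v1 and writes a new file.
      dictmetaOnDisk = "Metadata/dictmeta-v2"
    
      // SESSION2: the second UPDATE still resolves the old path, which no longer
      // exists on HDFS -> "Dictionary file does not exist" IOException.
      assert(session2.tableMeta("t_carbn01", () => dictmetaOnDisk) == "Metadata/dictmeta-v1")
    }
    ```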
    
    
    # Solution
    
    Refresh the tablemeta cache when the table schema has changed.
    
    Since HiveSessionState.lookupRelation is slow (especially in concurrent query scenarios), do not call this method when the table schema has not changed. A sketch of this check follows.
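    A rough sketch of the intended check, again in plain Scala with illustrative names (`RefreshingMetaCache`, `schemaTimeOnDisk`; the real change lives in CarbonData's metastore code): compare the schema's modification time on disk with the cached one, and pay the lookupRelation cost only when they differ.
    
    ```
    import scala.collection.concurrent.TrieMap
    
    // Illustrative types only; not the actual CarbonData/Spark classes.
    final case class TableMeta(schemaModifiedTime: Long, dictmetaPath: String)
    
    class RefreshingMetaCache(
        schemaTimeOnDisk: String => Long,   // e.g. mtime of the schema file on HDFS
        slowLookup: String => TableMeta) {  // stands in for HiveSessionState.lookupRelation
    
      private val cache = TrieMap.empty[String, TableMeta]
    
      def tableMeta(table: String): TableMeta = {
        val diskTime = schemaTimeOnDisk(table)
        cache.get(table) match {
          // Fast path: schema unchanged since caching, skip the slow lookup.
          case Some(meta) if meta.schemaModifiedTime == diskTime => meta
          // Schema changed (e.g. DROP + CREATE in another session): refresh.
          case _ =>
            val fresh = slowLookup(table)
            cache.put(table, fresh)
            fresh
        }
      }
    }
    ```
    
    With a check like this, the second UPDATE in SESSION2 sees the newer schema timestamp, refreshes its tablemeta, and resolves the new dictionary files instead of the deleted ones, while unchanged tables keep hitting the cache.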
    
    # Notes
    
    I have tested this scenario in my environment and it works as expected.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/xuchuanyin/carbondata carbondata-1345

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/1217.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1217
    
----
commit 14b1f5c91291a3eefc2bb4d978b044fd9682141d
Author: xuchuanyin <[email protected]>
Date:   2017-07-31T04:06:51Z

    Fix bugs when alter table in different session

----

