[
https://issues.apache.org/jira/browse/CARBONDATA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16146793#comment-16146793
]
Prabhat Kashyap commented on CARBONDATA-1387:
---------------------------------------------
Hello Vandana,
This is not the bug of partition but the problem is data is being inserted into
one table from another. If you insert data from another table then schema of
both table should match otherwise there are chances of getting bad records,
that is what is happening here. The data goes into the bad record (null) and
the select query is showing the different result as expected.
*Possible solution:*
# Throw an exception while inserting data and schema mismatch.
# Continue with the bad record inserted into the table when schema mismatched.
> Incorrect partition creation while inserting data from another table
> --------------------------------------------------------------------
>
> Key: CARBONDATA-1387
> URL: https://issues.apache.org/jira/browse/CARBONDATA-1387
> Project: CarbonData
> Issue Type: Bug
> Components: data-query
> Affects Versions: 1.2.0
> Environment: spark 2.1
> Reporter: Vandana Yadav
> Assignee: Prabhat Kashyap
>
> Incorrect partition creation while inserting data from another table.
> Description: While inserting data from another table no of rows in each
> partition remain same although no of rows in partitioned table get
> increased.(no of rows in each partition should also increase as we are
> inserting new data into the partitioned table.
> Steps to reproduce:
> 1) Create Partitioned table:
> CREATE TABLE uniqdata_part (CUST_ID int,CUST_NAME String,ACTIVE_EMUI_VERSION
> string,DOJ timestamp, BIGINT_COLUMN1 bigint,BIGINT_COLUMN2
> bigint,DECIMAL_COLUMN1 decimal(30,10), DECIMAL_COLUMN2
> decimal(36,10),Double_COLUMN1 double, Double_COLUMN2 double,INTEGER_COLUMN1
> int) PARTITIONED BY (DOB Timestamp) STORED BY 'org.apache.carbondata.format'
> TBLPROPERTIES ('PARTITION_TYPE'='RANGE','RANGE_INFO'='1971-01-01 01:00:03,
> 1972-01-01 01:00:03, 1974-01-01 01:00:03',"TABLE_BLOCKSIZE"= "256 MB")
> 2) Load data into the partitioned table:
> LOAD DATA INPATH 'hdfs://localhost:54310/uniqdata/2000_UniqData.csv' into
> table uniqdata_part OPTIONS('DELIMITER'=',' ,
> 'QUOTECHAR'='"','BAD_RECORDS_ACTION'='FORCE','FILEHEADER'='CUST_ID,CUST_NAME,ACTIVE_EMUI_VERSION,DOB,DOJ,BIGINT_COLUMN1,BIGINT_COLUMN2,DECIMAL_COLUMN1,DECIMAL_COLUMN2,Double_COLUMN1,Double_COLUMN2,INTEGER_COLUMN1')
> 3) Create another table:
> CREATE TABLE uniqdata_1 (CUST_ID int,CUST_NAME String,ACTIVE_EMUI_VERSION
> string, DOB timestamp, DOJ timestamp, BIGINT_COLUMN1 bigint,BIGINT_COLUMN2
> bigint,DECIMAL_COLUMN1 decimal(30,10), DECIMAL_COLUMN2
> decimal(36,10),Double_COLUMN1 double, Double_COLUMN2 double,INTEGER_COLUMN1
> int) STORED BY 'org.apache.carbondata.format' TBLPROPERTIES
> ("TABLE_BLOCKSIZE"= "256 MB")
> 4) Load data into this table:
> LOAD DATA INPATH 'hdfs://localhost:54310/uniqdata/2000_UniqData.csv' into
> table uniqdata_1 OPTIONS('DELIMITER'=',' ,
> 'QUOTECHAR'='"','BAD_RECORDS_ACTION'='FORCE','FILEHEADER'='CUST_ID,CUST_NAME,ACTIVE_EMUI_VERSION,DOB,DOJ,BIGINT_COLUMN1,BIGINT_COLUMN2,DECIMAL_COLUMN1,DECIMAL_COLUMN2,Double_COLUMN1,Double_COLUMN2,INTEGER_COLUMN1')
> 5)Execute Queries:
> a) show partitions query:
> show partitions uniqdata_part
> Output:
> 0, dob = DEFAULT
> 1, dob < 1971-01-01 01:00:03
> 2, 1971-01-01 01:00:03 <= dob < 1972-01-01 01:00:03
> 3, 1972-01-01 01:00:03 <= dob < 1974-01-01 01:00:03
> b) Query for row count in the partitioned table:
> select count(*) from uniqdata_part
> Output:
> count(1) |
> +-----------+--+
> | 2013
> c)query for row count in partition 0:
> select count(*) from uniqdata_part where dob >= '1974-01-01 01:00:03'
> Output:
> count(1) |
> +-----------+--+
> | 539
> d) query for row count in partition 1 :
> select count(*) from uniqdata_part where dob < '1971-01-01 01:00:03'
> Output:
> count(1) |
> +-----------+--+
> | 366
> e) query for row count in partition 3:
> select count(*) from uniqdata_part where dob >= '1971-01-01 01:00:03' and dob
> < '1972-01-01 01:00:03'
> Output:
> count(1) |
> +-----------+--+
> | 365
> f) query for row count in partition 4:
> select count(*) from uniqdata_part where dob >= '1972-01-01 01:00:03' and dob
> < '1974-01-01 01:00:03'
> Output:
> count(1) |
> +-----------+--+
> | 731
> g) Insert data in partitioned table through the normal table:
> insert into uniqdata_part select * from uniqdata_1;
> h) Query for row count in the partitioned table after insertion operation:
> select count(*) from uniqdata_part
> Output:
> count(1) |
> +-----------+--+
> | 4026
> i) Repeat queries from (c) to (f) for row count in the each partition.
> 6) Actual Result: it shows the same row count for each partition but the
> partitioned table has more rows in it.
> 7)Expected Result: No of rows in each partition should increase as no of rows
> increases in partitioned table
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)