xuchuanyin created CARBONDATA-1700:
--------------------------------------

             Summary: Failed to load data to existed table after spark session 
restarted
                 Key: CARBONDATA-1700
                 URL: https://issues.apache.org/jira/browse/CARBONDATA-1700
             Project: CarbonData
          Issue Type: Bug
          Components: data-load
    Affects Versions: 1.3.0
            Reporter: xuchuanyin
            Assignee: xuchuanyin
             Fix For: 1.3.0


# scenario

I encountered a failure loading data into an existing CarbonData table after
querying the table in a restarted Spark session. I hit this failure in Spark
local mode (found it during a local test) and haven't tested other scenarios.

The problem can be reproduced by following steps:

0. START: start a session;
1. CREATE: create table `t1`;
2. LOAD: create a dataframe and write append to `t1`;
3. STOP: stop current session;

4. START: start a session;
5. QUERY: query table `t1`;  --- This step is essential to reproduce the 
problem.
6. LOAD: create a dataframe and write append to `t1`;  --- This step will fail.

The error is thrown in Step 6. The error message in the console looks like:

```
java.lang.NullPointerException was thrown.
java.lang.NullPointerException
at org.apache.spark.sql.execution.command.management.LoadTableCommand.processData(LoadTableCommand.scala:92)
at org.apache.spark.sql.execution.command.management.LoadTableCommand.run(LoadTableCommand.scala:60)
at org.apache.spark.sql.CarbonDataFrameWriter.loadDataFrame(CarbonDataFrameWriter.scala:141)
at org.apache.spark.sql.CarbonDataFrameWriter.writeToCarbonFile(CarbonDataFrameWriter.scala:50)
at org.apache.spark.sql.CarbonDataFrameWriter.appendToCarbonFile(CarbonDataFrameWriter.scala:42)
at org.apache.spark.sql.CarbonSource.createRelation(CarbonSource.scala:110)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:426)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
```

The following code can be pasted into `TestLoadDataFrame.scala` to reproduce 
this problem. Keep in mind that you must run the first test manually and then 
the second in a separate run (to make sure the SparkSession is restarted).

```
  test("prepare") {
    sql("drop table if exists carbon_stand_alone")
    sql("create table if not exists carbon_stand_alone (c1 string, c2 string, c3 int)" +
      " stored by 'carbondata'").collect()
    sql("select * from carbon_stand_alone").show()
    df.write
      .format("carbondata")
      .option("tableName", "carbon_stand_alone")
      .option("tempCSV", "false")
      .mode(SaveMode.Append)
      .save()
  }

  test("test load dataframe after query") {

    sql("select * from carbon_stand_alone").show()

    // the following line will cause failure
    df.write
      .format("carbondata")
      .option("tableName", "carbon_stand_alone")
      .option("tempCSV", "false")
      .mode(SaveMode.Append)
      .save()

    // if it works fine, the following should be true
    checkAnswer(
      sql("select count(*) from carbon_stand_alone where c3 > 500"), Row(31500 * 2)
    )
  }
```

# ANALYSE
I went through the code and found that the problem is caused by a NULL 
`tableProperties` in `tableMeta.carbonTable.getTableInfo.getFactTable.getTableProperties` 
(we will call it `propertyInTableInfo` for short) at line 89 of 
`LoadTableCommand.scala`.

After debugging, I found that the `propertyInTableInfo` set in 
`CarbonTableInputFormat.setTableInfo(...)` had the correct value, but the one 
returned by `CarbonTableInputFormat.getTableInfo(...)` did not. The setter 
serializes the TableInfo, while the getter deserializes it, which means 
something goes wrong during serialization-deserialization.

Diving deeper into the code, I found that serialization and deserialization of 
`TableSchema`, a member of `TableInfo`, ignore the `tableProperties` member, 
leaving that value empty after deserialization. Since the field is also never 
initialized in the constructor, it remains `NULL` and causes the NPE.

# RESOLVE

1. Initialize `tableProperties` in `TableSchema`
2. Include `tableProperties` in serialization-deserialization of `TableSchema`
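A sketch of what the two steps look like, assuming a Writable-style write/readFields pair. The class, field, and method names here are illustrative, not the actual CarbonData patch:

```java
import java.io.*;
import java.util.*;

// Illustrative fix: initialize the map eagerly and round-trip it explicitly.
class FixedSchemaSketch {
    String tableName;
    // Fix 1: initialize at declaration so the field is never null.
    Map<String, String> properties = new HashMap<>();

    void write(DataOutput out) throws IOException {
        out.writeUTF(tableName);
        // Fix 2: serialize the map as a size-prefixed list of entries.
        out.writeInt(properties.size());
        for (Map.Entry<String, String> e : properties.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeUTF(e.getValue());
        }
    }

    void readFields(DataInput in) throws IOException {
        tableName = in.readUTF();
        // Fix 2 (cont.): rebuild the map from the same wire format.
        properties = new HashMap<>();
        int size = in.readInt();
        for (int i = 0; i < size; i++) {
            properties.put(in.readUTF(), in.readUTF());
        }
    }
}
```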

# Notes

Although the bug has been fixed, I still don't fully understand why the problem 
is triggered in exactly this way.

Tests would need the SparkSession to be restarted, which is currently not 
possible in the test framework, so no tests will be added.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
