[ https://issues.apache.org/jira/browse/SPARK-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

patrickliu updated SPARK-4252:
------------------------------
    Description: 
Hive ignores an illegal field value (returning NULL), while SparkSQL tries to
convert it.

Assume I have a text file user.txt with 2 records (userName, age):
Alice,12.4
Bob,13

Then I create a Hive table to query the data:
CREATE TABLE user(
    name string,
    age int  -- Pay attention! The field is int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

LOAD DATA LOCAL INPATH 'user.txt' INTO TABLE user;

Then I use Hive and SparkSQL to query the 'user' table:
SQL: select * from user;

Result by Hive:
Alice NULL (Hive ignores Alice's age because it is a floating-point number)
Bob 13

Result by SparkSQL:
Alice 12 (SparkSQL converts Alice's age from float to int)
Bob 13

So if I run "select sum(age) from user;", I get a different result from each
engine.
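
As a rough illustration only, the two conversion rules and their effect on
sum(age) can be modeled in a few lines of Python. This is a sketch of the
behavior reported above, not code taken from Hive or SparkSQL; the helper
names hive_style_int, spark_style_int and sql_sum are made up for this
example, and it assumes that SUM skips NULL values.

```python
from typing import List, Optional


def hive_style_int(field: str) -> Optional[int]:
    """Strict rule: a field that is not an integer literal becomes NULL (None)."""
    try:
        return int(field.strip())
    except ValueError:
        return None


def spark_style_int(field: str) -> Optional[int]:
    """Lenient rule: read the field as a number, then truncate toward zero."""
    try:
        return int(float(field.strip()))
    except (ValueError, OverflowError):
        return None


def sql_sum(values: List[Optional[int]]) -> Optional[int]:
    """Model of SUM: NULLs are skipped; an all-NULL column sums to NULL."""
    present = [v for v in values if v is not None]
    return sum(present) if present else None


# The two records from user.txt, split on the ',' field delimiter.
rows = [line.split(",") for line in ["Alice,12.4", "Bob,13"]]

hive_ages = [hive_style_int(age) for _, age in rows]
spark_ages = [spark_style_int(age) for _, age in rows]

print(hive_ages, sql_sum(hive_ages))    # [None, 13] 13
print(spark_ages, sql_sum(spark_ages))  # [12, 13] 25
```

Under these assumptions the strict rule gives a sum of 13 and the lenient rule
gives 25, which is the kind of mismatch described here.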

Maybe SparkSQL should be compatible with Hive in this scenario.

  was:
Hive ignores an illegal field value (returning NULL), while SparkSQL tries to
convert it.

Assume I have a text file user.txt with 2 records (userName, age):
Alice,12.4
Bob,13

Then I create a Hive table to query the data:
CREATE TABLE user(
    name string,
    age int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

LOAD DATA LOCAL INPATH 'user.txt' INTO TABLE user;

Then I use Hive and SparkSQL to query the 'user' table:
SQL: select * from user;

Result by Hive:
Alice NULL (Hive ignores Alice's age because it is a floating-point number)
Bob 13

Result by SparkSQL:
Alice 12 (SparkSQL converts Alice's age from float to int)
Bob 13

So if I run "select sum(age) from user;", I get a different result from each
engine.

Maybe SparkSQL should be compatible with Hive in this scenario.


> SparkSQL behaves differently from Hive when encountering illegal record
> -----------------------------------------------------------------------
>
>                 Key: SPARK-4252
>                 URL: https://issues.apache.org/jira/browse/SPARK-4252
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.1.0
>            Reporter: patrickliu
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

