Benjamin BONNET created SPARK-16996:
---------------------------------------
Summary: Hive ACID delta files not seen
Key: SPARK-16996
URL: https://issues.apache.org/jira/browse/SPARK-16996
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.5.2
Environment: Hive 1.2.1, Spark 1.5.2
Reporter: Benjamin BONNET
Priority: Critical
spark-sql seems not to see data stored as delta files in an ACID Hive table.
I encountered the same problem as described here:
http://stackoverflow.com/questions/35955666/spark-sql-is-not-returning-records-for-hive-transactional-tables-on-hdp
For example, create an ACID table with HiveCLI and insert a row :
{code}
set hive.support.concurrency=true;
set hive.enforce.bucketing=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.compactor.initiator.on=true;
set hive.compactor.worker.threads=1;
CREATE TABLE deltas(cle string,valeur string) CLUSTERED BY (cle) INTO 1 BUCKETS
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
TBLPROPERTIES ('transactional'='true');
INSERT INTO deltas VALUES("a","a");
{code}
Then run a query from the spark-sql CLI:
{code}
SELECT * FROM deltas;
{code}
That query returns no rows, and there are no errors in the logs.
If you inspect the table files in HDFS, you find only a delta directory:
{code}
~>hdfs dfs -ls /apps/hive/warehouse/deltas
Found 1 items
drwxr-x--- - me hdfs 0 2016-08-10 14:03 /apps/hive/warehouse/deltas/delta_0020943_0020943
{code}
Then if you run compaction on that table (in HiveCLI) :
{code}
ALTER TABLE deltas COMPACT 'MAJOR';
{code}
As a result, the delta is compacted into a base file:
{code}
~>hdfs dfs -ls /apps/hive/warehouse/deltas
Found 1 items
drwxrwxrwx - me hdfs 0 2016-08-10 15:25 /apps/hive/warehouse/deltas/base_0020943
{code}
Go back to spark-sql and the same query gets a result :
{code}
SELECT * FROM deltas;
a a
Time taken: 0.477 seconds, Fetched 1 row(s)
{code}
But the next time you insert into the Hive table:
{code}
INSERT INTO deltas VALUES("b","b");
{code}
spark-sql will immediately see changes :
{code}
SELECT * FROM deltas;
a a
b b
Time taken: 0.122 seconds, Fetched 2 row(s)
{code}
No further compaction has run, yet spark-sql now "sees" both the base AND the
delta directory:
{code}
~> hdfs dfs -ls /apps/hive/warehouse/deltas
Found 2 items
drwxrwxrwx - valdata hdfs 0 2016-08-10 15:25 /apps/hive/warehouse/deltas/base_0020943
drwxr-x--- - valdata hdfs 0 2016-08-10 15:31 /apps/hive/warehouse/deltas/delta_0020956_0020956
{code}
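The pattern in the listings above is that spark-sql returns nothing exactly when the table directory holds delta directories but no base directory. A minimal shell sketch for spotting that state is given here. The function name `acid_layout` is hypothetical, and it only looks at directory names, assuming the `base_...` / `delta_...` naming shown in this report; it does not read any ORC data.

```shell
# Hypothetical helper: classify an ACID table directory by entry names only.
# Reads paths on stdin (one per line) and prints one of:
#   empty, deltas-only, base-only, base+deltas
acid_layout() {
  local base=0 delta=0 path name
  while IFS= read -r path; do
    name=${path##*/}              # keep the last path component
    case "$name" in
      base_*)  base=$((base + 1)) ;;
      delta_*) delta=$((delta + 1)) ;;
    esac
  done
  if [ "$base" -eq 0 ] && [ "$delta" -eq 0 ]; then
    echo "empty"
  elif [ "$base" -eq 0 ]; then
    echo "deltas-only"            # the state in which the query returned no rows
  elif [ "$delta" -eq 0 ]; then
    echo "base-only"
  else
    echo "base+deltas"
  fi
}
```

It could be fed from the same listing command used above, taking the last field of each line, for example `hdfs dfs -ls /apps/hive/warehouse/deltas | awk '{print $NF}' | acid_layout`. Lines such as "Found 1 items" match neither prefix and are ignored.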