Daniel Harper created SPARK-13000:
-------------------------------------
Summary: Corrupted results when using LIMIT clause via JDBC
connections to ThriftServer
Key: SPARK-13000
URL: https://issues.apache.org/jira/browse/SPARK-13000
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.5.2
Environment: Amazon EMR AMI 4.2.0
Spark 1.5.2
Reporter: Daniel Harper
h2. Steps to reproduce
#. Create the table in Hive (see the definition under Resources)
#. Insert some data (at least 2 rows)
#. Start the Thrift server
#. Connect to the Thrift server via {{beeline}} or a custom application via JDBC
#. Run the query {{select * from logs_table limit 1}}
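For the "custom application via JDBC" path, a minimal sketch of a client that runs the same query and lists the columns that come back blank might look like the following. The class name {{LimitRepro}} and the helper {{blankColumns}} are hypothetical names chosen for illustration. The sketch assumes the Hive JDBC driver jar is on the classpath and that the Thrift server accepts an empty username and password, as in the {{beeline}} session in this report.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical repro client; not part of Spark or Hive.
final class LimitRepro {

    // Returns the names of the columns whose value is null or whitespace only.
    public static List<String> blankColumns(String[] names, String[] values) {
        List<String> blank = new ArrayList<>();
        for (int i = 0; i < names.length; i++) {
            if (values[i] == null || values[i].trim().isEmpty()) {
                blank.add(names[i]);
            }
        }
        return blank;
    }

    // Runs the query over JDBC and prints every row plus its blank columns.
    public static void run(String url, String query) throws SQLException {
        // Assumption: empty username and password are accepted.
        try (Connection conn = DriverManager.getConnection(url, "", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(query)) {
            ResultSetMetaData md = rs.getMetaData();
            int n = md.getColumnCount();
            String[] names = new String[n];
            for (int i = 0; i < n; i++) {
                names[i] = md.getColumnLabel(i + 1); // JDBC columns are 1-based
            }
            while (rs.next()) {
                String[] values = new String[n];
                for (int i = 0; i < n; i++) {
                    values[i] = rs.getString(i + 1);
                }
                System.out.println("row:   " + Arrays.toString(values));
                System.out.println("blank: " + blankColumns(names, values));
            }
        }
    }

    public static void main(String[] args) throws SQLException {
        if (args.length == 0) {
            System.out.println("Usage: java LimitRepro <jdbc-url>");
            return;
        }
        // Example URL from this report: jdbc:hive2://localhost:10001/default
        run(args[0], "select * from logs_table limit 1");
    }
}
```

If the problem is in the Thrift server rather than in {{beeline}}, this client would presumably also report {{service}}, {{yyyy}}, {{mm}} and {{dd}} as blank. That has not been verified here.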
h2. Detailed description
We're seeing corrupted results for the following query when it is executed via
JDBC connections to the Thrift server:
{code}
select * from logs_table limit 1;
{code}
We've tried this using {{beeline}}. As the output shows, only {{ts}} is
populated; {{service}} and the partition columns ({{yyyy}}, {{mm}}, {{dd}}) are blank:
{code}
[hadoop@ip-x ~]$ beeline
Beeline version 1.0.0-amzn-1 by Apache Hive
beeline> !connect jdbc:hive2://localhost:10001/default
scan complete in 5ms
Connecting to jdbc:hive2://localhost:10001/default
Enter username for jdbc:hive2://localhost:10001/default:
Enter password for jdbc:hive2://localhost:10001/default:
Connected to: Spark SQL (version 1.5.2)
Driver: Hive JDBC (version 1.0.0-amzn-1)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10001/default> select * from logs_table limit 1;
+------------------------+----------+-------+-----+-----+--+
|           ts           | service  | yyyy  | mm  | dd  |
+------------------------+----------+-------+-----+-----+--+
| 2016-01-24 23:23:24.0  |          |       |     |     |
+------------------------+----------+-------+-----+-----+--+
1 row selected (9.182 seconds)
{code}
Without the {{LIMIT 1}} clause, the query returns the full dataset and all
columns are populated:
{code}
0: jdbc:hive2://localhost:10001/default> select * from logs_table;
+------------------------+----------+-------+-----+-----+--+
|           ts           | service  | yyyy  | mm  | dd  |
+------------------------+----------+-------+-----+-----+--+
| 2016-01-24 23:23:24.0  |service_1 | 2016  | 01  | 24  |
| 2016-01-24 23:29:24.0  |service_4 | 2016  | 01  | 24  |
+------------------------+----------+-------+-----+-----+--+
2 rows selected (10.956 seconds)
{code}
I also ran the same query, {{select * from logs_table limit 1}}, via
* {{spark-sql}}
* {{spark-shell}}
Both returned the expected result: a single row with all columns populated.
This leads me to believe this is an issue with the Thrift Server or Hive JDBC
driver.
We are starting the thrift server as follows:
{code}
sudo /usr/lib/spark/sbin/start-thriftserver.sh \
  --hiveconf hive.server2.thrift.port=10001 \
  --num-executors 1 --executor-cores 5 --executor-memory 38G \
  --conf spark.scheduler.mode=FAIR \
  --conf spark.sql.thriftserver.scheduler.pool=default \
  --driver-memory 10G
{code}
h2. Resources
The Hive table is defined as follows:
{code}
CREATE EXTERNAL TABLE IF NOT EXISTS logs_table (
ts STRING,
service STRING
)
COMMENT 'logs table'
PARTITIONED BY (yyyy STRING, mm STRING, dd STRING)
STORED AS TEXTFILE
LOCATION 's3://data-lake/structured/';
{code}