wuchang created SPARK-19647:
-------------------------------
Summary: Spark query on Hive is extremely slow even when the result data
is small
Key: SPARK-19647
URL: https://issues.apache.org/jira/browse/SPARK-19647
Project: Spark
Issue Type: Question
Components: PySpark
Affects Versions: 2.0.2
Reporter: wuchang
Priority: Critical
I am using Spark 2.0.0 to query a Hive table.
My SQL is:
select * from app.abtestmsg_v limit 10
That is, I want to get the first 10 records from the view app.abtestmsg_v.
When I run this SQL in spark-shell, it is very fast and takes about 2 seconds.
The problem appears when I run the same query from my Python code.
I am using Spark 2.0.0 and wrote a very simple PySpark program:
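from pyspark.sql import HiveContext
from pyspark.sql.functions import *
import json
# sc is assumed to be an existing SparkContext (for example, the one the pyspark shell provides)
hc = HiveContext(sc)
hc.setConf("hive.exec.orc.split.strategy", "ETL")
# setConf takes string values; a bare `false` is not a defined name in Python
hc.setConf("hive.security.authorization.enabled", "false")
zj_sql = 'select * from app.abtestmsg_v limit 10'
zj_df = hc.sql(zj_sql)
# collect() returns the result rows to the driver
zj_df.collect()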
From the INFO log, I find that although I use "limit 10" to tell Spark that I
just want the first 10 records, Spark still scans and reads all files of the
view (in my case, the source data of this view contains 100 files and each
file is about 1 GB). So there are nearly 100 tasks, each task reads one
file, and all the tasks are executed serially. It takes nearly 15 minutes to
finish these 100 tasks, but all I want is the first 10 records.
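To put the numbers above in perspective, here is a rough back-of-the-envelope sketch. It uses only the figures from this report (100 files of about 1 GB, about 15 minutes in total, tasks run one after another). The function name and the 1 GB = 1024 MB conversion are assumptions made for illustration, and the throughput is only an approximation, since the on-disk size may differ from what each task actually reads.

```python
# Figures from the report; the GB-to-MB conversion factor is an assumption.
NUM_FILES = 100
FILE_SIZE_GB = 1.0
TOTAL_MINUTES = 15.0


def serial_scan_estimate(num_files, file_size_gb, total_minutes):
    """Return (seconds per task, MB read per second) for a fully serial scan."""
    total_seconds = total_minutes * 60.0
    seconds_per_task = total_seconds / num_files
    mb_per_second = (file_size_gb * 1024.0) / seconds_per_task
    return seconds_per_task, mb_per_second


seconds_per_task, mb_per_second = serial_scan_estimate(
    NUM_FILES, FILE_SIZE_GB, TOTAL_MINUTES
)
print("about %.1f s per task, about %.0f MB/s read" % (seconds_per_task, mb_per_second))
```

So each task takes roughly 9 seconds, which suggests the time goes into reading whole files one by one rather than into any single slow task.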
So I don't know what to do or what is wrong.
Could anybody give me some suggestions?
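One way to narrow this down might be to print the query plan in both spark-shell and PySpark and compare where the limit is applied. The helper below is only a sketch: the name explain_and_take is hypothetical, and it assumes an object such as the HiveContext hc from the program above (or a SparkSession) whose sql() method returns a DataFrame. It relies on DataFrame.explain and DataFrame.take; whether take(10) avoids the full scan in this setup is not confirmed by anything in this report.

```python
def explain_and_take(sql_context, query, n=10):
    """Print the extended plan for `query`, then fetch at most `n` rows.

    `sql_context` is any object with a sql(query) method that returns a
    DataFrame, e.g. a HiveContext or a SparkSession (assumed, not created here).
    """
    df = sql_context.sql(query)
    # explain(True) prints the logical plans as well as the physical plan
    df.explain(True)
    # take(n) returns a list with at most n rows
    return df.take(n)
```

With the context from the program above, this could be called as explain_and_take(hc, 'select * from app.abtestmsg_v limit 10'), and the printed physical plan compared with the one shown by spark-shell for the same query.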
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)