[
https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132472#comment-14132472
]
Fi commented on SPARK-2883:
---------------------------
I was able to run a simple query and access my ORC hive table through the
PySpark shell.
Happily, things seemed to work.
However, the problem is I/O efficiency.
Looking at the Spark:4040 UI, it is clear that the Spark is reading the entire
ORC file from HDFS, instead of taking advantage of the columnar format and
reading just the columns I was requesting.
The physical size of the partition I am querying is 57GB in size.
The Spark UI shows that it read the full 57GB, even though I queried a handful
of columns.
I tried the equivalent query in standard hive. The Job Tracker showed that it
only read 1GB of data from HDFS, which is closer to what I was expecting (and
results in 57x less network/disk I/O).
Would be great if Spark SQL would perform similarly to regular Hive.
I will attach a couple screenshots from the Spark UI and the Job Tracker.
Here was my PySpark command
sqlc = HiveContext(sc)
rdd = sqlc.sql("SELECT "
" r.ts, "
" r.pid, r.sid, r.aid, r.netid "
"FROM "
" orc_rt.orc_ai_6000 r "
"WHERE "
" r.year=2014 and r.MONTH=9 and r.DAY=11 and hour(ts) == 12 "
"LIMIT 1000")
rdd.collect()
and here was what I ran in hive
select ts, pid, sid, aid, netid from orc_rt.orc_ai_6000 where year=2014 and
month=9 and day=11 and hour(ts) == 12 limit 1000;
Software
=======
Spark 1.1.0
Mesos 0.18.2
Hive 0.12
IPython 1.2.1
Python 2.7.6
MaprV3
Docker 1.1.2 (mesos/spark running in a docker container)
Kernel: Linux 2.6.32-431.23.3.el6.x86_64 #1 SMP Thu Jul 31 17:20:51 UTC 2014
x86_64 x86_64 x86_64 GNU/Linux
Host OS: CentOS release 6.5 (Final) (running under a XEN Hypervisor)
> Spark Support for ORCFile format
> --------------------------------
>
> Key: SPARK-2883
> URL: https://issues.apache.org/jira/browse/SPARK-2883
> Project: Spark
> Issue Type: Bug
> Components: Input/Output, SQL
> Reporter: Zhan Zhang
>
> Verify the support of OrcInputFormat in spark, fix issues if exists and add
> documentation of its usage.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]