[ 
https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132472#comment-14132472
 ] 

Fi commented on SPARK-2883:
---------------------------

I was able to run a simple query and access my ORC hive table through the 
PySpark shell.

Happily, things seemed to work.

However, the problem is I/O efficiency.

Looking at the Spark:4040 UI, it is clear that the Spark is reading the entire 
ORC file from HDFS, instead of taking advantage of the columnar format and 
reading just the columns I was requesting.

The physical size of the partition I am querying is 57GB in size.
The Spark UI shows that it read the full 57GB, even though I queried a handful 
of columns.

I tried the equivalent query in standard hive. The Job Tracker showed that it 
only read 1GB of data from HDFS, which is closer to what I was expecting (and 
results in 57x less network/disk I/O).

Would be great if Spark SQL would perform similarly to regular Hive.

I will attach a couple screenshots from the Spark UI and the Job Tracker.

Here was my PySpark command

sqlc = HiveContext(sc)

rdd = sqlc.sql("SELECT "
               "  r.ts, "
               "  r.pid, r.sid, r.aid, r.netid "
               "FROM "
               "  orc_rt.orc_ai_6000 r "
               "WHERE "
               "  r.year=2014 and r.MONTH=9 and r.DAY=11 and hour(ts) == 12 "
               "LIMIT 1000")
rdd.collect()

and here was what I ran in hive

select ts, pid, sid, aid, netid from orc_rt.orc_ai_6000 where year=2014 and 
month=9 and day=11 and hour(ts) == 12 limit 1000;


Software
=======
Spark 1.1.0
Mesos 0.18.2
Hive 0.12
IPython 1.2.1
Python 2.7.6
MaprV3
Docker 1.1.2   (mesos/spark running in a docker container)
Kernel: Linux 2.6.32-431.23.3.el6.x86_64 #1 SMP Thu Jul 31 17:20:51 UTC 2014 
x86_64 x86_64 x86_64 GNU/Linux
Host OS: CentOS release 6.5 (Final)  (running under a XEN Hypervisor)


> Spark Support for ORCFile format
> --------------------------------
>
>                 Key: SPARK-2883
>                 URL: https://issues.apache.org/jira/browse/SPARK-2883
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output, SQL
>            Reporter: Zhan Zhang
>
> Verify the support of OrcInputFormat in spark, fix issues if exists and add 
> documentation of its usage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to