[ 
https://issues.apache.org/jira/browse/HIVE-7333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209358#comment-14209358
 ] 

Kiran Lonikar commented on HIVE-7333:
-------------------------------------

Thanks. Considering what Reynold said, I looked into the Spark SQL docs; see
https://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory

It says that caching in the in-memory columnar format (the one Reynold was
alluding to) is enabled by calling cacheTable with the name under which the
SchemaRDD is registered as a table. I think the same is true of the SQL
interface's "CACHE TABLE tableName" command.
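
To make the idea of a columnar cache concrete, here is a small conceptual
sketch in plain Python. It is not Spark's implementation, and the names in it
(rows, to_columns, cached) are hypothetical; it only shows why pivoting rows
into one typed array per column helps a query that touches a single column.

```python
# Conceptual sketch only; this is NOT Spark's in-memory columnar code.
from array import array

# Hypothetical (id, price) rows, as a row-oriented source would hand them over.
rows = [(1, 10.0), (2, 20.5), (3, 30.25)]

def to_columns(rows):
    """Pivot row tuples into one typed, contiguous array per column."""
    ids = array("q", (r[0] for r in rows))     # signed 64-bit integers
    prices = array("d", (r[1] for r in rows))  # 64-bit floats
    return {"id": ids, "price": prices}

cached = to_columns(rows)

# A query that only needs `price` scans one contiguous array
# and never touches the `id` values.
total_price = sum(cached["price"])
```

The pivot is paid once, when the table is cached; later scans of a single
column then read only that column's array.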

I think you can re-run your performance tests after caching the tables this
way.

I think looking at the code behind Spark SQL's parquetFile may also help with
reading multiple rows at the same time, so that performance improves on the
read path as well.
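
As a rough illustration of reading several rows at a time, the sketch below
groups a row iterator into fixed-size batches and aggregates per batch. It is
plain Python with hypothetical names (read_batches, sum_column, sample) and
does not reflect how parquetFile is actually implemented.

```python
# Conceptual sketch only; names and batch size are assumptions.
from itertools import islice

def read_batches(rows, batch_size):
    """Yield lists of up to batch_size rows from any row iterable."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def sum_column(rows, index, batch_size=1024):
    """Sum one column, paying per-batch rather than per-row overhead."""
    total = 0
    for batch in read_batches(rows, batch_size):
        total += sum(row[index] for row in batch)
    return total

sample = [(i, i * 2) for i in range(10)]
batches = list(read_batches(sample, 4))
```

With a batch size of 4, the ten sample rows arrive as batches of 4, 4 and 2
rows; the operator loop then runs once per batch instead of once per row.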

Vectorization has another potential benefit: vectorized operators could also
be run on GPUs.

> Create RDD translator, translating Hive Tables into Spark RDDs [Spark Branch]
> -----------------------------------------------------------------------------
>
>                 Key: HIVE-7333
>                 URL: https://issues.apache.org/jira/browse/HIVE-7333
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Xuefu Zhang
>            Assignee: Rui Li
>              Labels: Spark-M1
>
> Please refer to the design specification.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
