[
https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14171654#comment-14171654
]
Zhan Zhang commented on SPARK-2883:
-----------------------------------
I have almost finished the prototype; the following is the draft spec for this JIRA.
I will wrap up my patch and upload it soon. The patch is small, less than 1,000
lines of code including tests. But since another PR is already open under a JIRA
that duplicates this one, I may work with the author of that JIRA to consolidate
our work.
1. Basic operators: saveAsOrcFile and orcFile. The former saves a table as an
ORC-format file, and the latter imports an ORC-format file into a Spark SQL
table.
2. Self-contained schema support: the ORC support is fully functional and
independent of the Hive metastore. The table schema is maintained by the ORC
file itself.
3. To use ORC files, the user needs to import
org.apache.spark.sql.hive.orc._ to bring ORC support into context.
4. ORC files are operated on in HiveContext. The only reason is a packaging
issue: we don't want to bring the Hive dependency into Spark SQL core. Note
that ORC operations do not rely on the Hive metastore.
5. It supports the full set of complex data types in Spark SQL, for example
list, seq, and nested data types.
The following is an example of using ORC files, which from the user's
perspective is almost identical to the Parquet format support.
import org.apache.spark.sql._
import org.apache.spark.sql.hive.orc._

val ctx = new org.apache.spark.sql.hive.HiveContext(sc)

// Build a SchemaRDD from a plain text file.
val people = sc.textFile("examples/src/main/resources/people.txt")
val schemaString = "name age"
val schema = StructType(schemaString.split(" ").map(fieldName =>
  StructField(fieldName, StringType, true)))
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
val peopleSchemaRDD = ctx.applySchema(rowRDD, schema)
peopleSchemaRDD.registerTempTable("people")
val results = ctx.sql("SELECT name FROM people")
results.map(t => "Name: " + t(0)).collect().foreach(println)

// Save the table as an ORC file and read it back; the schema is stored
// in the ORC file itself, so no metastore is involved.
peopleSchemaRDD.saveAsOrcFile("people.orc")
val orcFile = ctx.orcFile("people.orc")
orcFile.registerTempTable("orcFile")
val teenagers = ctx.sql("SELECT name FROM orcFile WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
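Point 5 above (complex data types) could be exercised with a sketch like the
following. The nested schema, field names, and file path here are illustrative
assumptions, reusing the sc and ctx from the example above and assuming the
same saveAsOrcFile/orcFile API:

```scala
import org.apache.spark.sql._
import org.apache.spark.sql.hive.orc._

// Hypothetical nested schema: a person with a list of phone numbers
// and a nested address struct.
val nestedSchema = StructType(Seq(
  StructField("name", StringType, true),
  StructField("phones", ArrayType(StringType), true),
  StructField("address", StructType(Seq(
    StructField("city", StringType, true),
    StructField("zip", StringType, true))), true)))

val rows = sc.parallelize(Seq(
  Row("Michael", Seq("555-1234"), Row("Palo Alto", "94301"))))

val nestedRDD = ctx.applySchema(rows, nestedSchema)
nestedRDD.saveAsOrcFile("people_nested.orc")

// Read it back; the nested schema survives the round trip because it
// is stored in the ORC file itself.
val loaded = ctx.orcFile("people_nested.orc")
loaded.registerTempTable("people_nested")
ctx.sql("SELECT address.city FROM people_nested").collect().foreach(println)
```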
Hive 0.13.1 support.
With a minor change, after Spark's Hive dependency is upgraded to 0.13.1:
1. ORC can support different compression codecs, e.g., SNAPPY, LZO, ZLIB,
and NONE.
2. Predicate pushdown.
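For the compression codecs above, one plausible way to select a codec is
through the Hive configuration that the ORC writer reads. This is a sketch,
assuming the Hive 0.13.x property name hive.exec.orc.default.compress and
reusing the ctx and peopleSchemaRDD from the earlier example:

```scala
// Assumption: the ORC writer picks up the default compression codec from
// the Hive configuration (hive.exec.orc.default.compress in Hive 0.13.x).
ctx.setConf("hive.exec.orc.default.compress", "SNAPPY") // or LZO, ZLIB, NONE
peopleSchemaRDD.saveAsOrcFile("people_snappy.orc")
```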
> Spark Support for ORCFile format
> --------------------------------
>
> Key: SPARK-2883
> URL: https://issues.apache.org/jira/browse/SPARK-2883
> Project: Spark
> Issue Type: Bug
> Components: Input/Output, SQL
> Reporter: Zhan Zhang
> Priority: Blocker
> Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19
> pm jobtracker.png
>
>
> Verify the support of OrcInputFormat in spark, fix issues if exists and add
> documentation of its usage.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)