[
https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14171654#comment-14171654
]
Zhan Zhang commented on SPARK-2883:
-----------------------------------
I have almost finished the prototype; the following is the draft spec for this JIRA.
I will wrap up my patch and upload it soon. The patch is small, less than 1,000
lines of code including tests. But since another PR is already open under a JIRA
that duplicates this one, I may work with the author of that JIRA to consolidate
our work.
1. Basic operators: saveAsOrcFile and orcFile. The former saves a table as an
ORC-format file, and the latter imports an ORC-format file into a Spark SQL
table.
2. Self-contained schema support: the ORC support is fully functional and
independent of the Hive metastore. The table schema is maintained by the ORC
file itself.
3. To use ORC files, the user needs to import
org.apache.spark.sql.hive.orc._ to bring ORC support into context.
4. ORC files are operated on in HiveContext. The only reason is a packaging
issue: we don't want to bring the Hive dependency into Spark SQL core. Note
that ORC operations do not rely on the Hive metastore.
5. It supports the full set of complex data types in Spark SQL, for example
list, seq, and nested data types.
The following is an example of using ORC files, which from the user's
perspective is almost identical to the Parquet format support.
import org.apache.spark.sql._
import org.apache.spark.sql.hive.orc._

val ctx = new org.apache.spark.sql.hive.HiveContext(sc)

// Build a SchemaRDD from a plain text file.
val people = sc.textFile("examples/src/main/resources/people.txt")
val schemaString = "name age"
val schema = StructType(schemaString.split(" ").map(fieldName =>
  StructField(fieldName, StringType, true)))
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
val peopleSchemaRDD = ctx.applySchema(rowRDD, schema)
peopleSchemaRDD.registerTempTable("people")
val results = ctx.sql("SELECT name FROM people")
results.map(t => "Name: " + t(0)).collect().foreach(println)

// Save the table as an ORC file and read it back; the schema is stored
// in the ORC file itself, so no metastore is involved.
peopleSchemaRDD.saveAsOrcFile("people.orc")
val orcFile = ctx.orcFile("people.orc")
orcFile.registerTempTable("orcFile")
val teenagers = ctx.sql("SELECT name FROM orcFile WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
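Point 5 above (complex data types) could be exercised with a sketch like the
following. The nested schema, field names, and file path here are illustrative
assumptions, reusing the sc and ctx from the example above and assuming the
same saveAsOrcFile/orcFile API:

```scala
import org.apache.spark.sql._
import org.apache.spark.sql.hive.orc._

// Hypothetical nested schema: a person with a list of phone numbers
// and a nested address struct.
val nestedSchema = StructType(Seq(
  StructField("name", StringType, true),
  StructField("phones", ArrayType(StringType), true),
  StructField("address", StructType(Seq(
    StructField("city", StringType, true),
    StructField("zip", StringType, true))), true)))

val rows = sc.parallelize(Seq(
  Row("Michael", Seq("555-1234"), Row("Palo Alto", "94301"))))

val nestedRDD = ctx.applySchema(rows, nestedSchema)
nestedRDD.saveAsOrcFile("people_nested.orc")

// Read it back; the nested schema survives the round trip because it
// is stored in the ORC file itself.
val loaded = ctx.orcFile("people_nested.orc")
loaded.registerTempTable("people_nested")
ctx.sql("SELECT address.city FROM people_nested").collect().foreach(println)
```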
Hive 0.13.1 support.
With a minor change, after Spark's Hive dependency is upgraded to 0.13.1:
1. ORC can support different compression codecs, e.g., SNAPPY, LZO, ZLIB,
and NONE.
2. Predicate pushdown.
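For the compression codecs above, one plausible way to select a codec is
through the Hive configuration that the ORC writer reads. This is a sketch,
assuming the Hive 0.13.x property name hive.exec.orc.default.compress and
reusing the ctx and peopleSchemaRDD from the earlier example:

```scala
// Assumption: the ORC writer picks up the default compression codec from
// the Hive configuration (hive.exec.orc.default.compress in Hive 0.13.x).
ctx.setConf("hive.exec.orc.default.compress", "SNAPPY") // or LZO, ZLIB, NONE
peopleSchemaRDD.saveAsOrcFile("people_snappy.orc")
```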
> Spark Support for ORCFile format
> --------------------------------
>
> Key: SPARK-2883
> URL: https://issues.apache.org/jira/browse/SPARK-2883
> Project: Spark
> Issue Type: Bug
> Components: Input/Output, SQL
> Reporter: Zhan Zhang
> Priority: Blocker
> Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19
> pm jobtracker.png
>
>
> Verify the support of OrcInputFormat in spark, fix issues if exists and add
> documentation of its usage.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)