[ https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14171654#comment-14171654 ]
Zhan Zhang edited comment on SPARK-2883 at 10/14/14 11:04 PM:
--------------------------------------------------------------
I have almost finished the prototype; the draft spec for this JIRA follows. I will wrap up my patch and upload it soon. It is a small patch, around 1,000 lines of code including the test suite. Since another PR is already open under SPARK-3720, which duplicates this JIRA, I may work with that author to consolidate our work. But if anybody thinks opening another PR is better (not necessarily to commit), please let me know.

1. Basic operators: saveAsOrcFile and orcFile. The former saves a table to an ORC-format file; the latter imports an ORC-format file into a Spark SQL table.
2. Column pruning.
3. Self-contained schema support: the ORC support is fully functional independently of the Hive metastore. The table schema is maintained by the ORC file itself.
4. To enable ORC support, the user needs to import org.apache.spark.sql.hive.orc._ to bring it into context.
5. ORC files are operated on through HiveContext. The only reason is a packaging issue: we don't want to bring a Hive dependency into spark sql. Note that ORC operations do not rely on the Hive metastore.
6. It supports the full range of complex data types in Spark SQL, for example list, seq, and nested data types.

Hive 0.13.1 support (with minor changes, after Spark's Hive dependency is upgraded to 0.13.1):
1. ORC can support different compression methods, e.g., SNAPPY, LZO, ZLIB, and NONE.
2. Predicate pushdown.

The following example uses an ORC file; from the user's perspective it is almost identical to the Parquet format support.
import org.apache.spark.sql.hive.orc._
val ctx = new org.apache.spark.sql.hive.HiveContext(sc)

val people = sc.textFile("examples/src/main/resources/people.txt")
val schemaString = "name age"

// Build the schema programmatically and apply it to the raw RDD.
import org.apache.spark.sql._
val schema =
  StructType(
    schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
val peopleSchemaRDD = ctx.applySchema(rowRDD, schema)
peopleSchemaRDD.registerTempTable("people")

val results = ctx.sql("SELECT name FROM people")
results.map(t => "Name: " + t(0)).collect().foreach(println)

// Save as an ORC file, then read it back; the file carries its own schema.
peopleSchemaRDD.saveAsOrcFile("people.orc")
val orcFile = ctx.orcFile("people.orc")
orcFile.registerTempTable("orcFile")

val teenagers = ctx.sql("SELECT name FROM orcFile WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
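Point 6 above mentions nested data types. A minimal sketch of what that might look like, assuming the same saveAsOrcFile/orcFile operators and the Spark SQL type classes (StructType, ArrayType) used in the example above; the schema and file name here are hypothetical illustrations, not from the patch:

```scala
import org.apache.spark.sql._
import org.apache.spark.sql.hive.orc._

val ctx = new org.apache.spark.sql.hive.HiveContext(sc)

// Hypothetical nested schema: each person has a name and a list of phone numbers.
val nestedSchema = StructType(Seq(
  StructField("name", StringType, true),
  StructField("phones", ArrayType(StringType), true)))

val rows = sc.parallelize(Seq(
  Row("alice", Seq("555-0100", "555-0101")),
  Row("bob", Seq("555-0199"))))

val schemaRDD = ctx.applySchema(rows, nestedSchema)

// The nested schema travels with the ORC file itself (point 3 above),
// so no Hive metastore entry is needed to read it back.
schemaRDD.saveAsOrcFile("people_nested.orc")
val loaded = ctx.orcFile("people_nested.orc")
loaded.registerTempTable("people_nested")
ctx.sql("SELECT name FROM people_nested").collect().foreach(println)
```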
> Spark Support for ORCFile format
> --------------------------------
>
> Key: SPARK-2883
> URL: https://issues.apache.org/jira/browse/SPARK-2883
> Project: Spark
> Issue Type: Bug
> Components: Input/Output, SQL
> Reporter: Zhan Zhang
> Priority: Blocker
> Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19 pm jobtracker.png
>
> Verify the support of OrcInputFormat in spark, fix issues if exists and add documentation of its usage.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)