[ https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14171654#comment-14171654 ]
Zhan Zhang edited comment on SPARK-2883 at 10/14/14 11:04 PM:
--------------------------------------------------------------
I have almost finished the prototype; the draft spec for this JIRA follows. I will wrap up my patch and upload it soon. It is a small patch, around 1,000 lines of code including the test suite. Since another PR is already open under SPARK-3720, which duplicates this JIRA, I may work with that author to consolidate our work. But if anybody thinks opening another PR is better (not necessarily to commit), please let me know.

1. Basic operators: saveAsOrcFile and orcFile. The former saves a table to an ORC-format file; the latter imports an ORC-format file into a Spark SQL table.
2. Column pruning.
3. Self-contained schema support: the ORC support is fully functional independently of the Hive metastore. The table schema is maintained by the ORC file itself.
4. To enable ORC support, the user needs to import org.apache.spark.sql.hive.orc._ to bring it into context.
5. ORC files are operated on through HiveContext. The only reason is a packaging issue: we don't want to bring a Hive dependency into spark sql. Note that ORC operations do not rely on the Hive metastore.
6. It supports the full range of complex data types in Spark SQL, for example list, seq, and nested data types.

Hive 0.13.1 support (with minor changes, after Spark's Hive dependency is upgraded to 0.13.1):
1. ORC can support different compression methods, e.g., SNAPPY, LZO, ZLIB, and NONE.
2. Predicate pushdown.

The following example uses an ORC file; from the user's perspective it is almost identical to the Parquet format support.
import org.apache.spark.sql.hive.orc._
val ctx = new org.apache.spark.sql.hive.HiveContext(sc)

val people = sc.textFile("examples/src/main/resources/people.txt")
val schemaString = "name age"

// Build the schema programmatically and apply it to the raw RDD.
import org.apache.spark.sql._
val schema =
  StructType(
    schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
val peopleSchemaRDD = ctx.applySchema(rowRDD, schema)
peopleSchemaRDD.registerTempTable("people")

val results = ctx.sql("SELECT name FROM people")
results.map(t => "Name: " + t(0)).collect().foreach(println)

// Save as an ORC file, then read it back; the file carries its own schema.
peopleSchemaRDD.saveAsOrcFile("people.orc")
val orcFile = ctx.orcFile("people.orc")
orcFile.registerTempTable("orcFile")

val teenagers = ctx.sql("SELECT name FROM orcFile WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
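Point 6 above mentions nested data types. A minimal sketch of what that might look like, assuming the same saveAsOrcFile/orcFile operators and the Spark SQL type classes (StructType, ArrayType) used in the example above; the schema and file name here are hypothetical illustrations, not from the patch:

```scala
import org.apache.spark.sql._
import org.apache.spark.sql.hive.orc._

val ctx = new org.apache.spark.sql.hive.HiveContext(sc)

// Hypothetical nested schema: each person has a name and a list of phone numbers.
val nestedSchema = StructType(Seq(
  StructField("name", StringType, true),
  StructField("phones", ArrayType(StringType), true)))

val rows = sc.parallelize(Seq(
  Row("alice", Seq("555-0100", "555-0101")),
  Row("bob", Seq("555-0199"))))

val schemaRDD = ctx.applySchema(rows, nestedSchema)

// The nested schema travels with the ORC file itself (point 3 above),
// so no Hive metastore entry is needed to read it back.
schemaRDD.saveAsOrcFile("people_nested.orc")
val loaded = ctx.orcFile("people_nested.orc")
loaded.registerTempTable("people_nested")
ctx.sql("SELECT name FROM people_nested").collect().foreach(println)
```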
> Spark Support for ORCFile format
> --------------------------------
>
> Key: SPARK-2883
> URL: https://issues.apache.org/jira/browse/SPARK-2883
> Project: Spark
> Issue Type: Bug
> Components: Input/Output, SQL
> Reporter: Zhan Zhang
> Priority: Blocker
> Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19 pm jobtracker.png
>
> Verify the support of OrcInputFormat in spark, fix issues if exists and add documentation of its usage.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)