GitHub user aray opened a pull request:

    https://github.com/apache/spark/pull/15898

    [SPARK-18457][SQL] ORC and other columnar formats using HiveShim read all columns when doing a simple count

    ## What changes were proposed in this pull request?
    
    When reading zero columns (e.g., for a simple `count(*)`) from ORC or any other format that uses HiveShim, actually set Hive's read-column list to an empty list instead of leaving it unset, so Hive does not fall back to reading every column.
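
    For context, the mechanism the shim controls is Hive's read-column configuration. The snippet below is only an illustration of those standard keys (it is not the code changed in this PR): roughly, Hive defaults to reading every column unless told otherwise, while an explicitly empty column-id list means no column data needs to be materialized.
    ```scala
    import org.apache.hadoop.conf.Configuration

    // Illustration only: the standard Hive read-column keys.
    val conf = new Configuration()
    // Don't read all columns...
    conf.setBoolean("hive.io.file.read.all.columns", false)
    // ...and request an explicitly empty set of column ids/names,
    // so a count(*) scan materializes no column data.
    conf.set("hive.io.file.readcolumn.ids", "")
    conf.set("hive.io.file.readcolumn.names", "")
    ```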
    
    ## How was this patch tested?
    
    Query correctness is handled by existing unit tests. I'm happy to add more if anyone can point out a case that is not covered.
    
    The reduction in data read can be verified in the UI when Spark is built with a recent version of Hadoop, for example:
    ```
    build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.0 -Phive -DskipTests clean package
    ```
    However, the default Hadoop 2.2 used for unit tests does not report actual bytes read and instead reports full file sizes (see FileScanRDD.scala line 80). Therefore I don't think there is a good way to add a unit test for this.
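
    As a rough programmatic cross-check of the UI numbers (a sketch only, not part of this PR, and subject to the same Hadoop-version caveat above), one could register a listener and sum the input bytes reported by completed tasks:
    ```scala
    import java.util.concurrent.atomic.AtomicLong
    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    // Hypothetical helper: accumulates the input bytes read by all tasks
    // that finish while the listener is registered.
    class BytesReadListener extends SparkListener {
      val bytesRead = new AtomicLong(0L)
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        if (taskEnd.taskMetrics != null) {
          bytesRead.addAndGet(taskEnd.taskMetrics.inputMetrics.bytesRead)
        }
      }
    }

    val listener = new BytesReadListener
    spark.sparkContext.addSparkListener(listener)
    sql("select count(*) from orc_test").collect()
    println(s"input bytes read: ${listener.bytesRead.get}")
    ```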
    
    I tested with the following setup, using the above build options:
    ```
    case class OrcData(intField: Long, stringField: String)
    spark.range(1, 1000000).map(i => OrcData(i, s"part-$i")).toDF().write.format("orc").save("orc_test")

    sql(
      s"""CREATE EXTERNAL TABLE orc_test(
         |  intField LONG,
         |  stringField STRING
         |)
         |STORED AS ORC
         |LOCATION '${System.getProperty("user.dir") + "/orc_test"}'
       """.stripMargin)
    ```
    
    ## Results
    
    Query | Input data read (Spark 2.0.2) | Input data read (this PR)
    ---|---|---
    `sql("select count(*) from orc_test").collect`|4.4 MB|199.4 KB
    `sql("select intField from orc_test").collect`|743.4 KB|743.4 KB
    `sql("select * from orc_test").collect`|4.4 MB|4.4 MB


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/aray/spark sql-orc-no-col

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15898.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15898
    
----
commit d6dbd479ed382049ab80fe92558550a26277431e
Author: Andrew Ray <[email protected]>
Date:   2016-03-29T12:33:12Z

    meh

commit a4d1ce3e1e7b8602860d890ff3266ef464899a9b
Author: Andrew Ray <[email protected]>
Date:   2016-11-14T15:57:46Z

    Merge branch 'master' of https://github.com/apache/spark into sql-orc-no-col

commit 037ca1d7d2765c5104b90cb3fa623ca1bb24480d
Author: Andrew Ray <[email protected]>
Date:   2016-11-16T03:57:00Z

    update comment to be consistent with logic

----

