GitHub user jimfcarroll opened a pull request:

    https://github.com/apache/spark/pull/3254

    [SPARK-4386] Improve performance when writing Parquet files.

    If you profile the writing of a Parquet file, the single most 
time-consuming call inside 
org.apache.spark.sql.parquet.MutableRowWriteSupport.write is actually 
scala.collection.AbstractSequence.size. This is because the size call 
ends up COUNTING the elements, via 
scala.collection.LinearSeqOptimized.length ("optimized?").
    
    This doesn't need to be done: currently, "size" is called repeatedly 
wherever it's needed, rather than being called once at the top of the 
method and stored in a 'val'.
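    The pattern is easy to sketch. The snippet below is an illustration 
of the general fix, not the actual Spark patch: on a scala.List, `size` 
delegates to LinearSeqOptimized.length, which walks the list and counts 
elements on every call, so hoisting it into a 'val' turns repeated O(n) 
traversals into one.

    ```scala
    object SizeCachingDemo {
      // Hypothetical example: recomputes row.size on every loop test.
      // For a List this makes the loop condition itself O(n) each time.
      def sumRepeated(row: Seq[Int]): Int = {
        var total = 0
        var i = 0
        while (i < row.size) {
          total += row(i)
          i += 1
        }
        total
      }

      // Same loop, but size is computed once and stored in a 'val'.
      def sumCached(row: Seq[Int]): Int = {
        val n = row.size
        var total = 0
        var i = 0
        while (i < n) {
          total += row(i)
          i += 1
        }
        total
      }

      def main(args: Array[String]): Unit = {
        val row: Seq[Int] = List(1, 2, 3, 4, 5)
        println(sumRepeated(row)) // 15
        println(sumCached(row))   // 15
      }
    }
    ```

    Both versions return the same result; the difference is only in how 
many times the linear-time length computation runs.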


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jimfcarroll/spark parquet-perf

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3254.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3254
    
----
commit 30cc0b592789befb7e212783846624a8a4d4381f
Author: Jim Carroll <[email protected]>
Date:   2014-11-13T20:40:52Z

    Improve performance when writing Parquet files.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
