Re: Cache sparkSql data without uncompressing it in memory

Cheng Lian Thu, 13 Nov 2014 19:51:59 -0800

No, the columnar buffer is built in small batches; the batch size is controlled by the |spark.sql.inMemoryColumnarStorage.batchSize| property. The default value in master and branch-1.2 is 10,000 rows per batch.
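
To make the batching idea concrete, here is a toy Python sketch. It is not Spark's code: zlib stands in for the real columnar compression schemes, the function names and sample data are made up for illustration, and only the 10,000-row figure comes from the default mentioned above. The point it tries to show is that rows are pulled lazily, pivoted into columns one batch at a time, and compressed right away, so at most one batch of uncompressed rows is held at any moment.

```python
import zlib
from itertools import islice

# Mirrors the default quoted above for
# spark.sql.inMemoryColumnarStorage.batchSize (10,000 rows per batch).
BATCH_SIZE = 10_000


def iter_batches(rows, batch_size=BATCH_SIZE):
    """Yield lists of at most batch_size rows, pulling lazily from rows."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def build_compressed_columns(batch):
    """Pivot one batch of row tuples into columns and compress each column.

    zlib is only a stand-in here; it is an assumption of this sketch,
    not the codec Spark SQL uses.
    """
    columns = zip(*batch)  # row-major -> column-major
    return [
        zlib.compress("\n".join(map(str, col)).encode("utf-8"))
        for col in columns
    ]


def cache_rows(rows, batch_size=BATCH_SIZE):
    """Return one list of compressed columns per batch."""
    return [build_compressed_columns(b) for b in iter_batches(rows, batch_size)]


def decode_column(blob):
    """Decompress one column back into a list of string values."""
    return zlib.decompress(blob).decode("utf-8").split("\n")


# Hypothetical input: 25,000 rows of (id, category), produced lazily.
rows = ((i, "cat%d" % (i % 3)) for i in range(25_000))
cached = cache_rows(rows)
print(len(cached))  # 3 batches: 10,000 + 10,000 + 5,000 rows
```

With a smaller batch_size the peak amount of uncompressed data shrinks, at the cost of more (and individually less compressible) batches.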


On 11/14/14 1:27 AM, Sadhan Sood wrote:

Thanks Cheng. Just one more question: does that mean we still need enough memory in the cluster to uncompress the data before it can be compressed again, or does it just read the raw data as is?

On Wed, Nov 12, 2014 at 10:05 PM, Cheng Lian <lian.cs....@gmail.com> wrote:


    Currently there’s no way to cache the compressed sequence file
    directly. Spark SQL uses an in-memory columnar format when caching
    table rows, so we must read all the raw data and convert it into
    columnar format. However, you can enable in-memory columnar
    compression by setting
    |spark.sql.inMemoryColumnarStorage.compressed| to |true|. This
    property is already set to true by default in the master branch
    and branch-1.2.

    On 11/13/14 7:16 AM, Sadhan Sood wrote:

    We noticed that when caching data from our Hive tables, which
    store data in compressed sequence file format, the data gets
    uncompressed in memory as it is cached. Is there a way to turn
    this off and cache the compressed data as is?
