Re: Discussion: change default compressor to ZSTD

Ravindra Pesala Fri, 07 Feb 2020 03:24:49 -0800

Hi Jacky,

As per the original PR
https://github.com/apache/carbondata/pull/2628 , query performance
decreased by 20% ~ 50% compared to snappy, so I am concerned about
performance. It would be better to run a proper TPCH performance report on
the regular cluster, as we do for every version, and decide based on that.


Regards,
Ravindra.

On Fri, 7 Feb 2020 at 10:40 AM, Jacky Li <[email protected]> wrote:

> Hi Ajantha,
>
>
> Yes, the decoder will use the compressorName stored in ChunkCompressionMeta
> from the file header, but I think it is better to put it in the file name
> so that users can see the compressor from the shell without launching an
> engine to read the file.
>
>
> In Spark, for parquet/orc the file name written is:
> part-00115-e2758995-4b10-4bd2-bf15-b4c176e587fe-c000.snappy.orc
>
>
> In PR3606, I will handle the compatibility.
>
>
> Regards,
> Jacky
>
>
> ------------------ Original Message ------------------
> From: "Ajantha Bhat"<[email protected]>;
> Sent: Thursday, February 6, 2020, 11:51 PM
> To: "dev"<[email protected]>;
>
> Subject: Re: Discussion: change default compressor to ZSTD
>
>
>
> Hi,
>
> 33% is a huge reduction in store size. If there is negligible difference in
> load and query time, we should definitely go for it.
>
> Also, does the user really need to know which compression is used? A change
> in the file name may require compatibility handling.
> The thrift *FileHeader, ChunkCompressionMeta* already stores the compressor
> name, so query-time decoding can be based on this.
>
> Thanks,
> Ajantha
>
>
> On Thu, Feb 6, 2020 at 4:27 PM Jacky Li <[email protected]> wrote:
>
> > Hi,
> >
> >
> > I compared the snappy and zstd compressors using TPCH for carbondata.
> >
> >
> > For TPCH lineitem table:
> >               carbon-zstd   carbon-snappy
> > loading (s)   53            51
> > size          795MB         1.2GB
> >
> > TPCH-query:
> >               carbon-zstd   carbon-snappy
> > Q1            4.289         8.29
> > Q2            12.609        12.986
> > Q3            14.902        14.458
> > Q4            6.276         5.954
> > Q5            23.147        21.946
> > Q6            1.12          0.945
> > Q7            23.017        28.007
> > Q8            14.554        15.077
> > Q9            28.472        27.473
> > Q10           24.067        24.682
> > Q11           3.321         3.79
> > Q12           5.311         5.185
> > Q13           14.08         11.84
> > Q14           2.262         2.087
> > Q15           5.496         4.772
> > Q16           29.919        29.833
> > Q17           7.018         7.057
> > Q18           17.367        17.795
> > Q19           2.931         2.865
> > Q20           11.347        10.937
> > Q21           26.416        28.414
> > Q22           5.923         6.311
> > sum           283.844       290.704
> >
> >
> > As you can see, after using zstd, the table size is reduced by 33% compared
> > to snappy, and the difference in data loading and query time is negligible.
> > So I suggest changing the default compressor in carbondata from snappy to
> > zstd.
> >
> >
> > To change the default compressor, we need to:
> > 1. Append the compressor name to the carbondata file name, so that the
> > user can tell from the file name which compressor is used.
> > For example, the file name will change from
> >   part-0-0_batchno0-0-0-1580982686749.carbondata
> > to
> >   part-0-0_batchno0-0-0-1580982686749.snappy.carbondata
> > or
> >   part-0-0_batchno0-0-0-1580982686749.zstd.carbondata
> >
> >
> > 2. Change the compressor constant in the CarbonCommonConstants.java file
> > to use zstd as the default compressor.
> >
> >
> > What do you think?
> >
> >
> > Regards,
> > Jacky
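
The file naming scheme proposed in point 1 above can be illustrated with a
minimal Python sketch. The helper name `compressor_from_filename`, the
`KNOWN_COMPRESSORS` set (limited to the two names mentioned in this thread),
and the choice to return `None` for legacy names are assumptions made for
illustration; this is not CarbonData code.

```python
from typing import Optional

# Compressor names mentioned in this thread; the real list may differ (assumption).
KNOWN_COMPRESSORS = {"snappy", "zstd"}


def compressor_from_filename(file_name: str) -> Optional[str]:
    """Return the compressor encoded in a carbondata file name, if any.

    Hypothetical helper. Legacy names such as
    part-0-0_batchno0-0-0-1580982686749.carbondata carry no compressor
    suffix, so None is returned; a caller would then fall back to the
    compressor name stored in the file header.
    """
    parts = file_name.split(".")
    # The proposed scheme is "<base>.<compressor>.carbondata".
    if (
        len(parts) >= 3
        and parts[-1] == "carbondata"
        and parts[-2] in KNOWN_COMPRESSORS
    ):
        return parts[-2]
    return None
```

Returning `None` for old-style names is one possible way to keep files
written before the change readable, since the compressor name is still
available from the file header.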

-- 
Thanks & Regards,
Ravi
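
The TPCH figures quoted in this thread can be compared per query as well as
in total, which is relevant to the performance concern raised above. The
following is a minimal Python sketch; the timings are copied from the quoted
table, while the function names and the assumption that 1.2GB equals 1200MB
are illustrative choices, not part of the original benchmark.

```python
# (carbon-zstd, carbon-snappy) timings per TPCH query, copied from this thread.
QUERY_TIMES = {
    "Q1": (4.289, 8.29), "Q2": (12.609, 12.986), "Q3": (14.902, 14.458),
    "Q4": (6.276, 5.954), "Q5": (23.147, 21.946), "Q6": (1.12, 0.945),
    "Q7": (23.017, 28.007), "Q8": (14.554, 15.077), "Q9": (28.472, 27.473),
    "Q10": (24.067, 24.682), "Q11": (3.321, 3.79), "Q12": (5.311, 5.185),
    "Q13": (14.08, 11.84), "Q14": (2.262, 2.087), "Q15": (5.496, 4.772),
    "Q16": (29.919, 29.833), "Q17": (7.018, 7.057), "Q18": (17.367, 17.795),
    "Q19": (2.931, 2.865), "Q20": (11.347, 10.937), "Q21": (26.416, 28.414),
    "Q22": (5.923, 6.311),
}


def totals(times):
    """Return (zstd_total, snappy_total), rounded to three decimals."""
    zstd_total = sum(z for z, _ in times.values())
    snappy_total = sum(s for _, s in times.values())
    return round(zstd_total, 3), round(snappy_total, 3)


def relative_change(times):
    """Per-query zstd time relative to snappy; positive means zstd is slower."""
    return {q: (z - s) / s for q, (z, s) in times.items()}


def size_reduction(zstd_mb, snappy_mb):
    """Fraction by which the zstd table is smaller than the snappy table."""
    return 1.0 - zstd_mb / snappy_mb


if __name__ == "__main__":
    print("totals (zstd, snappy):", totals(QUERY_TIMES))
    changes = relative_change(QUERY_TIMES)
    for query, change in sorted(changes.items(), key=lambda kv: kv[1]):
        print(f"{query}: {change:+.1%}")
    # Assumes 1.2GB == 1200MB; with 1024MB per GB the figure is slightly larger.
    print(f"size reduction: {size_reduction(795, 1200):.1%}")
```

Looking at the per-query changes rather than only the sum shows which
queries get slower with zstd even when the total is lower, which a full
report on the regular cluster would need to confirm.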
