Hi Jacky,

As per the original PR https://github.com/apache/carbondata/pull/2628, query performance with zstd decreased by 20% to 50% compared to snappy, so I am concerned about performance. Please prepare a proper TPC-H performance report on the regular cluster, as we do for every version, and let us decide based on that.

Regards,
Ravindra.

On Fri, 7 Feb 2020 at 10:40 AM, Jacky Li <[email protected]> wrote:

> Hi Ajantha,
>
> Yes, the decoder will use the compressorName stored in ChunkCompressionMeta
> from the file header, but I think it is better to put it in the file name
> so that the user can tell the compressor from the shell without launching
> an engine to read it.
>
> In Spark, for parquet/orc the file name written is:
> part-00115-e2758995-4b10-4bd2-bf15-b4c176e587fe-c000.snappy.orc
>
> In PR3606, I will handle the compatibility.
>
> Regards,
> Jacky
>
> ------------------ Original Message ------------------
> From: "Ajantha Bhat"<[email protected]>;
> Sent: Thursday, 6 Feb 2020, 11:51 PM
> To: "dev"<[email protected]>;
> Subject: Re: Discussion: change default compressor to ZSTD
>
> Hi,
>
> 33% is a huge reduction in store size. If there is a negligible difference
> in load and query time, we should definitely go for it.
>
> And does the user really need to know which compression is used? A change
> in the file name may require compatibility handling.
> The thrift *FileHeader, ChunkCompressionMeta* already stores the compressor
> name, so query-time decoding can be based on this.
>
> Thanks,
> Ajantha
>
> On Thu, Feb 6, 2020 at 4:27 PM Jacky Li <[email protected]> wrote:
>
> > Hi,
> >
> > I compared the snappy and zstd compressors using TPCH for carbondata.
> >
> > For the TPCH lineitem table:
> >
> >               carbon-zstd   carbon-snappy
> > loading (s)   53            51
> > size          795MB         1.2GB
> >
> > TPCH-query:
> >
> >        carbon-zstd   carbon-snappy
> > Q1     4.289         8.29
> > Q2     12.609        12.986
> > Q3     14.902        14.458
> > Q4     6.276         5.954
> > Q5     23.147        21.946
> > Q6     1.12          0.945
> > Q7     23.017        28.007
> > Q8     14.554        15.077
> > Q9     28.472        27.473
> > Q10    24.067        24.682
> > Q11    3.321         3.79
> > Q12    5.311         5.185
> > Q13    14.08         11.84
> > Q14    2.262         2.087
> > Q15    5.496         4.772
> > Q16    29.919        29.833
> > Q17    7.018         7.057
> > Q18    17.367        17.795
> > Q19    2.931         2.865
> > Q20    11.347        10.937
> > Q21    26.416        28.414
> > Q22    5.923         6.311
> > sum    283.844       290.704
> >
> > As you can see, after using zstd, table size is reduced by 33% compared to snappy.
> > And the data loading and query time difference is negligible. So I
> > suggest changing the default compressor in carbondata from snappy to zstd.
> >
> > To change the default compressor, we need to:
> > 1. Append the compressor name to the carbondata file name, so that the
> > user can tell from the file name which compressor is used.
> > For example, the file name will change from
> >   part-0-0_batchno0-0-0-1580982686749.carbondata
> > to
> >   part-0-0_batchno0-0-0-1580982686749.snappy.carbondata
> > or
> >   part-0-0_batchno0-0-0-1580982686749.zstd.carbondata
> >
> > 2. Change the compressor constant in the CarbonCommonConstaint.java file
> > to use zstd as the default compressor.
> >
> > What do you think?
> >
> > Regards,
> > Jacky

--
Thanks & Regards,
Ravi
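
For readers following the file-naming point in step 1 of the quoted proposal, here is a minimal sketch of how a compressor name could be added to, and read back from, a carbondata file name. It is only an illustration in Python, not CarbonData code: the function names and the KNOWN_COMPRESSORS set are hypothetical, and the fallback for old-style names assumes the caller would read the compressor from ChunkCompressionMeta instead, as discussed in the thread.

```python
from typing import Optional

# Hypothetical list: only the two compressors compared in this thread.
KNOWN_COMPRESSORS = {"snappy", "zstd"}
SUFFIX = ".carbondata"


def with_compressor(file_name: str, compressor: str) -> str:
    """Insert the compressor name before the .carbondata extension."""
    if not file_name.endswith(SUFFIX):
        raise ValueError(f"not a carbondata file: {file_name}")
    stem = file_name[: -len(SUFFIX)]
    return f"{stem}.{compressor}{SUFFIX}"


def compressor_from_name(file_name: str) -> Optional[str]:
    """Return the compressor encoded in the file name, if any.

    Old-style names carry no compressor; None signals that the caller
    should fall back to the compressor name stored in the file metadata.
    """
    if not file_name.endswith(SUFFIX):
        raise ValueError(f"not a carbondata file: {file_name}")
    stem = file_name[: -len(SUFFIX)]
    _, dot, last = stem.rpartition(".")
    if dot and last in KNOWN_COMPRESSORS:
        return last
    return None


if __name__ == "__main__":
    old_name = "part-0-0_batchno0-0-0-1580982686749.carbondata"
    new_name = with_compressor(old_name, "zstd")
    print(new_name)
    print(compressor_from_name(new_name))
    print(compressor_from_name(old_name))
```

Returning None for old-style names keeps existing files readable without renaming them, which is the compatibility concern raised by Ajantha.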
