Re: [Discussion] Compression for sort temp files in CarbonData

Jacky Li Sat, 23 Dec 2017 00:24:45 -0800

Hi Xuchuanyin,

I think this is a good proposal. GC is a performance killer for Carbon, since 
Carbon needs to sort data when loading.


Regards,
Jacky

> On 20 Dec 2017, at 11:47 AM, 徐传印 <[email protected]> wrote:
> 
> Hi, dev:
> 
> PS: Sorry for the bad formatting in my previous two emails; please refer to 
> this one.
> 
> Recently I found a bug in compressing sort temp files and tried to fix it 
> in PR#1632 (https://github.com/apache/carbondata/pull/1632). With this PR, 
> CarbonData compresses the records in batches and writes the compressed 
> content to the file when this feature is turned on. However, I found that 
> GC performance is terrible: in my scenario, about half of the time was 
> spent in GC, and the overall performance is worse than before.
> 
> I think the problem may lie in compressing the records by batch. Instead, 
> I propose compressing the sort temp file at file level, not at 
> record-batch level.
> 
> 1. Compared with uncompressed files, compressing at record-batch level 
> leads to a different file layout, and it also affects the reading/writing 
> behavior.
> 
> (The compressed: 
> |total_entry_number|batch_entry_numer|compressed_length|compressed_content|batch_entry_numer|compressed_length|compressed_content|...;
> 
> The uncompressed: |total_entry_number|record|record|...;)
> 
> 2. While compressing/uncompressing a record batch, we have to store the 
> bytes in temporary memory. If the size is big, the buffer goes directly 
> into the JVM old generation, which causes frequent full GCs. I also tried 
> to reuse this temporary memory, but it can only be reused at file level: 
> we need to allocate the memory for each file. If the number of 
> intermediate files is big, frequent full GC is still inevitable.
> 
> If the size is small, we will need to store more 
> `batch_entry_numer` fields (described in point 1).
> 
> Note that the size is rowSize*batchSize. In the previous implementation, 
> CarbonData used 2MB to store one row.
> 
> 3. Using file-level compression will simplify the code, since a 
> CompressedStream is also a Stream, so it will not affect the behavior of 
> reading/writing compressed/uncompressed files.
> 
> 4. After I switched to file-level compression, the GC problem disappeared. 
> Since my cluster has crashed, I couldn't measure the actual performance 
> improvement. But judging from the CarbonData Maven tests, the most 
> time-consuming module, `Spark Common Test`, takes less time to complete 
> with SNAPPY or LZ4 than uncompressed.
> 
> Time consumed in `Spark Common Test` module:
> 
> | Compressor | Time Consumed |
> | --- | --- |
> | None | 19:25 min |
> | SNAPPY | 18:38 min |
> | LZ4 | 19:12 min |
> | GZIP | 20:32 min |
> | BZIP2 | 21:10 min |
> 
> 
> In conclusion, I think file-level compression is better, and I plan to 
> remove the record-batch level compression code from CarbonData.
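
For illustration, point 3 above can be sketched as follows. This is a minimal sketch in Python for brevity (CarbonData itself is Java/Scala), not the code from the PR. The names `_WRAPPERS`, `write_sort_temp` and `read_sort_temp` are hypothetical, the length-prefixed record encoding is an assumption, and only codecs from the Python standard library are shown. The idea is that the compressor only wraps the underlying stream, so the reader and writer always see the uncompressed layout |total_entry_number|record|record|...

```python
import bz2
import gzip
import io
import struct

# Hypothetical mapping from compressor name to a stream wrapper.
# "NONE" hands back the raw stream unchanged.
_WRAPPERS = {
    "NONE": lambda raw, mode: raw,
    "GZIP": lambda raw, mode: gzip.GzipFile(fileobj=raw, mode=mode),
    "BZIP2": lambda raw, mode: bz2.BZ2File(raw, mode=mode),
}


def write_sort_temp(raw, records, compressor="NONE"):
    """Write |total_entry_number|record|record|... through an optional codec."""
    out = _WRAPPERS[compressor](raw, "wb")
    out.write(struct.pack(">i", len(records)))
    for rec in records:
        # Assumption: each record is length-prefixed so the reader can
        # find its end; the real record encoding is not described here.
        out.write(struct.pack(">i", len(rec)))
        out.write(rec)
    if out is not raw:
        out.close()  # flushes the codec trailer, leaves `raw` open


def read_sort_temp(raw, compressor="NONE"):
    """Read the records back; the logic is identical for every codec."""
    src = _WRAPPERS[compressor](raw, "rb")
    (total,) = struct.unpack(">i", src.read(4))
    records = []
    for _ in range(total):
        (size,) = struct.unpack(">i", src.read(4))
        records.append(src.read(size))
    return records


# Round trip with each codec, using an in-memory buffer as the "file".
records = [b"row-%d" % i for i in range(1000)]
for name in ("NONE", "GZIP", "BZIP2"):
    buf = io.BytesIO()
    write_sort_temp(buf, records, name)
    buf.seek(0)
    assert read_sort_temp(buf, name) == records
```

Because the codec sits below the reader and writer, there is no per-batch buffer of size rowSize*batchSize to allocate, and no `batch_entry_numer` or `compressed_length` fields in the file.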
