Hi Xuchuanyin, I think this is a good proposal. GC is a performance killer for Carbon, since Carbon needs to sort data when loading.
Regards,
Jacky

> On December 20, 2017, at 11:47 AM, 徐传印 <[email protected]> wrote:
>
> Hi, dev:
>
> PS: Sorry for the bad formatting in the previous two emails; please refer
> to this one.
>
> Recently I found a bug in compressing the sort temp file and tried to fix
> it in PR#1632 (https://github.com/apache/carbondata/pull/1632). In this PR,
> CarbonData compresses the records in batches and writes the compressed
> content to the file if this feature is turned on. However, I found that the
> GC performance is terrible. In my scenario, about half of the time was
> spent in GC, and the overall performance is worse than before.
>
> I think the problem may lie in compressing the records by batch. Instead, I
> propose to compress the sort temp file at the file level, not at the
> record-batch level.
>
> 1. Compared with uncompressed files, compressing at the record-batch level
> leads to a different file layout, and it also affects the reading/writing
> behavior.
>
> The compressed layout:
> |total_entry_number|batch_entry_number|compressed_length|compressed_content|batch_entry_number|compressed_length|compressed_content|...
>
> The uncompressed layout:
> |total_entry_number|record|record|...
>
> 2. While compressing/uncompressing a record batch, we have to store the
> bytes in temporary memory. If the size is big, the buffer goes directly
> into the JVM old generation, which causes frequent full GCs. I also tried
> to reuse this temporary memory, but it can only be reused at the file
> level: we need to allocate the memory for each file. If the number of
> intermediate files is big, frequent full GC is still inevitable.
>
> If the size is small, we need to store more `batch_entry_number` entries
> (described in point 1).
>
> Note that the size is rowSize*batchSize. In the previous implementation,
> CarbonData uses 2 MB to store one row.
>
> 3. Using file-level compression will simplify the code, since a
> CompressedStream is also a stream, so it does not affect the behavior of
> reading/writing compressed/uncompressed files.
>
> 4. After I switched to file-level compression, the GC problem disappeared.
> Since my cluster has crashed, I could not measure the actual performance
> improvement. But judging from the CarbonData maven tests, the most
> time-consuming module, `Spark Common Test`, takes less time to complete
> with SNAPPY or LZ4 than without compression.
>
> Time consumed in the `Spark Common Test` module:
>
> | Compressor | Time Consumed |
> | --- | --- |
> | None | 19:25 min |
> | SNAPPY | 18:38 min |
> | LZ4 | 19:12 min |
> | GZIP | 20:32 min |
> | BZIP2 | 21:10 min |
>
> In conclusion, I think file-level compression is better, and I plan to
> remove the record-batch-level compression code from CarbonData.
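
To make point 3 of the quoted proposal concrete, here is a minimal sketch of file-level compression done by wrapping the file stream once. It is not CarbonData's actual code: the class name `SortTempFileSketch`, its method names, and the per-record length prefix are all assumptions made for illustration, and GZIP is used only because it ships with the JDK (SNAPPY, LZ4, or BZIP2 streams from their own libraries would plug in at the same place).

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical sketch of file-level compression; not CarbonData's real classes. */
final class SortTempFileSketch {

  // Compression is decided once per file by wrapping the raw stream.
  static DataOutputStream openForWrite(Path file, boolean compress) throws IOException {
    OutputStream out = new BufferedOutputStream(Files.newOutputStream(file));
    if (compress) {
      out = new GZIPOutputStream(out);
    }
    return new DataOutputStream(out);
  }

  static DataInputStream openForRead(Path file, boolean compress) throws IOException {
    InputStream in = new BufferedInputStream(Files.newInputStream(file));
    if (compress) {
      in = new GZIPInputStream(in);
    }
    return new DataInputStream(in);
  }

  // Logical layout is the uncompressed one: |total_entry_number|record|record|...
  // The same code runs whether or not the stream is compressed.
  static void writeRecords(Path file, List<byte[]> records, boolean compress)
      throws IOException {
    try (DataOutputStream out = openForWrite(file, compress)) {
      out.writeInt(records.size());
      for (byte[] record : records) {
        out.writeInt(record.length); // length prefix: an assumption of this sketch
        out.write(record);
      }
    }
  }

  static List<byte[]> readRecords(Path file, boolean compress) throws IOException {
    try (DataInputStream in = openForRead(file, compress)) {
      int total = in.readInt();
      List<byte[]> records = new ArrayList<>(total);
      for (int i = 0; i < total; i++) {
        byte[] record = new byte[in.readInt()];
        in.readFully(record);
        records.add(record);
      }
      return records;
    }
  }
}
```

With this shape the writer and reader never hold a whole record batch in a large temporary array; the compressor works through its own small internal buffers as bytes stream past, which is the property the proposal relies on to avoid the old-generation allocations.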
