yuzawa-san commented on pull request #9499:
URL: https://github.com/apache/kafka/pull/9499#issuecomment-716620712


   @chia7712  @ijuma  The main reason I'm holding off on using the BufferSupplier is that part of the "after" benchmark still looks like the "before" benchmark:
   ```
   Benchmark                                                                                                                    (bufferSupplierStr)  (bytes)  (compressionType)  (maxBatchSize)  (messageSize)  (messageVersion)   Mode  Cnt       Score      Error   Units
   CompressedRecordBatchValidationBenchmark.measureValidateMessagesAndAssignOffsetsCompressed                                                CREATE   RANDOM               ZSTD             200           1000                 2  thrpt   15   26302.815 ± 2849.585   ops/s
   CompressedRecordBatchValidationBenchmark.measureValidateMessagesAndAssignOffsetsCompressed:·gc.alloc.rate                                 CREATE   RANDOM               ZSTD             200           1000                 2  thrpt   15    3613.977 ±  391.633  MB/sec
   CompressedRecordBatchValidationBenchmark.measureValidateMessagesAndAssignOffsetsCompressed:·gc.alloc.rate.norm                            CREATE   RANDOM               ZSTD             200           1000                 2  thrpt   15  151328.008 ±    0.012    B/op
   CompressedRecordBatchValidationBenchmark.measureValidateMessagesAndAssignOffsetsCompressed:·gc.churn.G1_Eden_Space                        CREATE   RANDOM               ZSTD             200           1000                 2  thrpt   15    3638.121 ±  394.668  MB/sec
   CompressedRecordBatchValidationBenchmark.measureValidateMessagesAndAssignOffsetsCompressed:·gc.churn.G1_Eden_Space.norm                   CREATE   RANDOM               ZSTD             200           1000                 2  thrpt   15  152340.806 ±  614.236    B/op
   CompressedRecordBatchValidationBenchmark.measureValidateMessagesAndAssignOffsetsCompressed:·gc.churn.G1_Survivor_Space                    CREATE   RANDOM               ZSTD             200           1000                 2  thrpt   15     164.799 ±   16.872  MB/sec
   CompressedRecordBatchValidationBenchmark.measureValidateMessagesAndAssignOffsetsCompressed:·gc.churn.G1_Survivor_Space.norm               CREATE   RANDOM               ZSTD             200           1000                 2  thrpt   15    6907.707 ±  146.892    B/op
   CompressedRecordBatchValidationBenchmark.measureValidateMessagesAndAssignOffsetsCompressed:·gc.count                                      CREATE   RANDOM               ZSTD             200           1000                 2  thrpt   15    1602.000             counts
   CompressedRecordBatchValidationBenchmark.measureValidateMessagesAndAssignOffsetsCompressed:·gc.time                                       CREATE   RANDOM               ZSTD             200           1000                 2  thrpt   15   80186.000                 ms
   JMH benchmarks done
   ```
   
   This is still slow because changes like these https://github.com/apache/kafka/pull/9229/files#diff-8c632aabf49c81fcae204e3cb479ae861c8243b721ebb2024b96faee468cc9bbL105 and https://github.com/apache/kafka/pull/9229/files#diff-8c632aabf49c81fcae204e3cb479ae861c8243b721ebb2024b96faee468cc9bbL401 are not merged. As a result, if the BufferSupplier were used today it would not actually recycle buffers in certain cases, in particular within the LogValidator, as the benchmark above shows. NOTE: the RecordBatchIterationBenchmark results look basically the same because that code uses the BufferSupplier provided by the JMH harness, so recycling does occur there. The CompressedRecordBatchValidationBenchmark does not, and fixing that is the crux of that other PR, which is indeed correct.
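   To make the recycling point concrete, here is a minimal, self-contained sketch (not Kafka's actual BufferSupplier code, just the same idea): a caching supplier hands back a previously released buffer, while a path that never threads the supplier through effectively allocates fresh on every call, which is what the GC churn above reflects.

   ```java
   import java.nio.ByteBuffer;
   import java.util.ArrayDeque;
   import java.util.Deque;

   // Illustrative sketch of buffer recycling (names are hypothetical, not
   // Kafka's real classes). A recycling supplier returns the same instance
   // that was released; without recycling, every get() is a new allocation.
   public class BufferRecyclingDemo {
       interface Supplier {
           ByteBuffer get(int capacity);
           void release(ByteBuffer buffer);
       }

       static final class CachingSupplier implements Supplier {
           private final Deque<ByteBuffer> cache = new ArrayDeque<>();

           @Override
           public ByteBuffer get(int capacity) {
               ByteBuffer cached = cache.pollFirst();
               if (cached != null && cached.capacity() >= capacity) {
                   cached.clear();
                   return cached;
               }
               return ByteBuffer.allocate(capacity);
           }

           @Override
           public void release(ByteBuffer buffer) {
               cache.addFirst(buffer);
           }
       }

       // True iff the supplier hands back the exact buffer that was released.
       static boolean reusesBuffer(Supplier supplier) {
           ByteBuffer first = supplier.get(1024);
           supplier.release(first);
           return supplier.get(1024) == first;
       }

       public static void main(String[] args) {
           System.out.println("caching reuses: " + reusesBuffer(new CachingSupplier()));
       }
   }
   ```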
   
   Given that the snappy recycler implementation is currently acceptable and that zstd-jni uses a very similar one, I propose we lean on zstd-jni's default recycler for the time being (to guarantee buffer recycling regardless of the type of BufferSupplier it receives), and only revisit using the BufferSupplier in zstd after https://github.com/apache/kafka/pull/9229 or something like it is merged. I agree that is the best approach in the long term, but I don't want a half-fix when we can have a fix here that gets us 98% of the way there. I am eager to get a fast zstd implementation in, since we have very promising results but had to stop using it because of the negative performance impact.
   
   Additionally there are some open questions (mostly beyond the scope of this 
bugfix):
   * The wrapForOutput path does not have a BufferSupplier, so where does it come from: a static constant/singleton, somewhere else, breaking the API, or do we just lean on the default zstd-jni recycler?
   * Creating a wrapper around BufferSupplier that implements zstd-jni's BufferPool means importing BufferPool, but checkstyle prohibits this. Would we be allowed to import it? I see the jpountz lz4 classes are allowed.
   * Do we want to do the recycling for snappy? SnappyOutputStream seems to allow passing in a BufferAllocatorFactory implementation, which may be interoperable with BufferSupplier. However, this may be moot due to my first point. SnappyInputStream does not seem to accept any recyclers. There are SnappyFramedInputStream and SnappyFramedOutputStream, which seem to accept another snappy-specific recycler, "BufferPool". Maybe those could be promising?
   * Do we want to do the recycling for GZIP? GZIP seems like a lost cause since the buffer handling is hidden inside the standard library, unless it is ported over like the LZ4 streams.
   * Do we want to create versions of BufferedInputStream and 
BufferedOutputStream which use BufferSupplier?
   * Does any of this work interfere with any future work related to passing in 
compressor configuration parameters?
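   On the second bullet: the adapter itself should be nearly mechanical, since zstd-jni's BufferPool and Kafka's BufferSupplier expose essentially the same two methods. A rough sketch follows; both interfaces are stubbed locally here for illustration (the real ones live in com.github.luben.zstd and org.apache.kafka.common.utils, so verify the actual signatures before relying on this):

   ```java
   import java.nio.ByteBuffer;

   // Sketch of the BufferSupplier -> BufferPool wrapper discussed above.
   // Both interfaces are local stand-ins, not the real library types.
   public class ZstdBufferPoolAdapter {
       interface BufferSupplier {       // stand-in for Kafka's BufferSupplier
           ByteBuffer get(int capacity);
           void release(ByteBuffer buffer);
       }

       interface BufferPool {           // stand-in for zstd-jni's BufferPool
           ByteBuffer get(int capacity);
           void release(ByteBuffer buffer);
       }

       // Wrap a BufferSupplier so zstd-jni streams could recycle through it;
       // every call simply delegates to the underlying supplier.
       static BufferPool asBufferPool(BufferSupplier supplier) {
           return new BufferPool() {
               @Override
               public ByteBuffer get(int capacity) {
                   return supplier.get(capacity);
               }

               @Override
               public void release(ByteBuffer buffer) {
                   supplier.release(buffer);
               }
           };
       }
   }
   ```

   If checkstyle can be relaxed for the zstd-jni import the way it is for jpountz lz4, this wrapper is the only new code needed on the input path.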


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

