[I] [Enhancement] Optimize Netty thread count and warmMappedFile OSR Problem [rocketmq]

via GitHub Wed, 04 Jun 2025 17:00:21 -0700


AbcCreatorDeep opened a new issue, #9449:
URL: https://github.com/apache/rocketmq/issues/9449


   ### Before Creating the Enhancement Request
   
   - [x] I have confirmed that this should be classified as an enhancement 
rather than a bug/feature.
   
   
   ### Summary
   
   In Netty, there are two EventExecutorChoosers, GenericEventExecutorChooser 
and PowerOfTwoEventExecutorChooser. When the number of threads is set to 3, 
GenericEventExecutorChooser will be used, which will cause performance to 
decrease as the number of calls increases，This is because the delay of the AND 
instruction is 1 nanosecond, while the delay of IDIV is 60-80 nanoseconds (as 
described in the Intel Performance Optimization Manual).
   
   At the same time, in warmMapedFile, there may be a problem where the C2 
compiler optimizes the for loop to CountedLoop during OSR, which makes it 
impossible to detect thread safety points and prevents VMthread from STW. 
Therefore, the for loop value is modified to long, and a strong conversion is 
performed during internal put to avoid this issue. Only by breaking the pre 
detection condition of CountedLoop: For requires triple detection, it can meet 
the requirements
   
   ### Motivation
   
   1. Removing the IDIV instruction significantly improves Netty's speed in 
selecting the next thread
   2. Remove the 'for' loop from warmAppedFile and set the number of loops to 
long to avoid forced conversion
   
   The following code is a test on the x86 platform, and the test results show 
that as the number of cycles increases, the performance significantly decreases：
   
   public class PowerOfTwoTest {
       public static Object[] powerOfTwoObjects = new Object[16];
       public static Object[] nonPowerOfTwoObjects = new Object[15];
       public static long powerOfTwoObjectsIdx;
       public static long nonPowerOfTwoObjectsIdx;
       public static long loopCount = 9999999999L;
   
       public static final int powerOfTwoObjectsMask = powerOfTwoObjects.length 
- 1;
   
       public static Object testPowerOfTwo() {
           return powerOfTwoObjects[(int) powerOfTwoObjectsIdx++ & 
powerOfTwoObjectsMask];
       }
   
       public static Object testNonPowerOfTwoObjects() {
           return nonPowerOfTwoObjects[(int) Math.abs(nonPowerOfTwoObjectsIdx++ 
% nonPowerOfTwoObjects.length)];
       }
   
   
       public static void main(String[] args) {
           MyUtils.Watch watch = MyUtils.newWatch();
           watch.start();
           for (long i = 0; i < loopCount; i++) {
               testPowerOfTwo();
           }
           watch.stop();
           watch.printUseTime();
   
   
           watch.start();
           for (long i = 0; i < loopCount; i++) {
               testNonPowerOfTwoObjects();
           }
           watch.stop();
           watch.printUseTime();
       }
   }
   
   result： 
   4312ms
   19811ms
   
   For the OSR issue, I have extracted it into the following code, which can be 
directly tested and reproduced：
   
   public class OSRDemo {
       static int counter_1;
   
   
       public static void main(String[] var0) throws Exception {
           System.out.println("main start");
   
           startBusinessThread();
           startProblemThread();
   
           Thread.sleep(1000L);
   
           System.out.println("start gc");
           System.gc(); // Trigger STW
           System.out.println("main end");
       }
   
       // Possible reason for not detecting thread safety points due to OSR
       public static void startProblemThread() {
           new Thread(() -> {
               // Here, you can separate int i and j, or set them to short to 
complete semantics
               for (int i = 0; i < 8888888; i++) {
                   for (int j = 0; j < 8888888; j++) {
                       counter_1 += counter_1 % 3 % 33 % 3333 % 33333;
                       counter_1 += counter_1 % 3 % 33 % 3333 % 33333;
                   }
               }
   
           }).start();
       }
   
       public static void startBusinessThread() {
           (new Thread(() -> {
               System.out.println("Business Thread - 1 start");
   
               while (true) {
                   System.out.println("Execute Business 1");
   
                   try {
                       Thread.sleep(1000L);
                   } catch (InterruptedException e) {
                       e.printStackTrace();
                   }
               }
           })).start();
   
           (new Thread(() -> {
               System.out.println("Business Thread - 2 start");
   
               while (true) {
                   System.out.println("Execute Business 2");
   
                   try {
                       Thread.sleep(1000L);
                   } catch (InterruptedException e) {
                       e.printStackTrace();
                   }
               }
           })).start();
       }
   }
   
   
   
   
   The following code is used by the C2 compiler to optimize OSR and determine 
whether to generate thread safety points. In the first judgment, variables can 
be extracted to avoid being optimized as CountedLoop and removing the code for 
detecting safety points
   
   bool PhaseIdealLoop::is_counted_loop( Node *x, IdealLoopTree *loop ) { 
     if (x->in(LoopNode::Self) == NULL || x->req() != 3 || loop->_irreducible) 
{ // Extracting the initialization list of for variables will break the second 
condition
       return false;
     }
     // omit...
     if (cmp_op != Op_CmpI)  // Setting it as' long 'will break the conditions 
here, but it will also disrupt the original computational semantics and avoid 
strong conversions, resulting in performance loss
       return false;        
   }
   
   ### Describe the Solution You'd Like
   
   // Removing the IDIV instruction 
   public class NettyServerConfig implements Cloneable {
           private int serverSelectorThreads = 4;
   }
   
   
   
   // Remove the 'for' loop from warmAppedFile
   public void warmMappedFile(FlushDiskType type, int pages) {
           this.mappedByteBufferAccessCountSinceLastSwap++;
   
           long beginTime = System.currentTimeMillis();
           ByteBuffer byteBuffer = this.mappedByteBuffer.slice();
           long flush = 0;
           int i = 0, j = 0; // Breaking the triple check of C2 compiler 
CountedLoop optimization to avoid thread safety issues，reference JVM source 
code：PhaseIdealLoop::is_counted_loop
           for (; i < this.fileSize; i += DefaultMappedFile.OS_PAGE_SIZE, j++) {
               byteBuffer.put(i, (byte) 0);
               // force flush when flush disk type is sync
               if (type == FlushDiskType.SYNC_FLUSH) {
                   if ((i / OS_PAGE_SIZE) - (flush / OS_PAGE_SIZE) >= pages) {
                       flush = i;
                       mappedByteBuffer.force();
                   }
               }
           }
   }
   
   ### Describe Alternatives You've Considered
   
   From the source code, it can be seen that only when the comparison value is 
int, it will be optimized to CountedLoop. If the number of loops is within the 
range of short, the initialization list parameter of for can also be modified 
to short type：
   
       for (short i = 0, j = 0; i < this.fileSize; i += 
DefaultMappedFile.OS_PAGE_SIZE, j++) {
               byteBuffer.put((int) i, (byte) 0);
               // force flush when flush disk type is sync
               if (type == FlushDiskType.SYNC_FLUSH) {
                   if ((i / OS_PAGE_SIZE) - (flush / OS_PAGE_SIZE) >= pages) {
                       flush = i;
                       mappedByteBuffer.force();
                   }
               }
           }
   
   
   
   bool PhaseIdealLoop::is_counted_loop( Node *x, IdealLoopTree *loop ) { 
     if (x->in(LoopNode::Self) == NULL || x->req() != 3 || loop->_irreducible) 
{ // Extracting the initialization list of for variables will break the second 
condition
       return false;
     }
     // omit...
     if (cmp_op != Op_CmpI)  // Setting it as' long 'will break the conditions 
here, but it will also disrupt the original computational semantics and avoid 
strong conversions, resulting in performance loss
       return false;        
   }
   
   ### Additional Context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[I] [Enhancement] Optimize Netty thread count and warmMappedFile OSR Problem [rocketmq]

Reply via email to