Re: RFR: 8273322: Enhance macro logic optimization for masked logic operations. [v3]
On Tue, 4 Jan 2022 15:11:47 GMT, Jatin Bhateja wrote:

>> Patch extends the existing macrologic inferencing algorithm to handle masked logic operations.
>>
>> Existing algorithm:
>>
>> 1. Identify logic cone roots.
>> 2. Pack parent and logic child nodes into a MacroLogic node in a bottom-up traversal if the input constraints are met, i.e. the maximum number of inputs which a macro logic node can have.
>> 3. Perform symbolic evaluation of the logic expression tree by assigning a value corresponding to a truth table column to each input.
>> 4. Inputs along with the encoded function together represent a macro logic node which mimics a truth table.
>>
>> Modification:
>> Extended the packing algorithm to operate on both predicated and non-predicated logic nodes. The following rules define the criteria under which nodes get packed into a macro logic node:
>>
>> 1. Parent and both child nodes are all unmasked or masked with the same predicates.
>> 2. A masked parent can be packed with its left child if the child is predicated and both have the same predicates.
>> 3. A masked parent can be packed with its right child if the child is un-predicated or has a matching predication condition.
>> 4. An unmasked parent can be packed with an unmasked child.
>>
>> A new jtreg test case added with the patch exhaustively covers all the different combinations of predications of parent and child nodes.
>>
>> Following are the performance numbers for the JMH benchmark included with the patch.
>>
>> Machine Configuration: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S Icelake Server)
>>
>> Benchmark | ARRAYLEN | Baseline (ops/s) | Withopt (ops/s) | Gain (withopt/baseline)
>> -- | -- | -- | -- | --
>> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 64 | 2365.421 | 5136.283 | 2.171403315
>> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 128 | 2034.1 | 4073.381 | 2.002547072
>> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 256 | 1568.694 | 2811.975 | 1.792558013
>> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 512 | 883.261 | 1662.771 | 1.882536419
>> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 1024 | 469.513 | 732.81 | 1.560787454
>> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 64 | 273.049 | 552.106 | 2.022003377
>> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 128 | 219.624 | 359.775 | 1.63814064
>> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 256 | 131.649 | 182.23 | 1.384211046
>> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 512 | 71.452 | 81.522 | 1.140933774
>> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 1024 | 37.427 | 41.966 | 1.121276084
>> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 64 | 2805.759 | 3383.16 | 1.205791374
>> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 128 | 2069.012 | 2250.37 | 1.087654397
>> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 256 | 1098.766 | 1101.996 | 1.002939661
>> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 512 | 470.035 | 484.732 | 1.031267884
>> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 1024 | 202.827 | 209.073 | 1.030794717
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt128 | 256 | 3435.989 | 4418.09 | 1.285827749
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt128 | 512 | 1524.803 | 1678.201 | 1.100601848
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt128 | 1024 | 972.501 | 1166.734 | 1.199725244
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt256 | 256 | 5980.85 | 7584.17 | 1.268075608
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt256 | 512 | 3258.108 | 3939.23 | 1.209054457
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt256 | 1024 | 1475.365 | 1511.159 | 1.024261115
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt512 | 256 | 4208.766 | 4220.678 | 1.002830283
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt512 | 512 | 2056.651 | 2049.489 | 0.99651764
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt512 | 1024 | 1110.461 | 1116.448 | 1.005391455
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong256 | 256 | 3259.348 | 3947.94 | 1.211266793
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong256 | 512 | 1515.147 | 1536.647 | 1.014190042
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong256 | 1024 | 911.58 | 1030.54 | 1.130498695
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong512 | 256 | 2034.611 | 2073.764 | 1.019243482
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong512 | 512 | 1110.659 | 1116.093 | 1.004892591
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong512 | 1024 | 559.269 | 559.651 | 1.000683034
>> o.o.b.jdk.incubator.vector.MaskedLogicOpt
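
Step 3 of the existing algorithm (symbolic evaluation against truth table columns) can be illustrated with a small standalone sketch. This is not the HotSpot code. It assumes the x86 VPTERNLOG convention, in which the three inputs are given the column constants 0xF0, 0xCC and 0xAA; the class and method names are hypothetical and chosen only for illustration.

```java
public class TruthTableSketch {
    // Truth table columns for a three-input function (assumed VPTERNLOG
    // convention): bit i of each constant is the value of that input in
    // row i of the 8-row truth table.
    static final int A = 0xF0;
    static final int B = 0xCC;
    static final int C = 0xAA;

    // Logic operators restricted to the 8 truth table rows.
    static int and(int x, int y) { return (x & y) & 0xFF; }
    static int or(int x, int y)  { return (x | y) & 0xFF; }
    static int xor(int x, int y) { return (x ^ y) & 0xFF; }
    static int not(int x)        { return ~x & 0xFF; }

    // Bitwise blend: (a & b) | (~a & c). Evaluating the expression tree
    // on the column constants yields the encoded function directly.
    static int blendImm() { return or(and(A, B), and(not(A), C)); }

    // Three-way xor: a ^ b ^ c.
    static int xor3Imm() { return xor(xor(A, B), C); }

    public static void main(String[] args) {
        System.out.printf("blend: 0x%02X%n", blendImm()); // blend: 0xCA
        System.out.printf("xor3:  0x%02X%n", xor3Imm());  // xor3:  0x96
    }
}
```

The resulting 8-bit value together with the three inputs is what step 4 calls a macro logic node: the whole expression tree collapses into one table lookup per bit.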
Re: RFR: 8273322: Enhance macro logic optimization for masked logic operations. [v3]
> Patch extends the existing macrologic inferencing algorithm to handle masked logic operations.
>
> Existing algorithm:
>
> 1. Identify logic cone roots.
> 2. Pack parent and logic child nodes into a MacroLogic node in a bottom-up traversal if the input constraints are met, i.e. the maximum number of inputs which a macro logic node can have.
> 3. Perform symbolic evaluation of the logic expression tree by assigning a value corresponding to a truth table column to each input.
> 4. Inputs along with the encoded function together represent a macro logic node which mimics a truth table.
>
> Modification:
> Extended the packing algorithm to operate on both predicated and non-predicated logic nodes. The following rules define the criteria under which nodes get packed into a macro logic node:
>
> 1. Parent and both child nodes are all unmasked or masked with the same predicates.
> 2. A masked parent can be packed with its left child if the child is predicated and both have the same predicates.
> 3. A masked parent can be packed with its right child if the child is un-predicated or has a matching predication condition.
> 4. An unmasked parent can be packed with an unmasked child.
>
> A new jtreg test case added with the patch exhaustively covers all the different combinations of predications of parent and child nodes.
>
> Following are the performance numbers for the JMH benchmark included with the patch.
>
> Machine Configuration: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S Icelake Server)
>
> Benchmark | ARRAYLEN | Baseline (ops/s) | Withopt (ops/s) | Gain (withopt/baseline)
> -- | -- | -- | -- | --
> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 64 | 2365.421 | 5136.283 | 2.171403315
> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 128 | 2034.1 | 4073.381 | 2.002547072
> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 256 | 1568.694 | 2811.975 | 1.792558013
> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 512 | 883.261 | 1662.771 | 1.882536419
> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 1024 | 469.513 | 732.81 | 1.560787454
> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 64 | 273.049 | 552.106 | 2.022003377
> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 128 | 219.624 | 359.775 | 1.63814064
> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 256 | 131.649 | 182.23 | 1.384211046
> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 512 | 71.452 | 81.522 | 1.140933774
> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 1024 | 37.427 | 41.966 | 1.121276084
> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 64 | 2805.759 | 3383.16 | 1.205791374
> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 128 | 2069.012 | 2250.37 | 1.087654397
> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 256 | 1098.766 | 1101.996 | 1.002939661
> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 512 | 470.035 | 484.732 | 1.031267884
> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 1024 | 202.827 | 209.073 | 1.030794717
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt128 | 256 | 3435.989 | 4418.09 | 1.285827749
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt128 | 512 | 1524.803 | 1678.201 | 1.100601848
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt128 | 1024 | 972.501 | 1166.734 | 1.199725244
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt256 | 256 | 5980.85 | 7584.17 | 1.268075608
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt256 | 512 | 3258.108 | 3939.23 | 1.209054457
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt256 | 1024 | 1475.365 | 1511.159 | 1.024261115
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt512 | 256 | 4208.766 | 4220.678 | 1.002830283
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt512 | 512 | 2056.651 | 2049.489 | 0.99651764
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt512 | 1024 | 1110.461 | 1116.448 | 1.005391455
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong256 | 256 | 3259.348 | 3947.94 | 1.211266793
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong256 | 512 | 1515.147 | 1536.647 | 1.014190042
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong256 | 1024 | 911.58 | 1030.54 | 1.130498695
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong512 | 256 | 2034.611 | 2073.764 | 1.019243482
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong512 | 512 | 1110.659 | 1116.093 | 1.004892591
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong512 | 1024 | 559.269 | 559.651 | 1.000683034
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsInt128 | 256 | 3636.141 | 4446.505 | 1.222863745
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsInt128 | 512 |
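
The four packing rules in the description can be read as simple predicates over the masks of a parent and its children. The sketch below is one possible reading of those rules, not the patch itself. A mask is modelled as a string identifier with `null` meaning unmasked, and all class and method names are hypothetical.

```java
import java.util.Objects;

public class PackingRulesSketch {
    // Rule 1: parent and both children are all unmasked, or all masked
    // with the same predicate (null stands for "unmasked" here).
    static boolean allSamePredication(String parent, String left, String right) {
        return Objects.equals(parent, left) && Objects.equals(parent, right);
    }

    // Rule 4 for an unmasked parent, rule 2 for a masked parent:
    // the left child must be predicated with the same mask.
    static boolean canPackLeft(String parentMask, String childMask) {
        if (parentMask == null) {
            return childMask == null;
        }
        return parentMask.equals(childMask);
    }

    // Rule 4 for an unmasked parent, rule 3 for a masked parent:
    // the right child may be unmasked or carry the same mask.
    static boolean canPackRight(String parentMask, String childMask) {
        if (parentMask == null) {
            return childMask == null;
        }
        return childMask == null || parentMask.equals(childMask);
    }

    public static void main(String[] args) {
        System.out.println(canPackLeft("k1", null));   // masked parent, unmasked left child
        System.out.println(canPackRight("k1", null));  // masked parent, unmasked right child
        System.out.println(canPackRight("k1", "k2"));  // mismatched predicates
    }
}
```

Under this reading the only asymmetry between the two children is that an unmasked right child may still be packed under a masked parent, while an unmasked left child may not.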