Re: RFR: 8273322: Enhance macro logic optimization for masked logic operations. [v4]

2022-01-06 Thread Jatin Bhateja
On Thu, 6 Jan 2022 17:39:20 GMT, Sandhya Viswanathan wrote:

>> Jatin Bhateja has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   8273322: Review comments resolution.
>
> test/hotspot/jtreg/compiler/vectorapi/TestMaskedMacroLogicVector.java line 26:
> 
>> 24: /**
>> 25:  * @test
>> 26:  * @bug 8273322
> 
> Needs @key randomness, as we use random numbers without a fixed seed here.
> Please see:
> https://openjdk.java.net/jtreg/faq.html#when-should-i-use-the-intermittent-or-randomness-keyword-in-a-test

DONE
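
For readers unfamiliar with the keyword, the following is a minimal sketch of the pattern the review comment refers to: a test that draws random inputs declares `@key randomness` and reports its seed so a failing run can be replayed. The class name, the `sketch.seed` property and the helper methods are hypothetical and are not taken from the patch.

```java
/*
 * @test
 * @bug 8273322
 * @key randomness
 */
import java.util.Random;

final class SeededRandomSketch {
    // Hypothetical property name used to replay a previous run.
    static final String SEED_PROPERTY = "sketch.seed";

    // Uses the seed given on the command line, or picks a fresh one.
    static long chooseSeed() {
        String s = System.getProperty(SEED_PROPERTY);
        return (s != null) ? Long.parseLong(s) : new Random().nextLong();
    }

    // Generates test input that depends only on the seed.
    static int[] randomInts(long seed, int n) {
        Random r = new Random(seed);
        int[] a = new int[n];
        for (int i = 0; i < n; i++) {
            a[i] = r.nextInt();
        }
        return a;
    }

    public static void main(String[] args) {
        long seed = chooseSeed();
        // Printing the seed is what makes an intermittent failure reproducible.
        System.out.println("seed = " + seed);
        int[] data = randomInts(seed, 8);
        System.out.println("generated " + data.length + " values");
    }
}
```

Running with `-Dsketch.seed=<value>` would regenerate the same input in this sketch.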

-

PR: https://git.openjdk.java.net/jdk/pull/6893


Re: RFR: 8273322: Enhance macro logic optimization for masked logic operations. [v4]

2022-01-06 Thread Sandhya Viswanathan
On Wed, 5 Jan 2022 08:59:00 GMT, Jatin Bhateja wrote:

>> This patch extends the existing macro logic inferencing algorithm to handle
>> masked logic operations.
>> 
>> Existing algorithm:
>> 
>> 1. Identify logic cone roots.
>> 2. Pack the parent and its logic child nodes into a MacroLogic node in a
>> bottom-up traversal if the input constraint is met, i.e. the maximum number
>> of inputs a MacroLogic node can have.
>> 3. Perform symbolic evaluation of the logic expression tree by assigning
>> each input the value corresponding to a truth table column.
>> 4. The inputs, together with the encoded function, represent a MacroLogic
>> node, which mimics a truth table.
>> 
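
Steps 3 and 4 can be illustrated with a small sketch. It assumes the column constants 0xF0, 0xCC and 0xAA commonly used for three-input ternary logic immediates; the class, the method names and the example expression `(a & b) ^ c` are hypothetical and are not taken from the patch.

```java
final class TruthTableSketch {
    // Truth-table columns of a 3-input function. Bit r of each constant is the
    // value that input takes in row r, where r = (a << 2) | (b << 1) | c.
    static final int A = 0xF0;
    static final int B = 0xCC;
    static final int C = 0xAA;

    // Symbolic evaluation: run the expression once on the column constants.
    // The low 8 bits of the result are the truth table of the whole expression.
    static int encodeAndThenXor() {
        return ((A & B) ^ C) & 0xFF;
    }

    // Evaluates an encoded function for one concrete assignment of input bits.
    static int lookup(int func, int a, int b, int c) {
        int row = (a << 2) | (b << 1) | c;
        return (func >>> row) & 1;
    }

    public static void main(String[] args) {
        int func = encodeAndThenXor();
        System.out.printf("func = 0x%02X%n", func);
        // Compare the table lookup with direct evaluation for all 8 rows.
        for (int row = 0; row < 8; row++) {
            int a = (row >> 2) & 1, b = (row >> 1) & 1, c = row & 1;
            int direct = (a & b) ^ c;
            System.out.printf("a=%d b=%d c=%d -> %d (direct %d)%n",
                    a, b, c, lookup(func, a, b, c), direct);
        }
    }
}
```

The three inputs plus the 8-bit function code are all that is needed to reproduce the original expression, which is why a whole logic cone can collapse into one node.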
>> Modification:
>> The packing algorithm is extended to operate on both predicated and
>> non-predicated logic nodes. The following rules define the criteria under
>> which nodes get packed into a MacroLogic node:
>> 
>> 1. The parent and both child nodes are all unmasked, or all masked with the
>> same predicate.
>> 2. A masked parent can be packed with its left child if the child is
>> predicated and both have the same predicate.
>> 3. A masked parent can be packed with its right child if the child is
>> un-predicated or has a matching predication condition.
>> 4. An unmasked parent can be packed with an unmasked child.
>> 
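
One possible reading of the four rules, written as predicates, is sketched here. The `Node` type, the use of a string to identify a predicate, and the assumption that any combination not listed in the rules is rejected are all illustrative choices; the actual checks live in the C2 compiler and may differ.

```java
final class PackingRulesSketch {
    // Hypothetical logic node; a null mask means the node is unmasked.
    static final class Node {
        final String mask;
        Node(String mask) { this.mask = mask; }
        boolean isMasked() { return mask != null; }
    }

    // True when both nodes are masked and governed by the same predicate.
    static boolean sameMask(Node x, Node y) {
        return x.isMasked() && y.isMasked() && x.mask.equals(y.mask);
    }

    // Rule 1: all three unmasked, or all three masked with one predicate.
    static boolean canPackBoth(Node parent, Node left, Node right) {
        boolean allUnmasked = !parent.isMasked() && !left.isMasked() && !right.isMasked();
        boolean allSameMask = sameMask(parent, left) && sameMask(parent, right);
        return allUnmasked || allSameMask;
    }

    // Rules 2 and 4: a masked parent needs a left child with the same
    // predicate; an unmasked parent needs an unmasked child.
    static boolean canPackLeft(Node parent, Node left) {
        return parent.isMasked() ? sameMask(parent, left) : !left.isMasked();
    }

    // Rules 3 and 4: a masked parent accepts an unmasked right child or one
    // with the same predicate; an unmasked parent needs an unmasked child.
    static boolean canPackRight(Node parent, Node right) {
        if (parent.isMasked()) {
            return !right.isMasked() || sameMask(parent, right);
        }
        return !right.isMasked();
    }

    public static void main(String[] args) {
        Node maskedK1 = new Node("k1");
        Node unmasked = new Node(null);
        System.out.println(canPackLeft(maskedK1, unmasked));
        System.out.println(canPackRight(maskedK1, unmasked));
    }
}
```

Under this reading the left and right children are treated asymmetrically: only the right child of a masked parent may be unmasked.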
>> The new jtreg test case added with the patch exhaustively covers all
>> combinations of predication of the parent and child nodes.
>> 
>> The following are the performance numbers for the JMH benchmarks included
>> with the patch.
>> 
>> Machine configuration: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S
>> Icelake Server)
>> 
>> Benchmark | ARRAYLEN | Baseline (ops/s) | Withopt (ops/s) | Gain (withopt/baseline)
>> -- | -- | -- | -- | --
>> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 64 | 2365.421 | 5136.283 | 2.171403315
>> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 128 | 2034.1 | 4073.381 | 2.002547072
>> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 256 | 1568.694 | 2811.975 | 1.792558013
>> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 512 | 883.261 | 1662.771 | 1.882536419
>> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 1024 | 469.513 | 732.81 | 1.560787454
>> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 64 | 273.049 | 552.106 | 2.022003377
>> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 128 | 219.624 | 359.775 | 1.63814064
>> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 256 | 131.649 | 182.23 | 1.384211046
>> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 512 | 71.452 | 81.522 | 1.140933774
>> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 1024 | 37.427 | 41.966 | 1.121276084
>> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 64 | 2805.759 | 3383.16 | 1.205791374
>> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 128 | 2069.012 | 2250.37 | 1.087654397
>> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 256 | 1098.766 | 1101.996 | 1.002939661
>> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 512 | 470.035 | 484.732 | 1.031267884
>> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 1024 | 202.827 | 209.073 | 1.030794717
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt128 | 256 | 3435.989 | 4418.09 | 1.285827749
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt128 | 512 | 1524.803 | 1678.201 | 1.100601848
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt128 | 1024 | 972.501 | 1166.734 | 1.199725244
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt256 | 256 | 5980.85 | 7584.17 | 1.268075608
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt256 | 512 | 3258.108 | 3939.23 | 1.209054457
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt256 | 1024 | 1475.365 | 1511.159 | 1.024261115
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt512 | 256 | 4208.766 | 4220.678 | 1.002830283
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt512 | 512 | 2056.651 | 2049.489 | 0.99651764
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt512 | 1024 | 1110.461 | 1116.448 | 1.005391455
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong256 | 256 | 3259.348 | 3947.94 | 1.211266793
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong256 | 512 | 1515.147 | 1536.647 | 1.014190042
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong256 | 1024 | 911.58 | 1030.54 | 1.130498695
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong512 | 256 | 2034.611 | 2073.764 | 1.019243482
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong512 | 512 | 1110.659 | 1116.093 | 1.004892591
>> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong512 | 1024 | 559.269 | 559.651 | 1.000683034
>> 

Re: RFR: 8273322: Enhance macro logic optimization for masked logic operations. [v4]

2022-01-05 Thread Vladimir Kozlov
On Wed, 5 Jan 2022 08:59:00 GMT, Jatin Bhateja wrote:


Re: RFR: 8273322: Enhance macro logic optimization for masked logic operations. [v4]

2022-01-05 Thread Jatin Bhateja
> This patch extends the existing macro logic inferencing algorithm to handle
> masked logic operations.
> 
> Existing algorithm:
> 
> 1. Identify logic cone roots.
> 2. Pack the parent and its logic child nodes into a MacroLogic node in a
> bottom-up traversal if the input constraint is met, i.e. the maximum number
> of inputs a MacroLogic node can have.
> 3. Perform symbolic evaluation of the logic expression tree by assigning
> each input the value corresponding to a truth table column.
> 4. The inputs, together with the encoded function, represent a MacroLogic
> node, which mimics a truth table.
> 
> Modification:
> The packing algorithm is extended to operate on both predicated and
> non-predicated logic nodes. The following rules define the criteria under
> which nodes get packed into a MacroLogic node:
> 
> 1. The parent and both child nodes are all unmasked, or all masked with the
> same predicate.
> 2. A masked parent can be packed with its left child if the child is
> predicated and both have the same predicate.
> 3. A masked parent can be packed with its right child if the child is
> un-predicated or has a matching predication condition.
> 4. An unmasked parent can be packed with an unmasked child.
> 
> The new jtreg test case added with the patch exhaustively covers all
> combinations of predication of the parent and child nodes.
> 
> The following are the performance numbers for the JMH benchmarks included
> with the patch.
> 
> Machine configuration: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S
> Icelake Server)
> 
> Benchmark | ARRAYLEN | Baseline (ops/s) | Withopt (ops/s) | Gain (withopt/baseline)
> -- | -- | -- | -- | --
> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 64 | 2365.421 | 5136.283 | 2.171403315
> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 128 | 2034.1 | 4073.381 | 2.002547072
> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 256 | 1568.694 | 2811.975 | 1.792558013
> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 512 | 883.261 | 1662.771 | 1.882536419
> o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 1024 | 469.513 | 732.81 | 1.560787454
> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 64 | 273.049 | 552.106 | 2.022003377
> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 128 | 219.624 | 359.775 | 1.63814064
> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 256 | 131.649 | 182.23 | 1.384211046
> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 512 | 71.452 | 81.522 | 1.140933774
> o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 1024 | 37.427 | 41.966 | 1.121276084
> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 64 | 2805.759 | 3383.16 | 1.205791374
> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 128 | 2069.012 | 2250.37 | 1.087654397
> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 256 | 1098.766 | 1101.996 | 1.002939661
> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 512 | 470.035 | 484.732 | 1.031267884
> o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 1024 | 202.827 | 209.073 | 1.030794717
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt128 | 256 | 3435.989 | 4418.09 | 1.285827749
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt128 | 512 | 1524.803 | 1678.201 | 1.100601848
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt128 | 1024 | 972.501 | 1166.734 | 1.199725244
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt256 | 256 | 5980.85 | 7584.17 | 1.268075608
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt256 | 512 | 3258.108 | 3939.23 | 1.209054457
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt256 | 1024 | 1475.365 | 1511.159 | 1.024261115
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt512 | 256 | 4208.766 | 4220.678 | 1.002830283
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt512 | 512 | 2056.651 | 2049.489 | 0.99651764
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt512 | 1024 | 1110.461 | 1116.448 | 1.005391455
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong256 | 256 | 3259.348 | 3947.94 | 1.211266793
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong256 | 512 | 1515.147 | 1536.647 | 1.014190042
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong256 | 1024 | 911.58 | 1030.54 | 1.130498695
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong512 | 256 | 2034.611 | 2073.764 | 1.019243482
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong512 | 512 | 1110.659 | 1116.093 | 1.004892591
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong512 | 1024 | 559.269 | 559.651 | 1.000683034
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsInt128 | 256 | 3636.141 | 4446.505 | 1.222863745
> o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsInt128 | 512 
> |