On Mon, 8 Jan 2024 10:20:33 GMT, Quan Anh Mai <qa...@openjdk.org> wrote:

>>> Thanks for the updates!
>>> 
>>> One more idea: Your AVX2 solution has a lot of cost for converting the mask 
>>> to a permutation. Might it make sense to split this off into a separate 
>>> vector-node, so that it can float out of a loop if the mask is invariant?
>> 
>> CompressV / ExpandV only accept two inputs: the vector to be operated on and
>> the mask under which the operation is performed. The permute-table-based
>> implementation is specific to the x86 backend.
>
> @jatin-bhateja I think you can expand them in the matcher into several 
> `MachNode`s that will get scheduled separately.

> Exactly, like @merykitty suggests: you can do a platform-dependent expansion.

Hi @merykitty, @eme64, in principle platform-specific lowering is a good idea
wherever it is useful. Our main goal here would be to recognize a loop-invariant
constant mask in the matcher patterns and avoid re-loading the permutation from
the permute table on every iteration. Existing loop-invariant analysis already
moves invariant masks out of the loop, and GCM should then be able to hoist the
expanded load from the permute table out of the loop as well.
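To make the cost under discussion concrete, here is a rough scalar Java model of a
permute-table compress. It assumes 8 int lanes (one 256-bit AVX2 register). The
names `PermuteTableCompress`, `PERM_TABLE`, `maskToPermutation` and
`applyPermutation` are hypothetical, and this is only a sketch of the idea, not
C2's actual x86 code. The mask-to-permutation step (movemask plus table load) is
the part that could be hoisted out of a loop when the mask is invariant.

```java
import java.util.Arrays;

// Hypothetical scalar model of a table-driven compress for 8 int lanes.
// This is NOT the JDK/C2 implementation, only an illustration of the technique.
public class PermuteTableCompress {
    static final int LANES = 8;

    // PERM_TABLE[m] gives, for the 8-bit mask m, the source lane of each output lane:
    // indices of set bits packed to the front, remaining slots marked -1 (zeroed).
    static final int[][] PERM_TABLE = buildTable();

    static int[][] buildTable() {
        int[][] table = new int[1 << LANES][LANES];
        for (int m = 0; m < (1 << LANES); m++) {
            int out = 0;
            for (int lane = 0; lane < LANES; lane++) {
                if ((m & (1 << lane)) != 0) {
                    table[m][out++] = lane;
                }
            }
            while (out < LANES) {
                table[m][out++] = -1;
            }
        }
        return table;
    }

    // Step 1: mask -> permutation (models a movemask followed by a table load).
    static int[] maskToPermutation(boolean[] mask) {
        int bits = 0;
        for (int lane = 0; lane < LANES; lane++) {
            if (mask[lane]) {
                bits |= 1 << lane;
            }
        }
        return PERM_TABLE[bits];
    }

    // Step 2: apply the permutation (models a lane permute); -1 slots become zero.
    static int[] applyPermutation(int[] src, int[] perm) {
        int[] dst = new int[LANES];
        for (int i = 0; i < LANES; i++) {
            dst[i] = perm[i] < 0 ? 0 : src[perm[i]];
        }
        return dst;
    }

    public static void main(String[] args) {
        boolean[] mask = {true, false, true, false, false, true, false, false};
        // Loop-invariant mask: the table lookup is done once, outside the loop,
        // which is what hoisting the permute-table load would achieve.
        int[] perm = maskToPermutation(mask);
        int[][] data = {
            {10, 11, 12, 13, 14, 15, 16, 17},
            {20, 21, 22, 23, 24, 25, 26, 27},
        };
        for (int[] row : data) {
            // Only the permute itself remains inside the loop.
            System.out.println(Arrays.toString(applyPermutation(row, perm)));
        }
    }
}
```

With lanes 0, 2 and 5 selected, the two rows should print `[10, 12, 15, 0, 0, 0, 0, 0]`
and `[20, 22, 25, 0, 0, 0, 0, 0]`, matching the Vector API `compress` behavior of
packing selected lanes to the front and zeroing the rest.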

But this looks very restrictive and would mainly be useful for a constant one-hot
mask pattern. A constant mask may have more than one set bit, in which case we
would need to generate multiple loads from permute tables and handle multiple
expansion scenarios. I think we can defer that complexity for the time being.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/17261#issuecomment-1882549544
