[ 
https://issues.apache.org/jira/browse/ARROW-5303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16992598#comment-16992598
 ] 

Andy Thomason commented on ARROW-5303:
--------------------------------------

It can be quite daunting. I'm happy to help with understanding the asm. I spent 
seven years teaching it to game programmers! I'm also quite old and grew up in 
a time when you wrote the instructions out by hand in hex.

Matt Godbolt's website (Compiler Explorer) is a godsend (to use a horrible pun).
{code:java}
.LBB0_5:
        vpmovzxbd       ymm0, qword ptr [rdx + rcx]
        vpmovzxbd       ymm1, qword ptr [rdx + rcx + 8]
        vpmovzxbd       ymm2, qword ptr [rdx + rcx + 16]
        vpmovzxbd       ymm3, qword ptr [rdx + rcx + 24]
        vmovdqu ymmword ptr [rdi + 4*rcx], ymm0
        vmovdqu ymmword ptr [rdi + 4*rcx + 32], ymm1
        vmovdqu ymmword ptr [rdi + 4*rcx + 64], ymm2
        vmovdqu ymmword ptr [rdi + 4*rcx + 96], ymm3
        add     rcx, 32
        cmp     rax, rcx
        jne     .LBB0_5
{code}
The first instruction, "vpmovzxbd", loads 8 bytes of u8 data and zero-extends 
each byte, producing 32 bytes of u32 (8 lanes).

The second instruction, "vmovdqu", does an unaligned store of the 32-byte 
register to memory. Note that the load offsets go up by 8 and the store 
offsets by 32.

The last two instructions are just the loop management.

The instructions themselves have almost zero cost, but writing the data out 
through the cache could be very expensive.

The thing to look for here is lots of ymm or zmm registers and counters going 
up in large increments. You don't need to know every instruction, but this kind 
of pattern (four loads, four stores, loop) is about as good as it gets.

The loads occur in groups of four because there is a large latency on every 
instruction. We can start lots of them per cycle but it will take many cycles 
to get the data to RAM. Think of it as a production line with people fetching 
data from a warehouse and putting it on a conveyor belt and then taking it off 
and carrying it to another warehouse.

The conveyor belts can be quite long, but we can put lots of data on the belt 
at the same time.

 

 

> [Rust] Add SIMD vectorization of numeric casts
> ----------------------------------------------
>
>                 Key: ARROW-5303
>                 URL: https://issues.apache.org/jira/browse/ARROW-5303
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust
>    Affects Versions: 0.13.0
>            Reporter: Neville Dipale
>            Priority: Minor
>
> To improve the performance of cast kernels, we need SIMD support in numeric 
> casts.
> An initial exploration shows that we can't trivially add SIMD casts between 
> our Arrow T::Simd types, because `packed_simd` only supports a cast between 
> T::Simd types that have the same number of lanes.
> This means that adding casts from f64 to i64 (same lane length) satisfies the 
> bound trait `where TO::Simd : packed_simd::FromCast<FROM::Simd>`, but f64 to 
> i32 (different lane length) doesn't.
> We would benefit from investigating work-arounds to this limitation. Please 
> see 
> [github::nevi_me::arrow/\{branch:simd-cast}/../kernels/cast.rs|https://github.com/nevi-me/arrow/blob/simd-cast/rust/arrow/src/compute/kernels/cast.rs#L601]
>  for an example implementation that's limited by the differences in lane 
> length.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
