quinnj opened a new pull request, #357:
URL: https://github.com/apache/arrow-julia/pull/357

   Fixes https://github.com/apache/arrow-julia/issues/345.
   
   Alternative to https://github.com/apache/arrow-julia/pull/350.
   
   Whew, a bit of a rabbit hole here. It turns out this issue doesn't have 
anything to do w/ the _arrow_ alignment when writing, but actually goes back to 
a core Julia issue, specifically the inconsistency reported in 
https://github.com/JuliaLang/julia/issues/42326. Essentially, the issue is that 
different platforms (intel vs. arm64) are requiring different alignments for 
types like `UInt128` (8 for intel, 16 for arm64).
   
   And it turns out this isn't even a Julia issue, but all the way back to LLVM.
   
   So why does this crop up in Arrow.jl? Because when you try to 
serialize/deserialize a `Arrow.Primitive{Int128}` array, we're going to write 
it out in the proper arrow format, but when you read it back in, we've been 
using the zero-copy technique of:
   
   ```julia
   unsafe_wrap(Array, Ptr{Int128}(arrow_ptr), len)
   ```
   in order to give the user a Julia `Vector{Int128}` to the underlying arrow 
data.
   
   _BUT_, in Julia when you allocate any `Vector` with an eltype size of less 
than 4, then only 8 bytes of alignment are specified. So the fact that most 
users will pass arrow data as a file (which is mmapped as `Vector{UInt8}`), a 
byte vector directly (`Vector{UInt8}`), or an IO (which we read as 
`Vector{UInt8}`), means these vectors are only 8-byte aligned. This then throws 
the fatal error reported in the original #350 issue about the pointer not being 
16-byte aligned.
   
   So one option to consider is allowing users to pass in a 16-byte aligned 
arrow "source" and then that would just work, right? Well, except Julia doesn't 
expose any way of "upcasting" an array's alignment, so it's purely based off 
the array eltype. Which in turn is a struggle because the current Arrow.jl 
architecture assumes the source will be exactly a `Vector{UInt8}` everywhere. 
So essentially it requires some heruclean effort to try and pass your own 
aligned array; I think the only option I can think of is if you did something 
like:
     * Mmap an arrow file into Vector{UInt8}
     * Allocate your own Vector{UInt32} and copy arrow data into it
     * Use unsafe_wrap to make a Vector{UInt8} of the Vector{UInt32} data
     * But you _MUST_ keep a reference to the original Vector{UInt32} array 
otherwise, your new Vector{UInt8} gets corrupted
     * Pass the new Vector{UInt8} into Arrow.Table to read
   
   Anyway, not easy to say the least.
   
   The proposed solution here is as follows:
     * We check `sizeof(T)` to see if it's > 16 bytes, which we're using as a 
proxy for the alignment, regardless of platform (core devs are leaning towards 
the 8-byte alignment on intel actually being a bug for 16-byte primitives and I 
agree)
     * We use sizeof since the `jl_datatype_t` alignment field isn't part of 
the public Julia API and thus subject to change (and in fact I think it did 
just change location in Julia#master)
     * It's a good enough proxy for our purposes anyway
     * If the original arrow pointer _isn't_ 16-byte aligned, then we'll 
allocate a new `Vector{T}`, which _will_ be aligned, then copy the arrow data 
directly into it via pointers. Simple, easy, just one extra allocation/copy.
   
   If Julia _does_ get the ability in the future to specify a custom 
larger-than-eltype-required alignemnt for arrays, then we could potentially do 
that ourselves when reading, but it's a little tricky because we really only 
need to do that if there are 16-byte primitives we'll be deserializing and we 
don't know that until we read a schema message. So *shrug*.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to