Re: [PR] GH-615: Produce Avro core data types out of Arrow VSR [arrow-java]

via GitHub Sun, 02 Mar 2025 09:00:29 -0800


martin-traverse commented on PR #638:
URL: https://github.com/apache/arrow-java/pull/638#issuecomment-2692815888


   Hello. I have pushed a second iteration for review. Major updates are:
   
   * Added schema generation, to convert Arrow List<Field> to Avro Schema 
(record schema). I added doc comments with the type mapping. Schema is built of 
ArrowType while physical producers use MinorType, that bugged me at first but 
on reflection probably seems correct?
   * Updated the producers with a more complete set of types, including all 
widths of float / int types and logical types for all the date / time types 
(but not duration as of yet). Would appreciate a set of eyes on the timezone 
mapping for TZ aware timestamps! Per my reading of the Avro spec this is 
correct but worth double checking.
   
   I have tried to make it so most Arrow types will map to an Avro type if 
there is a suitable target. The basic approach is:
   
   * If a direct equivalent exists, use that
   * Next, choose an equivalent with no loss of precision (e.g. small int -> 
int)
   * Next, choose an equivalent with loss of precision but no loss of range 
(e.g. time nano -> time micros)
   * Finally, choose an equivalent with the best match and error on overflow 
(e.g. unsigned long -> long)
   
   I'm not sure about the last one, would it be better just to say we can't 
convert those types and throw an error up front?
   
   Large types (for text, binary, list) are not supported yet, there will be 
some extra thought needed on getting very large buffers between the two APIs. 
Interval / duration also not done yet, Arrow Interval resembles Avro Duration 
but is not an exact match and Arrow duration doesn't really have an equivalent 
that I could see. View types also not included as of now.
   
   I tried to move over to using long as the current row index, however a lot 
of the high-level APIs on the vector types use ints. For working directly with 
primitive types I was able to go to the underlying buffer (there were reasons 
to do that anyway in a few cases), but for the complex types that gets trickier 
because the type of the child vectors is not known. Is it ok to keep the 
interface using an int for current index for now? Most of the primitive 
producers are updated anyway, so the work to move over later would not be too 
much.
   
   I've run out of time for today. Next I plan to do the tests - like the 
consumers, will be exhaustive across types, combinations of nullability etc 
with some edge cases. Then the last thing I'd like to come back to on this PR 
is enums / dict encoding, just in case that has any impact on the structure of 
things would be good to know while it's still easy to change things.
   
   I have addressed some of the comments but not all, I will work through the 
rest but wanted to share this to check I'm still going in the right direction!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] GH-615: Produce Avro core data types out of Arrow VSR [arrow-java]

Reply via email to