martin-traverse commented on PR #638: URL: https://github.com/apache/arrow-java/pull/638#issuecomment-2692815888
Hello. I have pushed a second iteration for review. Major updates are: * Added schema generation, to convert Arrow List<Field> to Avro Schema (record schema). I added doc comments with the type mapping. Schema is built of ArrowType while physical producers use MinorType, that bugged me at first but on reflection probably seems correct? * Updated the producers with a more complete set of types, including all widths of float / int types and logical types for all the date / time types (but not duration as of yet). Would appreciate a set of eyes on the timezone mapping for TZ aware timestamps! Per my reading of the Avro spec this is correct but worth double checking. I have tried to make it so most Arrow types will map to an Avro type if there is a suitable target. The basic approach is: * If a direct equivalent exists, use that * Next, choose an equivalent with no loss of precision (e.g. small int -> int) * Next, choose an equivalent with loss of precision but no loss of range (e.g. time nano -> time micros) * Finally, choose an equivalent with the best match and error on overflow (e.g. unsigned long -> long) I'm not sure about the last one, would it be better just to say we can't convert those types and throw an error up front? Large types (for text, binary, list) are not supported yet, there will be some extra thought needed on getting very large buffers between the two APIs. Interval / duration also not done yet, Arrow Interval resembles Avro Duration but is not an exact match and Arrow duration doesn't really have an equivalent that I could see. View types also not included as of now. I tried to move over to using long as the current row index, however a lot of the high-level APIs on the vector types use ints. For working directly with primitive types I was able to go to the underlying buffer (there were reasons to do that anyway in a few cases), but for the complex types that gets trickier because the type of the child vectors is not known. Is it ok to keep the interface using an int for current index for now? Most of the primitive producers are updated anyway, so the work to move over later would not be too much. I've run out of time for today. Next I plan to do the tests - like the consumers, will be exhaustive across types, combinations of nullability etc with some edge cases. Then the last thing I'd like to come back to on this PR is enums / dict encoding, just in case that has any impact on the structure of things would be good to know while it's still easy to change things. I have addressed some of the comments but not all, I will work through the rest but wanted to share this to check I'm still going in the right direction! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
