martin-traverse opened a new pull request, #638: URL: https://github.com/apache/arrow-java/pull/638
## What's Changed

Per discussion in #615, here is a first take on the core producers to generate Avro data from Arrow vectors. There are a few points I'd like to clarify before going further:

* Nullability. Avro only understands nullable types as unions, but that is not normally how nullable fields are represented if the data comes from other sources. I have added a special NullableProducer to handle nullable vectors that are not unions. We will need something equivalent in the consumers, and probably a setting in AvroToArrowConfig to control it on read, defaulting to the current behaviour. I have also added special handling for nullable unions, because unions cannot be nested (i.e. you cannot nest "type | null" as a type inside a union). I can add consumers to handle both cases (unions and regular types) for review, if that sounds right? At the moment the schema for nullable fields gets quite mangled on a round trip! (There is a sketch of the nullable schema mapping at the end of this message.)
* Arrow has a lot more types than Avro at the level of minor / vector types. Going Avro -> Arrow we just pick the direct equivalent. Going Arrow -> Avro, we could cast silently where there is no loss of precision, e.g. TinyInt and SmallInt -> Int and so on. For types like Decimal256 and LargeVarChar we could write out safely, but we would need support in the consumers to read the wider types back. I could start by adding the safe conversions now and we could come back to the wide types in a later PR, maybe? (See the widening sketch at the end of this message.)
* Type information is inferred from the list of vectors, using minor types. We will also need to generate the Avro schema; the input for that would be a list of fields. I haven't done it yet but will do if that sounds right.
* Dictionary encoding for enums is not implemented yet; I'll add it if the rest looks good. The caveat is that dictionaries must be fixed before encoding starts if we are writing out the whole file in one go (i.e. if the Avro schema is at the start of the container file). If the schema is saved separately that limitation need not apply, since we could provide the schema once encoding is finished. (See the enum schema sketch at the end of this message.)

Please do let me know if this is going in the right direction, plus any comments. If it is, I will add the missing pieces and start on exhaustive test coverage to mirror the consumers. Once that's done, this PR should get us to the point where we can round trip the contents of an individual block for most data types, but it does not address the container format.
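To make the nullability point concrete, here is a minimal sketch of the schema-side mapping: an Arrow field's nullability has to be expressed as an Avro `["null", type]` union, because Avro has no standalone notion of nullability. The `toAvroSchema` helper and the handful of type cases are illustrative only, not the producer API in this PR.

```java
import java.util.Arrays;

import org.apache.arrow.vector.types.pojo.Field;
import org.apache.avro.Schema;

public class AvroSchemaSketch {

  // Illustrative helper: map one Arrow field to an Avro schema.
  // Only a couple of type cases are shown; a real mapping would cover
  // all minor types and respect bit widths / precision.
  static Schema toAvroSchema(Field field) {
    Schema base;
    switch (field.getType().getTypeID()) {
      case Int:
        base = Schema.create(Schema.Type.INT);     // simplification: ignores 64-bit / unsigned widths
        break;
      case Utf8:
        base = Schema.create(Schema.Type.STRING);
        break;
      case FloatingPoint:
        base = Schema.create(Schema.Type.DOUBLE);  // simplification: ignores single precision
        break;
      default:
        throw new UnsupportedOperationException("Not mapped: " + field.getType());
    }
    if (field.isNullable()) {
      // Avro can only express nullability as a union; null is conventionally the first branch.
      return Schema.createUnion(Arrays.asList(Schema.create(Schema.Type.NULL), base));
    }
    return base;
  }
}
```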
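For the safe widening casts, this is roughly the shape I have in mind, assuming a per-vector producer that pushes values through an Avro `Encoder`. The class name and `produce` signature here are illustrative only; the byte-to-int widening is the point.

```java
import java.io.IOException;

import org.apache.arrow.vector.TinyIntVector;
import org.apache.avro.io.Encoder;

// Illustrative producer shape, not the one in this PR: writes a TinyIntVector
// as Avro int values. The byte -> int widening is lossless, so consumers need
// no special support to read the data back (though the round-tripped Arrow
// type will come back as Int rather than TinyInt).
public class TinyIntWideningSketch {

  private final TinyIntVector vector;
  private int index = 0;

  public TinyIntWideningSketch(TinyIntVector vector) {
    this.vector = vector;
  }

  public void produce(Encoder encoder) throws IOException {
    // TinyIntVector.get returns a byte; it widens implicitly to int for Avro.
    encoder.writeInt(vector.get(index));
    index++;
  }
}
```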
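For dictionary encoding, the constraint is just that the enum symbols must be known before the schema is written. A rough sketch, assuming the dictionary values live in a VarCharVector (class and method names are illustrative, not the PR's API):

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.dictionary.Dictionary;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class EnumSchemaSketch {

  // Illustrative only: build an Avro enum schema from a completed dictionary.
  // If the schema sits at the head of the container file, the dictionary must
  // not grow once encoding has started; if the schema is stored separately,
  // this could run after encoding instead.
  static Schema enumSchema(String name, Dictionary dictionary) {
    // Assumes the dictionary values are strings held in a VarCharVector.
    VarCharVector values = (VarCharVector) dictionary.getVector();
    List<String> symbols = new ArrayList<>();
    for (int i = 0; i < values.getValueCount(); i++) {
      // Caveat: Avro enum symbols must match [A-Za-z_][A-Za-z0-9_]*, so real
      // dictionary values may need validation or a fallback to a string type.
      symbols.add(new String(values.get(i), StandardCharsets.UTF_8));
    }
    return SchemaBuilder.enumeration(name).symbols(symbols.toArray(new String[0]));
  }
}
```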
