martin-traverse opened a new pull request, #638: URL: https://github.com/apache/arrow-java/pull/638
## What's Changed

Per discussion in #615, here is a first take on the core producers to generate Avro data from Arrow vectors. There are a few points I'd like to clarify before going further:

* Nullability. Avro only understands nullable types as unions, but that is not normally how nullable fields are represented if the data comes from other sources. I have added a special NullableProducer to handle nullable vectors that are not unions. We will need something equivalent in the consumers, and probably a setting in AvroToArrowConfig to control it on read, defaulting to the current behaviour. I have also added special handling for nullable unions, because unions cannot be nested (i.e. you cannot nest "type | null" as a type inside a union). I can add consumers to handle both cases (unions and regular types) for review, if that sounds right? At the moment the schema for nullable fields gets quite mangled on a round trip! (There is a sketch of the nullable schema mapping at the end of this message.)
* Arrow has a lot more types than Avro at the level of minor / vector types. Going Avro -> Arrow we just pick the direct equivalent. Going Arrow -> Avro, we could cast silently where there is no loss of precision, e.g. TinyInt and SmallInt -> Int and so on. For types like Decimal256 and LargeVarChar we could write out safely, but we would need support in the consumers to read the wider types back. I could start by adding the safe conversions now and we could come back to the wide types in a later PR, maybe? (See the widening sketch at the end of this message.)
* Type information is inferred from the list of vectors, using minor types. We will also need to generate the Avro schema; the input for that would be a list of fields. I haven't done it yet but will do if that sounds right.
* Dictionary encoding for enums is not implemented yet; I'll add it if the rest looks good. The caveat is that dictionaries must be fixed before encoding starts if we are writing out the whole file in one go (i.e. if the Avro schema is at the start of the container file). If the schema is saved separately that limitation need not apply, since we could provide the schema once encoding is finished. (See the enum schema sketch at the end of this message.)

Please do let me know if this is going in the right direction, plus any comments. If it is, I will add the missing pieces and start on exhaustive test coverage to mirror the consumers. Once that's done, this PR should get us to the point where we can round trip the contents of an individual block for most data types, but it does not address the container format.
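To make the nullability point concrete, here is a minimal sketch of the schema-side mapping: an Arrow field's nullability has to be expressed as an Avro `["null", type]` union, because Avro has no standalone notion of nullability. The `toAvroSchema` helper and the handful of type cases are illustrative only, not the producer API in this PR.

```java
import java.util.Arrays;

import org.apache.arrow.vector.types.pojo.Field;
import org.apache.avro.Schema;

public class AvroSchemaSketch {

  // Illustrative helper: map one Arrow field to an Avro schema.
  // Only a couple of type cases are shown; a real mapping would cover
  // all minor types and respect bit widths / precision.
  static Schema toAvroSchema(Field field) {
    Schema base;
    switch (field.getType().getTypeID()) {
      case Int:
        base = Schema.create(Schema.Type.INT);     // simplification: ignores 64-bit / unsigned widths
        break;
      case Utf8:
        base = Schema.create(Schema.Type.STRING);
        break;
      case FloatingPoint:
        base = Schema.create(Schema.Type.DOUBLE);  // simplification: ignores single precision
        break;
      default:
        throw new UnsupportedOperationException("Not mapped: " + field.getType());
    }
    if (field.isNullable()) {
      // Avro can only express nullability as a union; null is conventionally the first branch.
      return Schema.createUnion(Arrays.asList(Schema.create(Schema.Type.NULL), base));
    }
    return base;
  }
}
```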
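For the safe widening casts, this is roughly the shape I have in mind, assuming a per-vector producer that pushes values through an Avro `Encoder`. The class name and `produce` signature here are illustrative only; the byte-to-int widening is the point.

```java
import java.io.IOException;

import org.apache.arrow.vector.TinyIntVector;
import org.apache.avro.io.Encoder;

// Illustrative producer shape, not the one in this PR: writes a TinyIntVector
// as Avro int values. The byte -> int widening is lossless, so consumers need
// no special support to read the data back (though the round-tripped Arrow
// type will come back as Int rather than TinyInt).
public class TinyIntWideningSketch {

  private final TinyIntVector vector;
  private int index = 0;

  public TinyIntWideningSketch(TinyIntVector vector) {
    this.vector = vector;
  }

  public void produce(Encoder encoder) throws IOException {
    // TinyIntVector.get returns a byte; it widens implicitly to int for Avro.
    encoder.writeInt(vector.get(index));
    index++;
  }
}
```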
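For dictionary encoding, the constraint is just that the enum symbols must be known before the schema is written. A rough sketch, assuming the dictionary values live in a VarCharVector (class and method names are illustrative, not the PR's API):

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.dictionary.Dictionary;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class EnumSchemaSketch {

  // Illustrative only: build an Avro enum schema from a completed dictionary.
  // If the schema sits at the head of the container file, the dictionary must
  // not grow once encoding has started; if the schema is stored separately,
  // this could run after encoding instead.
  static Schema enumSchema(String name, Dictionary dictionary) {
    // Assumes the dictionary values are strings held in a VarCharVector.
    VarCharVector values = (VarCharVector) dictionary.getVector();
    List<String> symbols = new ArrayList<>();
    for (int i = 0; i < values.getValueCount(); i++) {
      // Caveat: Avro enum symbols must match [A-Za-z_][A-Za-z0-9_]*, so real
      // dictionary values may need validation or a fallback to a string type.
      symbols.add(new String(values.get(i), StandardCharsets.UTF_8));
    }
    return SchemaBuilder.enumeration(name).symbols(symbols.toArray(new String[0]));
  }
}
```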
