martin-traverse opened a new pull request, #779: URL: https://github.com/apache/arrow-java/pull/779
## What's Changed

Updated ArrowToAvro to output dictionary-encoded string vectors as Avro enums, where possible. Apologies for the delay - busy as usual!

To output dictionary-encoded vectors as enums, a dictionary provider containing all the required dictionaries must be supplied to the top-level methods. All dictionary values must be present when the schema is written, i.e. before the data blocks are produced. If data is being written as a schema followed by multiple blocks, values added to a dictionary in between blocks will not be included in the schema, resulting in an invalid Avro file (in general, supplying an invalid dictionary mapping will result in invalid output).

Dictionary-encoded fields are checked to ensure they are valid Avro enums. If the dictionary-encoded field is not a string field, or the string values are not valid Avro enum symbols, the field is decoded and output as literal values. This is done by calling DictionaryEncoder.decode(vector, dictionary), which consumes memory for the decoded vector. An alternative approach would be to decode values one by one; however, this would require a significant change to the producer pattern, since the current producers expect concrete vectors of the output type. Another option would be to throw an error if there are dictionary-encoded vectors that are not string types, i.e. push the responsibility onto client code. I'm not sure which approach is best - happy to take any guidance and I will update the code accordingly.

To read enums back, the current approach for decoding is unchanged (the AvroToArrow config has to be set up with a MapDictionaryProvider, which is populated when data is read).

The last part of the Avro work is to add the capability for reading / writing whole files block-by-block, so there is an opportunity to do something with the top-level APIs there. For now the current API works and I've used it in the round-trip tests.

Please let me know any feedback, happy to update as needed!

Closes #731.
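For illustration, the enum validity check described above could be sketched roughly as follows. This is a hypothetical, standalone helper (the class and method names `AvroEnumSymbols` and `isValidEnum` are assumptions, not code from this PR). It relies only on the Avro specification's naming rule, that a name starts with `[A-Za-z_]` and continues with `[A-Za-z0-9_]`, plus the requirement that enum symbols are unique.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not the implementation in this PR.
final class AvroEnumSymbols {

    // Avro name rule: first character [A-Za-z_], remaining characters [A-Za-z0-9_].
    private static final Pattern SYMBOL = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    private AvroEnumSymbols() {}

    // Returns true if every dictionary value is a legal Avro enum symbol
    // and no value appears more than once.
    static boolean isValidEnum(List<String> dictionaryValues) {
        if (dictionaryValues == null) {
            return false;
        }
        Set<String> seen = new HashSet<>();
        for (String value : dictionaryValues) {
            if (value == null || !SYMBOL.matcher(value).matches()) {
                return false; // not a legal Avro name
            }
            if (!seen.add(value)) {
                return false; // duplicate symbol
            }
        }
        return true;
    }
}
```

A dictionary that fails a check like this would take the fallback path described above, i.e. the vector is decoded and written as literal values rather than as an enum.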
