martin-traverse opened a new pull request, #779:
URL: https://github.com/apache/arrow-java/pull/779

   ## What's Changed
   
   Updated ArrowToAvro to output dictionary-encoded string vectors as Avro 
enums, where possible. Apologies for the delay - busy as usual!
   
   To output dictionary-encoded vectors as enums, a dictionary provider 
containing all the required dictionaries must be supplied to the top-level 
methods. All dictionary values must be present when the schema is written, i.e. 
before the data blocks are produced. If data is written as a schema followed by 
multiple blocks, values added to a dictionary between blocks will not be 
included in the schema, resulting in an invalid Avro file (in general, supplying 
an invalid dictionary mapping will result in invalid output).
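
To illustrate why the dictionary has to be complete before the schema is written, here is a simplified stand-alone model in plain Java. It does not use the Arrow or Avro APIs; the class `EnumSchemaSnapshot` and its method `blockIsReadable` are hypothetical names made up for this sketch. It assumes the enum symbols are fixed at the moment the schema is written and that each block stores dictionary indices.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model: not part of Arrow or Avro.
public class EnumSchemaSnapshot {

    // Enum symbols as they were when the schema was written.
    private final List<String> schemaSymbols;

    public EnumSchemaSnapshot(List<String> dictionaryValues) {
        // Copy the values: later changes to the dictionary do not reach the schema.
        this.schemaSymbols = List.copyOf(dictionaryValues);
    }

    // True if every dictionary index in the block maps to a symbol in the schema.
    public boolean blockIsReadable(int[] indices) {
        for (int index : indices) {
            if (index < 0 || index >= schemaSymbols.size()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> dictionary = new ArrayList<>(List.of("RED", "GREEN"));

        // The schema is written first and captures two symbols.
        EnumSchemaSnapshot schema = new EnumSchemaSnapshot(dictionary);

        // First block only uses values known to the schema.
        System.out.println(schema.blockIsReadable(new int[] {0, 1, 0})); // true

        // A value is added between blocks; the schema is not updated.
        dictionary.add("BLUE");

        // Second block refers to index 2, which has no symbol in the schema.
        System.out.println(schema.blockIsReadable(new int[] {0, 2})); // false
    }
}
```

In this model the second block cannot be interpreted against the schema, which corresponds to the invalid-file case described above.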
   
   Dictionary-encoded fields are checked to ensure they are valid Avro enums. 
If the dictionary-encoded field is not a string field, or the string values are 
not valid Avro enum symbols, the field is decoded and output as literal values. 
This is done by calling DictionaryEncoder.decode(vector, dictionary), which 
allocates memory for the decoded vector. An alternative approach would be to 
decode values one by one; however, this would require a significant change to 
the producer pattern, since the current producers expect concrete vectors of 
the output type. Another option would be to throw an error if there are 
dictionary-encoded vectors that are not string types, i.e. push the 
responsibility onto client code. I'm not sure which approach is best - happy to 
take any guidance and I will update the code accordingly.
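
For reference, a minimal sketch of what a "valid Avro enum" check on the dictionary values could look like, in plain Java. It is based on the Avro specification's rules for enum symbols (each symbol matches `[A-Za-z_][A-Za-z0-9_]*` and symbols are unique). The class `AvroEnumCheck` and the method `isValidEnumSymbols` are hypothetical names for illustration and are not necessarily how the check in this PR is written.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper: not the implementation from this PR.
public class AvroEnumCheck {

    // Avro enum symbols follow the same rule as Avro names.
    private static final Pattern AVRO_NAME = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    // True if every value is a legal Avro enum symbol and no value is repeated.
    public static boolean isValidEnumSymbols(List<String> values) {
        Set<String> seen = new HashSet<>();
        for (String value : values) {
            if (value == null || !AVRO_NAME.matcher(value).matches()) {
                return false; // null or illegal characters
            }
            if (!seen.add(value)) {
                return false; // duplicate symbol
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidEnumSymbols(List.of("RED", "GREEN_2", "_x")));  // true
        System.out.println(isValidEnumSymbols(List.of("RED", "light blue")));     // false: contains a space
        System.out.println(isValidEnumSymbols(List.of("RED", "RED")));            // false: duplicate
    }
}
```

A dictionary whose values fail a check of this kind would take the fallback path described above (decode and write literal values).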
   
   To read enums back, the current approach for decoding is unchanged (the 
AvroToArrow config has to be set up with a MapDictionaryProvider, which is 
populated when data is read). The last part of the Avro work is to add the 
capability for reading / writing whole files block-by-block, so there is an 
opportunity to do something with the top-level APIs there. For now the current 
API works and I've used it in the round-trip tests.
   
   Please let me know any feedback, happy to update as needed!
   
   Closes #731.
   

