Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

via GitHub Wed, 29 Jan 2025 10:06:04 -0800


kylebarron commented on code in PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#discussion_r1934355453



##########
arrow-json/src/writer/encoder.rs:
##########
@@ -25,126 +27,157 @@ use arrow_schema::{ArrowError, DataType, FieldRef};
 use half::f16;
 use lexical_core::FormattedSize;
 use serde::Serializer;
-use std::io::Write;
 
 #[derive(Debug, Clone, Default)]
 pub struct EncoderOptions {
     pub explicit_nulls: bool,
     pub struct_mode: StructMode,
+    pub encoder_factory: Option<Arc<dyn EncoderFactory>>,
+}
+
+/// A trait to create custom encoders for specific data types.
+///
+/// This allows overriding the default encoders for specific data types,
+/// or adding new encoders for custom data types.
+pub trait EncoderFactory: std::fmt::Debug {
+    /// Make an encoder that if returned runs before all of the default encoders.
+    /// This can be used to override how e.g. binary data is encoded so that it is an encoded string or an array of integers.
+    fn make_default_encoder<'a>(
+        &self,
+        _array: &'a dyn Array,

Review Comment:
   I'd be interested in using the changes in this PR to write 
[GeoJSON](https://datatracker.ietf.org/doc/html/rfc7946) from 
[geoarrow-rs](https://github.com/geoarrow/geoarrow-rs).
   
   However, this API would not be sufficient for me because it assumes that the
physical `Array` alone determines how to encode the data. That is not true for
geospatial data (at least for Arrow data following the
[GeoArrow specification](https://geoarrow.org/)), because the same physical
layout can describe multiple logical types.
   
   For example, an array of `LineString` and an array of `MultiPoint` would both
be stored as an Arrow `List[FixedSizeList[2, Float64]]`, so the extension
metadata on the `Field` is needed to know whether to write
[`"type": "MultiPoint"`](https://datatracker.ietf.org/doc/html/rfc7946#section-3.1.3) or
[`"type": "LineString"`](https://datatracker.ietf.org/doc/html/rfc7946#section-3.1.4)
in each JSON object.
   
   Given that the existing JSON `Writer` API writes a `RecordBatch`, would it be
possible to access the `Field` and pass it through here, instead of passing
only the `Array`?
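
To make the ambiguity concrete, here is a minimal, std-only sketch (it does not use the arrow-rs API). It assumes the field metadata is available as a string map, that the extension name is stored under the standard `ARROW:extension:name` key, and that the GeoArrow extension names are `geoarrow.linestring` and `geoarrow.multipoint`. The helpers `geojson_type` and `encode_geometry` are hypothetical names used only for illustration.

```rust
use std::collections::HashMap;

/// Field-metadata key under which Arrow stores an extension type's name.
const EXTENSION_NAME_KEY: &str = "ARROW:extension:name";

/// Hypothetical helper: pick the GeoJSON "type" value from field metadata.
/// Returns `None` when the field carries no recognized GeoArrow extension name.
fn geojson_type(field_metadata: &HashMap<String, String>) -> Option<&'static str> {
    match field_metadata.get(EXTENSION_NAME_KEY)?.as_str() {
        "geoarrow.linestring" => Some("LineString"),
        "geoarrow.multipoint" => Some("MultiPoint"),
        _ => None,
    }
}

/// Hypothetical helper: encode one list of [x, y] pairs as a GeoJSON geometry.
/// The coordinates have the same shape for both types; only the metadata
/// decides which "type" string is written.
fn encode_geometry(
    field_metadata: &HashMap<String, String>,
    coords: &[[f64; 2]],
) -> Option<String> {
    let ty = geojson_type(field_metadata)?;
    let pairs: Vec<String> = coords
        .iter()
        .map(|[x, y]| format!("[{x},{y}]"))
        .collect();
    Some(format!(
        r#"{{"type":"{ty}","coordinates":[{}]}}"#,
        pairs.join(",")
    ))
}
```

Calling `encode_geometry` twice with identical coordinates but different metadata yields two different GeoJSON objects, which is why a factory that sees only the `Array` cannot choose between them.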



