debugmiller commented on code in PR #9021:
URL: https://github.com/apache/arrow-rs/pull/9021#discussion_r2640438802
##########
arrow-json/src/reader/mod.rs:
##########
@@ -373,6 +386,95 @@ impl<R: BufRead> RecordBatchReader for Reader<R> {
}
}
+/// A trait to create custom decoders for specific data types.
+///
+/// This allows overriding the default decoders for specific data types,
+/// or adding new decoders for custom data types.
+///
+/// # Examples
+///
+/// ```
+/// use arrow_json::{ArrayDecoder, DecoderFactory, TapeElement, Tape, ReaderBuilder, StructMode};
+/// use arrow_schema::ArrowError;
+/// use arrow_schema::{DataType, Field, FieldRef, Fields, Schema};
+/// use arrow_array::cast::AsArray;
+/// use arrow_array::Array;
+/// use arrow_array::builder::StringBuilder;
+/// use arrow_data::ArrayData;
+/// use std::sync::Arc;
+///
+/// struct IncorrectStringAsNullDecoder {}
+///
+/// impl ArrayDecoder for IncorrectStringAsNullDecoder {
+/// fn decode(&mut self, tape: &Tape<'_>, pos: &[u32]) -> Result<ArrayData, ArrowError> {
+/// let mut builder = StringBuilder::new();
+/// for p in pos {
+/// match tape.get(*p) {
+/// TapeElement::String(idx) => {
+/// builder.append_value(tape.get_string(idx));
+/// }
+/// _ => builder.append_null(),
+/// }
+/// }
+/// Ok(builder.finish().into_data())
+/// }
+/// }
+///
+/// #[derive(Debug)]
+/// struct IncorrectStringAsNullDecoderFactory;
+///
+/// impl DecoderFactory for IncorrectStringAsNullDecoderFactory {
+/// fn make_default_decoder<'a>(
+/// &self,
+/// _field: Option<FieldRef>,
+/// data_type: DataType,
+/// _coerce_primitive: bool,
+/// _strict_mode: bool,
+/// _is_nullable: bool,
+/// _struct_mode: StructMode,
+/// ) -> Result<Option<Box<dyn ArrayDecoder>>, ArrowError> {
+/// match data_type {
+/// DataType::Utf8 => Ok(Some(Box::new(IncorrectStringAsNullDecoder {}))),
+/// _ => Ok(None),
+/// }
+/// }
+/// }
+///
+/// let json = r#"
+/// {"a": "a"}
+/// {"a": 12}
+/// "#;
+/// let batch = ReaderBuilder::new(Arc::new(Schema::new(Fields::from(vec![Field::new(
+/// "a",
+/// DataType::Utf8,
+/// true,
+/// )]))))
+/// .with_decoder_factory(Arc::new(IncorrectStringAsNullDecoderFactory))
+/// .build(json.as_bytes())
+/// .unwrap()
+/// .next()
+/// .unwrap()
+/// .unwrap();
+///
+/// let values = batch.column(0).as_string::<i32>();
+/// assert_eq!(values.len(), 2);
+/// assert_eq!(values.value(0), "a");
+/// assert!(values.is_null(1));
+/// ```
+pub trait DecoderFactory: std::fmt::Debug + Send + Sync {
Review Comment:
I'm a little worried about this PR growing in scope. (My main goal was to
make it so that variant could be decoded, and I've already had to pull in the
extension code too.) Nevertheless, I think it's worth considering how to make
this more useful.
I'm hearing two related uses.
First, changing the behavior of some fields in complex types: I think
another way of handling this is to augment the schema in some way with a
special decoder kind. This would allow the default behavior to be used by
the decoder (including on structs/lists), with specialized decoders substituted
in directly on the (sub)fields indicated by the schema. I'm not sure if this
would fit into the existing schema type (e.g. special metadata) or if a new kind
of `DecoderSchema` would be created to hold this information.
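
A rough, standalone sketch of the schema-metadata variant of this idea follows. It is not arrow-rs code: `FieldSketch`, `DecoderKind`, `plan_decoders`, and the metadata key `arrow_json.decoder_kind` are all hypothetical names invented for illustration, and the field type is a simplified stand-in for a real schema field. The point is only that a per-field marker lets the default decoder handle everything except the (sub)fields that opt in to something else.

```rust
use std::collections::HashMap;

/// Hypothetical metadata key; not an existing arrow-rs convention.
const DECODER_KIND_KEY: &str = "arrow_json.decoder_kind";

/// Which decoder a (sub)field asks for. Hypothetical type.
#[derive(Debug, Clone, PartialEq, Eq)]
enum DecoderKind {
    /// Use the built-in decoder for the field's data type.
    Default,
    /// Use a specialized decoder registered under this name.
    Custom(String),
}

/// Simplified stand-in for a schema field with optional nested children.
struct FieldSketch {
    name: String,
    metadata: HashMap<String, String>,
    children: Vec<FieldSketch>,
}

impl FieldSketch {
    fn new(name: &str) -> Self {
        FieldSketch {
            name: name.to_string(),
            metadata: HashMap::new(),
            children: Vec::new(),
        }
    }

    /// Mark this field as wanting a specialized decoder.
    fn with_decoder_kind(mut self, kind: &str) -> Self {
        self.metadata
            .insert(DECODER_KIND_KEY.to_string(), kind.to_string());
        self
    }

    fn with_children(mut self, children: Vec<FieldSketch>) -> Self {
        self.children = children;
        self
    }
}

/// Read the marker from field metadata, falling back to the default decoder.
fn decoder_kind(field: &FieldSketch) -> DecoderKind {
    match field.metadata.get(DECODER_KIND_KEY) {
        Some(kind) => DecoderKind::Custom(kind.clone()),
        None => DecoderKind::Default,
    }
}

/// Walk the schema depth-first and record the decoder kind for every
/// (sub)field, keyed by its dotted path.
fn plan_decoders(field: &FieldSketch, prefix: &str, out: &mut Vec<(String, DecoderKind)>) {
    let path = if prefix.is_empty() {
        field.name.clone()
    } else {
        format!("{}.{}", prefix, field.name)
    };
    out.push((path.clone(), decoder_kind(field)));
    for child in &field.children {
        plan_decoders(child, &path, out);
    }
}
```

With a struct field `root` whose child `payload` is marked `with_decoder_kind("variant")`, `plan_decoders` reports `Default` for `root` and its unmarked children and `Custom("variant")` only for `root.payload`, so struct and list handling stays with the built-in decoders.
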
Second, for catching errors on primitive types and substituting something
else (e.g. NULL), my suggestion is to make that an optional setting on the
built-in primitive decoder, possibly combined with the solution above
so that certain fields could be marked with this behavior in the
schema.
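
To make the second suggestion concrete, here is a minimal standalone sketch of a primitive decoder with an opt-in "substitute NULL on error" setting. Nothing here is existing arrow-json API: `OnInvalid`, `Value`, and `decode_i64` are hypothetical names, and `Value` is a simplified stand-in for a JSON tape element.

```rust
/// Hypothetical per-decoder setting; not an existing arrow-json option.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OnInvalid {
    /// Fail the whole batch on the first value that cannot be decoded.
    Error,
    /// Replace values that cannot be decoded with null.
    Null,
}

/// Simplified stand-in for a JSON tape element.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Number(String),
    String(String),
}

/// Decode a column of values as nullable 64-bit integers, applying the
/// configured policy to anything that is not a valid integer.
fn decode_i64(values: &[Value], on_invalid: OnInvalid) -> Result<Vec<Option<i64>>, String> {
    let mut out = Vec::with_capacity(values.len());
    for (row, value) in values.iter().enumerate() {
        // An explicit JSON null is always a null, regardless of policy.
        let parsed = match value {
            Value::Null => {
                out.push(None);
                continue;
            }
            Value::Number(text) => text.parse::<i64>().ok(),
            Value::String(_) => None,
        };
        match (parsed, on_invalid) {
            (Some(v), _) => out.push(Some(v)),
            (None, OnInvalid::Null) => out.push(None),
            (None, OnInvalid::Error) => {
                return Err(format!("row {}: cannot decode {:?} as Int64", row, value));
            }
        }
    }
    Ok(out)
}
```

Decoding the rows `12`, `"a"`, and `null` with `OnInvalid::Null` yields `Some(12)`, `None`, `None`, while `OnInvalid::Error` rejects the batch at the string. The policy could be supplied per field through a schema marker rather than globally, which is how the two suggestions would combine.
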