scovich commented on code in PR #9021:
URL: https://github.com/apache/arrow-rs/pull/9021#discussion_r2713781929


##########
arrow-json/src/reader/mod.rs:
##########
@@ -373,6 +386,95 @@ impl<R: BufRead> RecordBatchReader for Reader<R> {
     }
 }
 
+/// A trait to create custom decoders for specific data types.
+///
+/// This allows overriding the default decoders for specific data types,
+/// or adding new decoders for custom data types.
+///
+/// # Examples
+///
+/// ```
+/// use arrow_json::{ArrayDecoder, DecoderFactory, TapeElement, Tape, ReaderBuilder, StructMode};
+/// use arrow_schema::ArrowError;
+/// use arrow_schema::{DataType, Field, FieldRef, Fields, Schema};
+/// use arrow_array::cast::AsArray;
+/// use arrow_array::Array;
+/// use arrow_array::builder::StringBuilder;
+/// use arrow_data::ArrayData;
+/// use std::sync::Arc;
+///
+/// struct IncorrectStringAsNullDecoder {}
+///
+/// impl ArrayDecoder for IncorrectStringAsNullDecoder {
+///     fn decode(&mut self, tape: &Tape<'_>, pos: &[u32]) -> Result<ArrayData, ArrowError> {
+///         let mut builder = StringBuilder::new();
+///         for p in pos {
+///             match tape.get(*p) {
+///                 TapeElement::String(idx) => {
+///                     builder.append_value(tape.get_string(idx));
+///                 }
+///                 _ => builder.append_null(),
+///             }
+///         }
+///         Ok(builder.finish().into_data())
+///     }
+/// }
+///
+/// #[derive(Debug)]
+/// struct IncorrectStringAsNullDecoderFactory;
+///
+/// impl DecoderFactory for IncorrectStringAsNullDecoderFactory {
+///     fn make_default_decoder<'a>(
+///         &self,
+///         _field: Option<FieldRef>,
+///         data_type: DataType,
+///         _coerce_primitive: bool,
+///         _strict_mode: bool,
+///         _is_nullable: bool,
+///         _struct_mode: StructMode,
+///     ) -> Result<Option<Box<dyn ArrayDecoder>>, ArrowError> {
+///         match data_type {
+///             DataType::Utf8 => Ok(Some(Box::new(IncorrectStringAsNullDecoder {}))),
+///             _ => Ok(None),
+///         }
+///     }
+/// }
+///
+/// let json = r#"
+/// {"a": "a"}
+/// {"a": 12}
+/// "#;
+/// let batch = ReaderBuilder::new(Arc::new(Schema::new(Fields::from(vec![Field::new(
+///     "a",
+///     DataType::Utf8,
+///     true,
+/// )]))))
+/// .with_decoder_factory(Arc::new(IncorrectStringAsNullDecoderFactory))
+/// .build(json.as_bytes())
+/// .unwrap()
+/// .next()
+/// .unwrap()
+/// .unwrap();
+///
+/// let values = batch.column(0).as_string::<i32>();
+/// assert_eq!(values.len(), 2);
+/// assert_eq!(values.value(0), "a");
+/// assert!(values.is_null(1));
+/// ```
+pub trait DecoderFactory: std::fmt::Debug + Send + Sync {

Review Comment:
   I finally found time to chew on this a bit more, and I agree that schema annotation will be the best way to "tag" specific fields in the schema. The only potential annoyance is that the resulting Array shouldn't carry those annotations any more, so we might need a specific annotation name that the decoder machinery intentionally strips away when creating the final array?
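
   A rough sketch of how that tag-and-strip dance could look with plain field metadata -- the `arrow.json.decoder` key name is made up here, and the stripping step is only what the decoder machinery *might* do, not anything this PR already implements:

   ```rust
   use std::collections::HashMap;
   use arrow_schema::{DataType, Field};

   // Hypothetical annotation key; the real name (and whether it gets a
   // reserved prefix) would be part of this design.
   const DECODER_HINT_KEY: &str = "arrow.json.decoder";

   fn main() {
       // Tag a field so a custom DecoderFactory can recognize it.
       let tagged = Field::new("a", DataType::Utf8, true).with_metadata(HashMap::from([(
           DECODER_HINT_KEY.to_string(),
           "string_as_null".to_string(),
       )]));

       // What the decoder machinery might do when producing the output schema:
       // drop the internal annotation so it never leaks into the final array.
       let mut metadata = tagged.metadata().clone();
       metadata.remove(DECODER_HINT_KEY);
       let stripped = tagged.with_metadata(metadata);
       assert!(!stripped.metadata().contains_key(DECODER_HINT_KEY));
   }
   ```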
   
   Meanwhile, if we make `make_decoder` public, I think we can handle just about everything else pretty easily. For example:
   * If I want a struct decoder that returns NULL instead of an error when encountering a string -- use `make_decoder` to create a "backing" decoder, handle "bad" cases directly, and delegate "good" cases to the backing decoder (see the sketch after this list)
   * If I want a list decoder that expects `["a", "b", "c"]` but needs to also interpret `"[a, b, c]"` as a list of strings (actually encountered in production) -- again, delegate the normal case to the backing (default) decoder and handle the special case manually
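
   A rough sketch of that wrap-and-delegate shape, shown on the simpler Utf8 case rather than a struct so the fallback stays short. It assumes `make_decoder` (or some public equivalent) can hand back the stock decoder as a `Box<dyn ArrayDecoder>`; `LenientStringDecoder` and its `backing` field are made-up names:

   ```rust
   use arrow_array::builder::StringBuilder;
   use arrow_array::Array;
   use arrow_data::ArrayData;
   use arrow_json::{ArrayDecoder, Tape, TapeElement};
   use arrow_schema::ArrowError;

   /// Wraps the default Utf8 decoder (e.g. obtained from a public `make_decoder`)
   /// and only takes over when the default would choke.
   struct LenientStringDecoder {
       backing: Box<dyn ArrayDecoder>,
   }

   impl ArrayDecoder for LenientStringDecoder {
       fn decode(&mut self, tape: &Tape<'_>, pos: &[u32]) -> Result<ArrayData, ArrowError> {
           // Normal case: every element is a plain string, so just delegate
           // the whole batch to the backing (default) decoder.
           if pos
               .iter()
               .all(|p| matches!(tape.get(*p), TapeElement::String(_)))
           {
               return self.backing.decode(tape, pos);
           }
           // Special case: handle the "bad" rows ourselves, turning anything
           // that is not a string into a null instead of returning an error.
           let mut builder = StringBuilder::new();
           for p in pos {
               match tape.get(*p) {
                   TapeElement::String(idx) => builder.append_value(tape.get_string(idx)),
                   _ => builder.append_null(),
               }
           }
           Ok(builder.finish().into_data())
       }
   }
   ```

   The stringified-list case in the second bullet would follow the same shape: delegate when the element really is a list, and parse the quoted string manually otherwise.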


