seddonm1 commented on a change in pull request #8794:
URL: https://github.com/apache/arrow/pull/8794#discussion_r534629774



##########
File path: rust/arrow/src/compute/kernels/cast.rs
##########
@@ -376,6 +378,27 @@ pub fn cast(array: &ArrayRef, to_type: &DataType) -> Result<ArrayRef> {
             Int64 => cast_string_to_numeric::<Int64Type>(array),
             Float32 => cast_string_to_numeric::<Float32Type>(array),
             Float64 => cast_string_to_numeric::<Float64Type>(array),
+            Date32(DateUnit::Day) => {
+                use chrono::{NaiveDate, NaiveTime};
+                let zero_time = NaiveTime::from_hms(0, 0, 0);
+                let string_array = array.as_any().downcast_ref::<StringArray>().unwrap();
+                let mut builder = PrimitiveBuilder::<Date32Type>::new(string_array.len());
+                for i in 0..string_array.len() {
+                    if string_array.is_null(i) {
+                        builder.append_null()?;
+                    } else {
+                        match NaiveDate::parse_from_str(string_array.value(i), "%Y-%m-%d")
+                        {
+                            Ok(date) => builder.append_value(
+                                (date.and_time(zero_time).timestamp() / SECONDS_IN_DAY)
+                                    as i32,
+                            )?,
+                            Err(_) => builder.append_null()?, // not a valid date

Review comment:
       @nevi-me
   Yes, that is definitely an option, but I would hope that the default is strict, with the option to relax it for users who understand the risk.
   
   We wrote a lot of code (https://github.com/tripl-ai/arc/blob/master/src/main/scala/ai/tripl/arc/transform/TypingTransform.scala) to safely apply correct data typing from strings in Spark, in a way that allows us to parse the entire dataset and collect errors per row before returning.
   
   This works in one of two ways (like you have described); a rough sketch of how a similar toggle could look in the Rust cast kernel follows after this list:
   - a way that returns NULL for invalid conversions plus a list of errors for the row (which we should have implemented as a `map[field name, error]`), which the user can then address. If any field that contains an error is not nullable, the job fails immediately.
   - a way that fails fast, simulating the ANSI approach where the first error found throws an error (which can be very slow to debug).
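   To make the distinction concrete, here is a minimal sketch of what an opt-in lenient mode could look like for the string-to-Date32 path in this PR. The `CastOptions` struct, its `safe` field, and the `cast_utf8_to_date32` helper are hypothetical names for illustration, not the crate's current API; the point is only that the same parsing loop can either NULL out a bad value or return the first error:

```rust
// Illustrative only: `CastOptions` and `safe` are hypothetical names,
// not the existing arrow crate API.
use std::sync::Arc;

use arrow::array::{ArrayRef, PrimitiveBuilder, StringArray};
use arrow::datatypes::Date32Type;
use arrow::error::{ArrowError, Result};
use chrono::NaiveDate;

const SECONDS_IN_DAY: i64 = 86_400;

pub struct CastOptions {
    /// true: invalid strings become NULL; false: fail fast on the first error
    pub safe: bool,
}

fn cast_utf8_to_date32(array: &ArrayRef, options: &CastOptions) -> Result<ArrayRef> {
    let string_array = array
        .as_any()
        .downcast_ref::<StringArray>()
        .expect("caller guarantees a StringArray");
    let mut builder = PrimitiveBuilder::<Date32Type>::new(string_array.len());

    for i in 0..string_array.len() {
        if string_array.is_null(i) {
            builder.append_null()?;
            continue;
        }
        match NaiveDate::parse_from_str(string_array.value(i), "%Y-%m-%d") {
            // Date32 stores days since the UNIX epoch
            Ok(date) => builder
                .append_value((date.and_hms(0, 0, 0).timestamp() / SECONDS_IN_DAY) as i32)?,
            // lenient mode: NULL out the bad value, as the current PR does
            Err(_) if options.safe => builder.append_null()?,
            // strict mode: surface the first failure to the caller
            Err(e) => {
                return Err(ArrowError::ComputeError(format!(
                    "cannot cast '{}' to Date32: {}",
                    string_array.value(i),
                    e
                )))
            }
        }
    }
    Ok(Arc::new(builder.finish()))
}
```

   Defaulting `safe` to false would give the strict behaviour I would hope for, while still letting users who understand the risk opt into the lenient path.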



