seddonm1 commented on a change in pull request #8794:
URL: https://github.com/apache/arrow/pull/8794#discussion_r534629774
##########
File path: rust/arrow/src/compute/kernels/cast.rs
##########
@@ -376,6 +378,27 @@ pub fn cast(array: &ArrayRef, to_type: &DataType) -> Result<ArrayRef> {
             Int64 => cast_string_to_numeric::<Int64Type>(array),
             Float32 => cast_string_to_numeric::<Float32Type>(array),
             Float64 => cast_string_to_numeric::<Float64Type>(array),
+            Date32(DateUnit::Day) => {
+                use chrono::{NaiveDate, NaiveTime};
+                let zero_time = NaiveTime::from_hms(0, 0, 0);
+                let string_array = array.as_any().downcast_ref::<StringArray>().unwrap();
+                let mut builder = PrimitiveBuilder::<Date32Type>::new(string_array.len());
+                for i in 0..string_array.len() {
+                    if string_array.is_null(i) {
+                        builder.append_null()?;
+                    } else {
+                        match NaiveDate::parse_from_str(string_array.value(i), "%Y-%m-%d")
+                        {
+                            Ok(date) => builder.append_value(
+                                (date.and_time(zero_time).timestamp() / SECONDS_IN_DAY)
+                                    as i32,
+                            )?,
+                            Err(_) => builder.append_null()?, // not a valid date
Review comment:
@nevi-me
Yes, that is definitely an option, but I would hope that the default is
strict, with the option to relax it for users who understand the risk.
We wrote a lot of code
(https://github.com/tripl-ai/arc/blob/master/src/main/scala/ai/tripl/arc/transform/TypingTransform.scala)
to safely apply correct data typing from strings in Spark in a way that allows
us to parse the entire dataset and collect errors per row before returning.
This works in one of two ways (like you have described):
- a mode that returns NULL for invalid conversions along with a list of errors
for the row (which, in hindsight, should have been a `map[field name, error]`)
that the user can then address. If any field that contains an error is not
nullable, the job fails immediately.
- a mode that fails fast, simulating the ANSI approach where the first error
found throws an error (which can be very slow to debug).
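
To make the two modes concrete, here is a minimal sketch in plain Rust. It is not the Arrow kernel and not the Arc implementation: `parse_date32`, `cast_collecting`, and `cast_strict` are hypothetical names, the error map is keyed by row index rather than field name, and the date parser is a hand-rolled stand-in for `NaiveDate::parse_from_str` so the example needs no external crates.

```rust
use std::collections::HashMap;

/// Hypothetical helper: parse a strict `YYYY-MM-DD` string into days since
/// 1970-01-01 (the Date32 representation).
fn parse_date32(s: &str) -> Result<i32, String> {
    let parts: Vec<&str> = s.split('-').collect();
    let well_formed = parts.len() == 3
        && parts[0].len() == 4
        && parts[1].len() == 2
        && parts[2].len() == 2
        && parts.iter().all(|p| p.bytes().all(|b| b.is_ascii_digit()));
    if !well_formed {
        return Err(format!("'{}' is not in YYYY-MM-DD form", s));
    }
    // These parses cannot fail: every part is a short run of ASCII digits.
    let year: i64 = parts[0].parse().unwrap();
    let month: i64 = parts[1].parse().unwrap();
    let day: i64 = parts[2].parse().unwrap();

    let leap = (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
    let days_in_month = match month {
        1 | 3 | 5 | 7 | 8 | 10 | 12 => 31,
        4 | 6 | 9 | 11 => 30,
        2 if leap => 29,
        2 => 28,
        _ => return Err(format!("'{}' has an invalid month", s)),
    };
    if day < 1 || day > days_in_month {
        return Err(format!("'{}' has an invalid day", s));
    }

    // Days-from-civil arithmetic on a March-based year.
    let y = if month <= 2 { year - 1 } else { year };
    let era = y.div_euclid(400);
    let yoe = y.rem_euclid(400);
    let mp = if month > 2 { month - 3 } else { month + 9 };
    let doy = (153 * mp + 2) / 5 + day - 1;
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    Ok((era * 146_097 + doe - 719_468) as i32)
}

/// Permissive mode: invalid values become NULL (`None`) and the error is
/// recorded per row. If the field is not nullable, the first error fails the
/// whole cast instead.
fn cast_collecting(
    values: &[Option<&str>],
    nullable: bool,
) -> Result<(Vec<Option<i32>>, HashMap<usize, String>), String> {
    let mut out = Vec::with_capacity(values.len());
    let mut errors = HashMap::new();
    for (row, value) in values.iter().enumerate() {
        match value {
            None => out.push(None),
            Some(s) => match parse_date32(s) {
                Ok(days) => out.push(Some(days)),
                Err(e) if nullable => {
                    out.push(None);
                    errors.insert(row, e);
                }
                Err(e) => return Err(format!("row {}: {}", row, e)),
            },
        }
    }
    Ok((out, errors))
}

/// Strict mode: the first invalid value aborts the cast (fail fast).
fn cast_strict(values: &[Option<&str>]) -> Result<Vec<Option<i32>>, String> {
    values
        .iter()
        .enumerate()
        .map(|(row, value)| match value {
            None => Ok(None),
            Some(s) => parse_date32(s)
                .map(Some)
                .map_err(|e| format!("row {}: {}", row, e)),
        })
        .collect()
}
```

With the input `[Some("1970-01-02"), None, Some("not a date")]`, `cast_collecting(&input, true)` yields the values `[Some(1), None, None]` plus one recorded error for row 2, while `cast_strict(&input)` returns an `Err` for the same row. The permissive mode lets a user see every bad row in one pass; the strict mode surfaces only the first.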