mbutrovich commented on code in PR #4541:
URL: https://github.com/apache/datafusion-comet/pull/4541#discussion_r3507285809
##########
spark/src/main/scala/org/apache/spark/sql/comet/util/Utils.scala:
##########
@@ -108,6 +108,7 @@ object Utils extends CometTypeShim with Logging {
case yi: ArrowType.Interval if yi.getUnit == IntervalUnit.YEAR_MONTH =>
YearMonthIntervalType()
case di: ArrowType.Interval if di.getUnit == IntervalUnit.DAY_TIME =>
DayTimeIntervalType()
+ case d: ArrowType.Duration if d.getUnit == TimeUnit.MICROSECOND =>
DayTimeIntervalType()
Review Comment:
Observation, no change requested. `fromArrowType` now maps both
`Interval(DayTime)` (line 110) and `Duration(Microsecond)` (line 111) to
`DayTimeIntervalType`, while `toArrowType` only ever emits
`Duration(Microsecond)`. So the `Interval(DayTime)` reverse case is dead for
Comet-produced data and is presumably kept for externally-produced Arrow. That
is fine and consistent with the precision rationale, just flagging the
asymmetry in case it was unintentional.
##########
spark/src/main/scala/org/apache/spark/sql/comet/util/Utils.scala:
##########
@@ -151,6 +152,10 @@ object Utils extends CometTypeShim with Logging {
case NullType => ArrowType.Null.INSTANCE
case dt if isTimeType(dt) =>
new ArrowType.Time(TimeUnit.NANOSECOND, 64)
+ case _: YearMonthIntervalType => new
ArrowType.Interval(IntervalUnit.YEAR_MONTH)
+ // Spark stores DayTimeIntervalType as microseconds in an int64,
matching Arrow
+ // Duration(Microsecond) rather than the lossy Interval(DayTime) {days,
millis} layout.
+ case _: DayTimeIntervalType => new
ArrowType.Duration(TimeUnit.MICROSECOND)
Review Comment:
Verified correct, no change needed for this PR. The representation choices
match Spark's internal storage exactly: `YearMonthIntervalType` is int32
months, which maps cleanly to `Interval(YearMonth)`; `DayTimeIntervalType` is
an int64 microsecond count, which round-trips faithfully through
`Duration(Microsecond)`. The comment correctly notes that `Interval(DayTime)`'s
`{days, millis}` layout would lose microsecond precision, so avoiding it is
right.
One forward-looking note, not a blocker. Both mappings drop Spark's interval
field qualifiers. `YearMonthIntervalType(YEAR, YEAR)` and
`YearMonthIntervalType(YEAR, MONTH)` both serialize to the same
`Interval(YearMonth)`, and on the way back `fromArrowType` produces the default
`YearMonthIntervalType()` (YEAR TO MONTH); same story for `DayTimeIntervalType`
start/end fields. This is harmless here because the only expressions wired are
`make_ym_interval` and `make_dt_interval`, both of which Spark types with the
default fields (`YearMonthIntervalType()`, `DayTimeIntervalType()`), and
interval columns are not scanned yet. But once #4540 adds interval scans or
field-qualified interval handling, this normalization will change a column's
declared type. Worth a tracking note so it is not lost.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]