Max Gekk created SPARK-57162:
--------------------------------
Summary: Add nanosecond-aware TimestampFormatter for parsing and
formatting TimestampNanosVal with precision
Key: SPARK-57162
URL: https://issues.apache.org/jira/browse/SPARK-57162
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 4.3.0
Reporter: Max Gekk
h2. What
Extend the {{TimestampFormatter}} family so it can parse a string into
{{org.apache.spark.unsafe.types.TimestampNanosVal}} ({{epochMicros: Long}} +
{{nanosWithinMicro: Short}} in [0, 999]) and format a {{TimestampNanosVal}}
back to a string with a target fractional precision {{p}} in [7, 9].
Parent: SPARK-56822. Builds on SPARK-57032 (raw string parsing for nanosecond
fractional precision), which covers only
{{SparkDateTimeUtils.parseTimestampString}}, not the pattern-based / format
(write) side used by datasources.
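The two-field representation can be illustrated with a minimal, standalone sketch. {{NanosValSketch}} is a hypothetical stand-in, not the real {{TimestampNanosVal}} (whose constructor and accessors are not shown in this ticket); it only assumes the {{epochMicros}} + {{nanosWithinMicro}} in [0, 999] layout described above, and shows one way to split a {{java.time.Instant}} into that pair and back.

```java
import java.time.Instant;

// Hypothetical stand-in for TimestampNanosVal: the real class's constructor
// and accessors are not shown in this ticket and may differ.
final class NanosValSketch {
    final long epochMicros;       // microseconds since 1970-01-01T00:00:00Z
    final short nanosWithinMicro; // remaining nanoseconds, always in [0, 999]

    NanosValSketch(long epochMicros, short nanosWithinMicro) {
        if (nanosWithinMicro < 0 || nanosWithinMicro > 999) {
            throw new IllegalArgumentException(
                "nanosWithinMicro out of [0, 999]: " + nanosWithinMicro);
        }
        this.epochMicros = epochMicros;
        this.nanosWithinMicro = nanosWithinMicro;
    }

    static NanosValSketch fromInstant(Instant instant) {
        // Instant.getNano() is always in [0, 999_999_999], including for
        // pre-epoch instants, so no sign handling is needed here.
        int nanoOfSecond = instant.getNano();
        long micros = Math.addExact(
            Math.multiplyExact(instant.getEpochSecond(), 1_000_000L),
            nanoOfSecond / 1_000);
        return new NanosValSketch(micros, (short) (nanoOfSecond % 1_000));
    }

    Instant toInstant() {
        // floorDiv/floorMod keep the micro-of-second non-negative before the epoch.
        long seconds = Math.floorDiv(epochMicros, 1_000_000L);
        long microOfSecond = Math.floorMod(epochMicros, 1_000_000L);
        return Instant.ofEpochSecond(seconds, microOfSecond * 1_000 + nanosWithinMicro);
    }
}
```

With this layout, the last nanosecond before the epoch maps to {{epochMicros = -1}} and {{nanosWithinMicro = 999}}, so the sub-microsecond part never goes negative.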
h2. Why
Today {{TimestampFormatter}} is microsecond-only: every {{parse}} /
{{parseWithoutTimeZone}} returns a {{Long}} of epoch microseconds, and every
{{format}} overload consumes microseconds.
{{Iso8601TimestampFormatter.extractMicros}} reads
{{ChronoField.MICRO_OF_SECOND}}, discarding the 7th-9th fractional digits, and
the legacy {{FAST_DATE_FORMAT}} path caps at millisecond/microsecond resolution.
There is no API that yields or consumes {{TimestampNanosVal}}.
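The digit loss is easy to reproduce with plain java.time, independent of Spark. The sketch below is not Spark code; {{MicroVsNanoDemo}} and its method names are made up for illustration. It reads the same parsed value through {{MICRO_OF_SECOND}} and {{NANO_OF_SECOND}} to show which digits the microsecond field drops.

```java
import java.time.LocalDateTime;
import java.time.temporal.ChronoField;

// Illustration only: class and method names are hypothetical.
final class MicroVsNanoDemo {
    // What a microsecond-only extraction sees: digits 1-6 of the fraction.
    static int microOfSecond(String isoLocalDateTime) {
        return LocalDateTime.parse(isoLocalDateTime).get(ChronoField.MICRO_OF_SECOND);
    }

    // The full 9-digit fraction as parsed by java.time.
    static int nanoOfSecond(String isoLocalDateTime) {
        return LocalDateTime.parse(isoLocalDateTime).get(ChronoField.NANO_OF_SECOND);
    }

    // Digits 7-9: the remainder a nanos-aware formatter would have to keep.
    static int nanosWithinMicro(String isoLocalDateTime) {
        return nanoOfSecond(isoLocalDateTime) % 1_000;
    }
}
```

For the input {{2024-01-01T00:00:00.123456789}}, the microsecond field yields 123456, while the nanosecond field yields 123456789, leaving 789 as the remainder within the microsecond.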
The JSON and CSV datasources (and other text-based paths) drive all timestamp
parsing and formatting through {{TimestampFormatter}} with user-supplied
{{timestampFormat}} patterns, so they cannot round-trip 7-9 digit fractions
until the formatter is nanos-aware. This ticket therefore unblocks nanosecond
support in those datasources.
h2. Scope
{{sql/api/.../util/TimestampFormatter.scala}}
* Add nanos-aware parse methods returning {{TimestampNanosVal}} (LTZ and NTZ /
without-time-zone variants), and {{Optional}} counterparts mirroring
{{parseOptional}} / {{parseWithoutTimeZoneOptional}}.
* Add format methods accepting {{TimestampNanosVal}} plus the target precision
{{p}}, with defined truncation/rounding of sub-precision digits.
* Cover the implementations: {{Iso8601TimestampFormatter}} (extend
{{extractMicros}} to also capture {{NANO_OF_SECOND}} remainder),
{{DefaultTimestampFormatter}} (delegate to the SPARK-57032 nanos parse), and
the legacy {{LegacyFastTimestampFormatter}} (define behavior or explicitly
reject nanos in LEGACY mode).
* Support fraction patterns up to 9 digits ({{[.SSSSSSS]}} ..
{{[.SSSSSSSSS]}}) in both parse and format ({{DateTimeFormatterHelper}} already
appends a {{NANO_OF_SECOND}} fraction of 0..9 digits).
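How such fraction patterns behave can be sketched with plain java.time. This is an illustration, not the {{DateTimeFormatterHelper}} code: the class name, field names, and the {{yyyy-MM-dd HH:mm:ss}} base pattern are assumptions, and Spark applies its own pattern handling on top of {{DateTimeFormatter}}.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;

// Illustration only: names and the base pattern are hypothetical.
final class FractionPatternSketch {
    // Variable-width fraction: 0 to 9 digits, with a decimal point when present.
    static final DateTimeFormatter VARIABLE_FRACTION = new DateTimeFormatterBuilder()
        .appendPattern("yyyy-MM-dd HH:mm:ss")
        .appendFraction(ChronoField.NANO_OF_SECOND, 0, 9, true)
        .toFormatter();

    // Optional fixed-width 7-digit fraction, as in a [.SSSSSSS] pattern.
    static final DateTimeFormatter OPTIONAL_SEVEN =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss[.SSSSSSS]");

    // Mandatory fixed-width 9-digit fraction.
    static final DateTimeFormatter FIXED_NINE =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSSSSS");

    static int parseNanoOfSecond(DateTimeFormatter formatter, String text) {
        return LocalDateTime.parse(text, formatter).getNano();
    }
}
```

On format, the variable-width formatter drops trailing zeros, while the fixed-width patterns pad to their digit count. A precision-driven format method would need to pick one of these behaviors explicitly.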
h2. Out of scope
* JSON/CSV converter and schema-inference wiring (separate sub-tasks; they
depend on this).
* Raw string parsing already handled by SPARK-57032.
* Datasource option additions.
h2. Design notes
* Precision {{p}} controls how many fractional digits are emitted on format and
how sub-precision input is handled on parse (truncate vs. round). Document and
test the chosen rule.
* Reuse the existing {{TimestampNanosVal}} normalization invariant
({{nanosWithinMicro}} in [0, 999]); carry overflow into {{epochMicros}}.
* Keep all existing microsecond methods unchanged (additive API).
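The two candidate rules and the carry into {{epochMicros}} can be sketched as follows. The ticket leaves the choice between truncation and rounding open, so this shows both without implying which one will be adopted. {{PrecisionSketch}} and all of its method names are hypothetical, and half-up is only one possible rounding mode.

```java
import java.util.Locale;

// Illustration only: not a Spark API. Results are {epochMicros, nanosWithinMicro}.
final class PrecisionSketch {
    // Size in nanoseconds of the last kept digit: p=7 -> 100, p=8 -> 10, p=9 -> 1.
    static int unitNanos(int p) {
        switch (p) {
            case 7: return 100;
            case 8: return 10;
            case 9: return 1;
            default:
                throw new IllegalArgumentException("precision must be in [7, 9]: " + p);
        }
    }

    // Truncation zeroes the digits below precision p and never carries.
    static long[] truncate(long epochMicros, int nanosWithinMicro, int p) {
        int unit = unitNanos(p);
        return new long[] {epochMicros, (nanosWithinMicro / unit) * unit};
    }

    // Half-up rounding can reach 1000 ns, which is carried into epochMicros
    // so that nanosWithinMicro stays in [0, 999].
    static long[] roundHalfUp(long epochMicros, int nanosWithinMicro, int p) {
        int unit = unitNanos(p);
        int rounded = ((nanosWithinMicro + unit / 2) / unit) * unit;
        if (rounded >= 1_000) {
            return new long[] {Math.addExact(epochMicros, 1L), rounded - 1_000};
        }
        return new long[] {epochMicros, rounded};
    }

    // First p fractional-second digits: 6 from the micros, then p - 6 from the nanos.
    static String fractionDigits(long epochMicros, int nanosWithinMicro, int p) {
        unitNanos(p); // validates p
        long microOfSecond = Math.floorMod(epochMicros, 1_000_000L);
        return String.format(Locale.ROOT, "%06d%03d", microOfSecond, nanosWithinMicro)
            .substring(0, p);
    }
}
```

For example, with {{p = 7}} a remainder of 950 ns truncates to 900 ns but rounds up to the next microsecond, which is why the carry has to be part of whichever rounding rule is chosen.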
h2. How was this patch tested
* {{TimestampFormatterSuite}} (or new cases): parse/format round-trip for p in
[7, 9] across ISO default and custom patterns; boundary values
(nanosWithinMicro 0 and 999, pre-epoch instants, Long micro boundaries);
LEGACY-mode behavior; truncation/rounding rule.
h2. Does this PR introduce any user-facing change
No. The formatter API is additive, and its callers gate its use behind
{{spark.sql.timestampNanosTypes.enabled}}.