Yeah, the initial nanosecond timestamp support in Spark SQL follows
Impala and uses INT96, for interoperability with Impala. In Spark
1.5.0-SNAPSHOT (the current master branch) we still write timestamps
as INT96, but internally Spark SQL now represents them with a single
LONG for better performance. The cost is that the precision is lowered
to 100ns.
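To make the precision trade-off concrete, here is a rough Python sketch
(not Spark's actual code). It assumes the INT96 layout as I understand
Impala writes it: 8 little-endian bytes of nanoseconds within the day,
followed by 4 little-endian bytes holding the Julian day number. The
function names are made up for illustration, and using the Unix epoch
as the base for the 100ns LONG is my assumption.

```python
import struct

# Julian day number of 1970-01-01, used to rebase onto the Unix epoch.
JULIAN_DAY_OF_UNIX_EPOCH = 2440588
NANOS_PER_DAY = 86_400 * 1_000_000_000


def decode_int96_timestamp(raw: bytes) -> int:
    """Decode a 12-byte INT96 timestamp into nanoseconds since the Unix epoch.

    Assumed layout: int64 nanoseconds-of-day, then int32 Julian day,
    both little-endian.
    """
    if len(raw) != 12:
        raise ValueError("INT96 timestamp must be exactly 12 bytes")
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days_since_epoch = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    return days_since_epoch * NANOS_PER_DAY + nanos_of_day


def to_100ns_ticks(epoch_nanos: int) -> int:
    """Collapse nanoseconds into 100ns units that fit in one signed 64-bit value.

    The last two decimal digits of the nanosecond value are discarded.
    """
    return epoch_nanos // 100


# Hypothetical value: one day after the epoch plus 123,456,789 ns.
raw = struct.pack("<qi", 123_456_789, JULIAN_DAY_OF_UNIX_EPOCH + 1)
nanos = decode_int96_timestamp(raw)
ticks = to_100ns_ticks(nanos)
```

Here `nanos` is 86400123456789 while `ticks` is 864001234567, so the
trailing 89ns cannot be recovered when the value is written back out
as INT96.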
Since INT96 is being deprecated, what's the suggested/planned way to
read/write high-precision nanosecond timestamps? Spark SQL, Hive, and
Impala all have a nanosecond timestamp type, while the Parquet format
spec doesn't include one (only TIMESTAMP_MILLIS and TIMESTAMP_MICROS
are available for now). Should we add a TIMESTAMP_NANOS annotation over
FIXED_LEN_BYTE_ARRAY(12), along with corresponding backwards-compatibility
rules?
Cheng
On 6/24/15 1:21 PM, Nathan Howell wrote:
> On 6/24/15, 1:17 PM, "Ryan Blue" <[email protected]> wrote:
>> :(
>> We'll want to deprecate those and move away from them. We're trying to
>> get support for real timestamps, along with backward-compatibility for
>> existing data, as soon as possible. I'm trying to get a commitment for
>> the next point release of CDH to fix it.
> Actually it seems to have been added in 1.3.0, not 1.4.0:
> https://issues.apache.org/jira/browse/SPARK-4987
> -n