GitHub user liancheng commented on the pull request:
https://github.com/apache/spark/pull/6796#issuecomment-116766442
Hi @rtreffer,
> - How is the compatibility mode intended to work? Settings are currently
private, but I'd like to store Decimal(19), so is lifting the 18 limit correct
for compatibility mode?
Compatibility mode is enabled by setting
`spark.sql.parquet.followParquetFormatSpec` to `false`. This mode must stay
enabled for now, because the write path hasn't been refactored to follow the
Parquet format spec yet. Note that compatibility mode only affects the write
path: the read path doesn't need such a switch, since the Parquet format spec
already covers legacy formats via various backwards-compatibility rules.
Decimals with precision > 18 can be allowed even in compatibility mode,
because doing so doesn't affect compatibility: old Spark versions have never
been able to read decimals with precision > 18 in the first place.
What do you mean by saying "settings are currently private"?
`SQLConf.PARQUET_FOLLOW_PARQUET_FORMAT_SPEC` is `private[spark]`, so all
classes under `org.apache.spark` can access it.
> - INT32/INT64 are only used when the byte length matches the byte length
for the precision. FIXED_LEN_BYTE_ARRAY will thus e.g. be used to store 6 byte
values
I see your point. You mentioned a "debate" in [this comment][1]; were you
referring to [this one][2]? From the perspective of storage efficiency, it
probably makes sense. (I say "probably" because I'm not quite sure about the
average case once encoding/compression is taken into account.) However, in
the case of Parquet, we usually care more about speed and memory consumption.
In particular, Parquet can be extremely memory hungry when reading files with
a wide schema (i.e., a large number of columns). A key advantage of `INT32`
and `INT64` is that they avoid boxing costs in many cases, and thus can be
faster and use less memory. You also don't need all those bit operations to
encode/decode the unscaled long value of a decimal when using `INT32` and
`INT64`.
In the meantime, Parquet handles `INT32` and `INT64` pretty efficiently.
There are more encodings for integral types than for binaries (whether
fixed-length or not; see [Encodings.md][3] for more details). Although I
haven't benchmarked this, I believe that in many cases the storage efficiency
of `INT32` is comparable to or even better than that of `FIXED_LEN_BYTE_ARRAY`
with a length less than 4. The same should also apply to `INT64`.
So I suggest: when compatibility mode is off, just use `INT32` for 1 <=
precision <= 9 and `INT64` for 10 <= precision <= 18 when converting
`DecimalType`s in `CatalystSchemaConverter`. Once we refactor the write path
to follow the Parquet format spec, we can write decimals as `INT32` and
`INT64` where appropriate in follow-up PRs.
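Concretely, the suggested mapping could look roughly like this (a sketch of my suggestion above, assuming parquet-mr's `PrimitiveTypeName` enum; not the PR's actual code):

```scala
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName._

// Physical Parquet type per decimal precision when compatibility mode is off.
def physicalTypeFor(precision: Int): PrimitiveTypeName = precision match {
  case p if 1 <= p && p <= 9 => INT32                 // unscaled value fits in an Int
  case p if p <= 18          => INT64                 // unscaled value fits in a Long
  case _                     => FIXED_LEN_BYTE_ARRAY  // arbitrary precision
}
```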
The TL;DR is: I'd just remove `precision <= maxPrecisionForBytes(8)` on
[this line][4] and leave everything else unmodified (your comment updates
look good to me though :)
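For reference, `maxPrecisionForBytes(8)` is 18, the largest precision whose values all fit in a signed 8-byte integer; the arithmetic presumably looks something like this (an assumption about the helper's shape, not a copy of the PR code):

```scala
// Largest base-10 precision representable in `numBytes` bytes of signed
// two's complement: floor(log10(2^(8 * numBytes - 1) - 1)).
def maxPrecisionForBytes(numBytes: Int): Int =
  math.floor(math.log10(math.pow(2, 8 * numBytes - 1) - 1)).toInt

maxPrecisionForBytes(4)   // 9
maxPrecisionForBytes(8)   // 18
maxPrecisionForBytes(16)  // 38 (Hive's maximum)
```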
> - FIXED_LEN_BYTE_ARRAY means I'll have to create an array of the correct
size. I've increased the scratch_bytes. Not very happy about the code path, do
you have better ideas?
Hive limits the maximum precision of a decimal to 38, which fits in 16 bytes,
so 16 bytes rather than 4096 should be enough for most cases. It would also
be better to refactor the two branches of [this `if` expression][5] into two
separate methods for clarity. Otherwise it looks good.
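As an illustration, the binary branch split into its own method could look roughly like this (hypothetical names and a 16-byte scratch buffer; a sketch, not the PR's actual code):

```scala
object DecimalWriteSketch {
  // 16 bytes covers Hive's maximum precision of 38.
  private val scratch = new Array[Byte](16)

  // Sign-extends `unscaled` into exactly `numBytes` bytes of big-endian
  // two's complement, reusing the scratch buffer.
  def toFixedLenBytes(unscaled: BigInt, numBytes: Int): Array[Byte] = {
    val raw = unscaled.toByteArray  // minimal two's-complement form
    require(raw.length <= numBytes && numBytes <= scratch.length)
    val pad: Byte = if (unscaled.signum < 0) -1 else 0
    java.util.Arrays.fill(scratch, 0, numBytes - raw.length, pad)
    System.arraycopy(raw, 0, scratch, numBytes - raw.length, raw.length)
    scratch.take(numBytes)  // real code would hand (scratch, 0, numBytes) to Parquet without copying
  }
}
```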
> - BYTES_FOR_PRECISION needs to handle any precision. I've reworked that
code. Again, suggestions welcome
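As for handling arbitrary precision, one simple shape for that computation (my guess at it, not necessarily your rework) is to search for the smallest signed byte width that covers 10^precision - 1:

```scala
// Smallest number of bytes whose signed two's-complement range can hold any
// unscaled value of the given decimal precision; BigInt keeps the comparison
// exact for arbitrarily large precisions.
def minBytesForPrecision(precision: Int): Int = {
  var numBytes = 1
  while (BigInt(2).pow(8 * numBytes - 1) - 1 < BigInt(10).pow(precision) - 1) {
    numBytes += 1
  }
  numBytes
}

// minBytesForPrecision(18) == 8, minBytesForPrecision(19) == 9,
// minBytesForPrecision(38) == 16
```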
(See my other inline comments.)
[1]: https://github.com/apache/spark/pull/6796#discussion_r33420742
[2]: https://github.com/apache/spark/pull/6796#discussion_r32891515
[3]: https://github.com/Parquet/parquet-format/blob/master/Encodings.md
[4]: https://github.com/apache/spark/pull/6796/files#diff-a4c01298c63223d113645a31c01141baL377
[5]: https://github.com/apache/spark/pull/6796/files#diff-83ef4d5f1029c8bebb49a0c139fa3154R301