GitHub user liancheng commented on the pull request:
https://github.com/apache/spark/pull/6796#issuecomment-116766442
Hi @rtreffer,
> - How is the compatibility mode intended to work? Settings are currently
private, but I'd like to store Decimal(19), so is lifting the 18 limit correct
for compatibility mode?
Compatibility mode is enabled by setting
`spark.sql.parquet.followParquetFormatSpec` to `false`. This mode must stay
enabled for now, because the write path hasn't been refactored to follow the
Parquet format spec yet. Note that compatibility mode only affects the write
path: the read path doesn't need such a switch, since the Parquet format spec
already covers legacy formats via various backwards-compatibility rules.
Decimals with precision > 18 can be allowed even in compatibility mode,
because doing so doesn't affect compatibility: old Spark versions have never
been able to read decimals with precision > 18 in the first place.
What do you mean by saying "settings are currently private"?
`SQLConf.PARQUET_FOLLOW_PARQUET_FORMAT_SPEC` is `private[spark]`, so all
classes under `org.apache.spark` can access it.
> - INT32/INT64 are only used when the byte length matches the byte length
for the precision. FIXED_LEN_BYTE_ARRAY will thus e.g. be used to store 6 byte
values
I see your point. You mentioned a "debate" in [this comment][1]; were you
referring to [this one][2]? From the perspective of storage efficiency, it
probably makes sense. (I say "probably" because I'm not quite sure about the
average case once encoding/compression is taken into account.) However, in
the case of Parquet, we usually care more about speed and memory consumption.
In particular, Parquet can be extremely memory hungry when reading files with
a wide schema (i.e., a large number of columns). A key advantage of `INT32`
and `INT64` is that they avoid boxing costs in many cases, and thus can be
faster and use less memory. You also don't need all those bit operations to
encode/decode the unscaled long value of a decimal when using `INT32` and
`INT64`.
In the meantime, Parquet handles `INT32` and `INT64` pretty efficiently.
There are more encodings for integral types than for binaries (whether
fixed-length or not; see [Encodings.md][3] for more details). Although I
haven't benchmarked this, I believe that in many cases the storage efficiency
of `INT32` is comparable to or even better than that of `FIXED_LEN_BYTE_ARRAY`
with a length less than 4. The same should also apply to `INT64`.
So I suggest: when compatibility mode is off, just use `INT32` for 1 <=
precision <= 9 and `INT64` for 10 <= precision <= 18 when converting
`DecimalType`s in `CatalystSchemaConverter`. Once we refactor the write path
to follow the Parquet format spec, we can write decimals as `INT32` and
`INT64` where appropriate in follow-up PRs.
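Concretely, the suggested mapping could look roughly like this (a sketch of my suggestion above, assuming parquet-mr's `PrimitiveTypeName` enum; not the PR's actual code):

```scala
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName._

// Physical Parquet type per decimal precision when compatibility mode is off.
def physicalTypeFor(precision: Int): PrimitiveTypeName = precision match {
  case p if 1 <= p && p <= 9 => INT32                 // unscaled value fits in an Int
  case p if p <= 18          => INT64                 // unscaled value fits in a Long
  case _                     => FIXED_LEN_BYTE_ARRAY  // arbitrary precision
}
```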
The TL;DR is: I'd just remove `precision <= maxPrecisionForBytes(8)` on
[this line][4] and leave everything else unmodified (your comment updates
look good to me though :)
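For reference, `maxPrecisionForBytes(8)` is 18, the largest precision whose values all fit in a signed 8-byte integer; the arithmetic presumably looks something like this (an assumption about the helper's shape, not a copy of the PR code):

```scala
// Largest base-10 precision representable in `numBytes` bytes of signed
// two's complement: floor(log10(2^(8 * numBytes - 1) - 1)).
def maxPrecisionForBytes(numBytes: Int): Int =
  math.floor(math.log10(math.pow(2, 8 * numBytes - 1) - 1)).toInt

maxPrecisionForBytes(4)   // 9
maxPrecisionForBytes(8)   // 18
maxPrecisionForBytes(16)  // 38 (Hive's maximum)
```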
> - FIXED_LEN_BYTE_ARRAY means I'll have to create an array of the correct
size. I've increased the scratch_bytes. Not very happy about the code path, do
you have better ideas?
Hive limits the maximum precision of a decimal to 38, which fits in 16 bytes,
so 16 bytes rather than 4096 should be enough for most cases. It would also
be better to refactor the two branches of [this `if` expression][5] into two
separate methods for clarity. Otherwise it looks good.
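As an illustration, the binary branch split into its own method could look roughly like this (hypothetical names and a 16-byte scratch buffer; a sketch, not the PR's actual code):

```scala
object DecimalWriteSketch {
  // 16 bytes covers Hive's maximum precision of 38.
  private val scratch = new Array[Byte](16)

  // Sign-extends `unscaled` into exactly `numBytes` bytes of big-endian
  // two's complement, reusing the scratch buffer.
  def toFixedLenBytes(unscaled: BigInt, numBytes: Int): Array[Byte] = {
    val raw = unscaled.toByteArray  // minimal two's-complement form
    require(raw.length <= numBytes && numBytes <= scratch.length)
    val pad: Byte = if (unscaled.signum < 0) -1 else 0
    java.util.Arrays.fill(scratch, 0, numBytes - raw.length, pad)
    System.arraycopy(raw, 0, scratch, numBytes - raw.length, raw.length)
    scratch.take(numBytes)  // real code would hand (scratch, 0, numBytes) to Parquet without copying
  }
}
```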
> - BYTES_FOR_PRECISION needs to handle any precision. I've reworked that
code. Again, suggestions welcome
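As for handling arbitrary precision, one simple shape for that computation (my guess at it, not necessarily your rework) is to search for the smallest signed byte width that covers 10^precision - 1:

```scala
// Smallest number of bytes whose signed two's-complement range can hold any
// unscaled value of the given decimal precision; BigInt keeps the comparison
// exact for arbitrarily large precisions.
def minBytesForPrecision(precision: Int): Int = {
  var numBytes = 1
  while (BigInt(2).pow(8 * numBytes - 1) - 1 < BigInt(10).pow(precision) - 1) {
    numBytes += 1
  }
  numBytes
}

// minBytesForPrecision(18) == 8, minBytesForPrecision(19) == 9,
// minBytesForPrecision(38) == 16
```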
(See my other inline comments.)
[1]: https://github.com/apache/spark/pull/6796#discussion_r33420742
[2]: https://github.com/apache/spark/pull/6796#discussion_r32891515
[3]: https://github.com/Parquet/parquet-format/blob/master/Encodings.md
[4]: https://github.com/apache/spark/pull/6796/files#diff-a4c01298c63223d113645a31c01141baL377
[5]: https://github.com/apache/spark/pull/6796/files#diff-83ef4d5f1029c8bebb49a0c139fa3154R301