Matt McCline created HIVE-17433:
-----------------------------------
Summary: Vectorization: Support Decimal64 in Hive Query Engine
Key: HIVE-17433
URL: https://issues.apache.org/jira/browse/HIVE-17433
Project: Hive
Issue Type: Bug
Components: Hive
Reporter: Matt McCline
Assignee: Matt McCline
Priority: Critical
Provide partial support for Decimal64 within Hive. By partial I mean that our
current decimal has a large surface area of features (rounding, multiply,
divide, remainder, power, big precision, and many more), but only a small number
of them have been identified as performance hotspots.
Those hotspots involve small decimals with precision <= 18, which fit within a
64-bit long; we are calling this representation Decimal64. Just as we optimize
row-mode execution engine hotspots by selectively adding new vectorization code,
we can treat the current decimal as the full-featured one and add Decimal64
optimizations where query benchmarks really show they help.
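As a rough illustration of the representation described above, the following sketch encodes a decimal as its unscaled value in a long. The class and method names (Decimal64Repr, toDecimal64, fromDecimal64, fitsDecimal64) are hypothetical and are not taken from the patch; the only facts assumed are that precision <= 18 qualifies and that Long.MAX_VALUE has 19 digits, so every 18-digit unscaled value fits.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class Decimal64Repr {
    // Every unscaled value with at most 18 digits fits in a signed 64-bit long.
    static final int MAX_DECIMAL64_PRECISION = 18;

    // Hypothetical helper: does a decimal(precision, scale) type qualify?
    static boolean fitsDecimal64(int precision) {
        return precision <= MAX_DECIMAL64_PRECISION;
    }

    // Encode a decimal as its unscaled long at the given scale.
    // RoundingMode.UNNECESSARY makes setScale throw if digits would be lost.
    static long toDecimal64(BigDecimal value, int scale) {
        return value.setScale(scale, RoundingMode.UNNECESSARY)
                    .unscaledValue()
                    .longValueExact();
    }

    // Decode an unscaled long back into a BigDecimal with the given scale.
    static BigDecimal fromDecimal64(long unscaled, int scale) {
        return BigDecimal.valueOf(unscaled, scale);
    }

    public static void main(String[] args) {
        long encoded = toDecimal64(new BigDecimal("100"), 5);
        System.out.println(encoded);                    // prints 10000000
        System.out.println(fromDecimal64(encoded, 5));  // prints 100.00000
    }
}
```

With scale 5, the constant 100 becomes the long 10000000, which is how a scalar such as 100BD would appear once scaled to match a decimal column with five fractional digits.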
This change creates a Decimal64ColumnVector.
It currently detects small decimals in Hive for the vectorized text input
format and uses some new Decimal64 vectorized classes for comparison and
addition, and later perhaps for a few GroupBy aggregations such as sum, avg,
min, and max.
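To make the idea of a long-backed column vector concrete, here is a minimal sketch. Decimal64Column is a hypothetical, simplified stand-in and not the Decimal64ColumnVector from the patch; it ignores nulls, repeating values, and selection vectors, and it assumes both inputs share a scale and that the output precision is wide enough to hold the sum.

```java
public class Decimal64VectorSketch {
    // Hypothetical, simplified stand-in for a Decimal64 column vector:
    // one unscaled long per row plus the decimal type's precision and scale.
    static final class Decimal64Column {
        final long[] vector;
        final int precision;
        final int scale;

        Decimal64Column(int size, int precision, int scale) {
            this.vector = new long[size];
            this.precision = precision;
            this.scale = scale;
        }
    }

    // Add the first n rows of two columns with the same scale.
    // Because the values are plain longs, the loop is a simple long addition.
    static void addColCol(Decimal64Column a, Decimal64Column b,
                          Decimal64Column out, int n) {
        if (a.scale != b.scale || a.scale != out.scale) {
            throw new IllegalArgumentException("scales must match in this sketch");
        }
        for (int i = 0; i < n; i++) {
            out.vector[i] = a.vector[i] + b.vector[i];
        }
    }

    public static void main(String[] args) {
        Decimal64Column a = new Decimal64Column(1, 5, 2);
        Decimal64Column b = new Decimal64Column(1, 5, 2);
        Decimal64Column out = new Decimal64Column(1, 6, 2);
        a.vector[0] = 150;  // 1.50 at scale 2
        b.vector[0] = 225;  // 2.25 at scale 2
        addColCol(a, b, out, 1);
        System.out.println(out.vector[0]);  // prints 375, i.e. 3.75 at scale 2
    }
}
```

The point of the design is that the inner loop touches only primitive longs, with no per-row decimal objects.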
The patch also supports a new annotation that marks a VectorizedInputFormat
as supporting Decimal64 (the feature is called DECIMAL_64). Support for other
formats such as ORC, PARQUET, etc. can then be added in later JIRAs so that they
participate in the Decimal64 performance optimization.
The idea is that when you annotate your input format with:
@VectorizedInputFormatSupports(supports = {DECIMAL_64})
the Vectorizer in Hive will plan usage of Decimal64ColumnVector instead of
DecimalColumnVector. When an input format sees that Decimal64ColumnVector is
being used, it can fill that column vector with decimal64 longs instead
of the HiveDecimalWritable objects of DecimalColumnVector.
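The annotation-driven choice described above could be sketched roughly as follows. To keep the example self-contained, it declares its own stand-in annotation and enum with the same names as in the description; the real definitions live in Hive and may differ in package, targets, and members. The TextLikeInputFormat and LegacyInputFormat classes and the chooseColumnVector helper are hypothetical.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.Arrays;

public class AnnotationSketch {
    // Stand-in for the feature enum; only DECIMAL_64 is named in the issue.
    enum Support { DECIMAL_64 }

    // Stand-in annotation, retained at runtime so a planner can read it.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface VectorizedInputFormatSupports {
        Support[] supports() default {};
    }

    // Hypothetical input format that declares Decimal64 support.
    @VectorizedInputFormatSupports(supports = {Support.DECIMAL_64})
    static class TextLikeInputFormat { }

    // Hypothetical input format with no declared support.
    static class LegacyInputFormat { }

    // Read the annotation reflectively and check for DECIMAL_64.
    static boolean supportsDecimal64(Class<?> inputFormatClass) {
        VectorizedInputFormatSupports a =
            inputFormatClass.getAnnotation(VectorizedInputFormatSupports.class);
        return a != null && Arrays.asList(a.supports()).contains(Support.DECIMAL_64);
    }

    // Pick which column vector type a planner would use for small decimals.
    static String chooseColumnVector(Class<?> inputFormatClass) {
        return supportsDecimal64(inputFormatClass)
            ? "Decimal64ColumnVector"
            : "DecimalColumnVector";
    }

    public static void main(String[] args) {
        System.out.println(chooseColumnVector(TextLikeInputFormat.class));
        System.out.println(chooseColumnVector(LegacyInputFormat.class));
    }
}
```

An input format without the annotation keeps getting DecimalColumnVector, so existing formats continue to work unchanged until they opt in.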
There will be a Hive configuration variable,
hive.vectorized.input.format.supports.enabled, that holds a string list of
supported features. The default will start as "decimal_64". It can be turned
off to allow for performance comparisons and testing.
The query
SELECT * FROM DECIMAL_6_1_txt WHERE key - 100BD < 200BD ORDER BY key, value
will have a vectorized explain plan that looks like:
...
Filter Operator
  Filter Vectorization:
      className: VectorFilterOperator
      native: true
      predicateExpression: FilterDecimal64ColLessDecimal64Scalar(col 2, val 20000000)(children: Decimal64ColSubtractDecimal64Scalar(col 0, val 10000000, outputDecimal64AbsMax 99999999999) -> 2:decimal(11,5)/DECIMAL_64) -> boolean
  predicate: ((key - 100) < 200) (type: boolean)
...
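The numbers in the plan above can be traced with a small sketch. At scale 5, 100BD becomes 10000000 and 200BD becomes 20000000, and 99999999999 is 10^11 - 1, the largest unscaled magnitude of a decimal(11,5). The class and method names below are hypothetical, the sample key values are invented for illustration, and throwing on an out-of-range result is an assumption, since the issue does not say how the real expressions handle that case.

```java
import java.util.Arrays;

public class Decimal64PlanSketch {
    // Largest unscaled magnitude for a precision (valid for precision <= 18).
    static long absMax(int precision) {
        long m = 1;
        for (int i = 0; i < precision; i++) {
            m *= 10;
        }
        return m - 1;
    }

    // Subtract a scaled scalar from every row; all values share one scale.
    static long[] colSubtractScalar(long[] col, long scalar, long outputAbsMax) {
        long[] out = new long[col.length];
        for (int i = 0; i < col.length; i++) {
            long r = col[i] - scalar;
            if (Math.abs(r) > outputAbsMax) {
                // Assumption: this sketch simply rejects out-of-range results.
                throw new ArithmeticException("result exceeds output precision");
            }
            out[i] = r;
        }
        return out;
    }

    // Return the row indexes whose value is less than the scaled scalar.
    static int[] filterColLessScalar(long[] col, long scalar) {
        int[] selected = new int[col.length];
        int n = 0;
        for (int i = 0; i < col.length; i++) {
            if (col[i] < scalar) {
                selected[n++] = i;
            }
        }
        return Arrays.copyOf(selected, n);
    }

    public static void main(String[] args) {
        // Invented keys at scale 5: 50.00000, 299.99999, 300.00000, 1000.50000
        long[] key = {5000000L, 29999999L, 30000000L, 100050000L};
        long[] diff = colSubtractScalar(key, 10000000L, absMax(11)); // key - 100
        int[] rows = filterColLessScalar(diff, 20000000L);           // ... < 200
        System.out.println(Arrays.toString(rows));  // prints [0, 1]
    }
}
```

Both steps operate on raw longs, which is the performance benefit the Decimal64 expressions in the plan are meant to capture.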