[jira] [Updated] (HIVE-19200) Vectorization: Disable vectorization for LLAP I/O when a non-VECTORIZED_INPUT_FILE_FORMAT mode is needed (i.e. rows) and data type conversion is needed

Matt McCline (JIRA) Fri, 13 Apr 2018 09:38:42 -0700

     [ 
https://issues.apache.org/jira/browse/HIVE-19200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Matt McCline updated HIVE-19200:
--------------------------------
    Description: 
Disable vectorization for the issue in HIVE-18763 until we can write the harder 
VRB conversion code.

The main changes are:

1) In the Vectorizer, detect if data type conversion is needed between the 
partition and the desired table schema.  If so, and LLAP I/O is enabled in 
encoded caching mode, then do not vectorize.  Why? When LLAP I/O is in encoded 
caching mode, it delivers VectorizedRowBatch (VRB) to the VectorMapOperator 
instead of (object) rows.  We currently do not have logic for converting VRBs.  
So, we either get wrong results or, more likely, a ClassCastException on the 
expected vs actual ColumnVector columns.

2) Cleaned up error message logic that was suppressing the new message from the 
EXPLAIN VECTORIZATION display.

3) NOTE: Some of the SELECT statements in the schema_evol_test*.q files are 
commented out because I bumped into another bug.  I'll file that one soon and 
add comments to the Q files.
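
The check in item 1 can be sketched roughly as follows.  This is only an 
illustration, not the code in the patch: VectorizationGuard, 
needsTypeConversion and shouldVectorize are hypothetical names, type names are 
plain strings rather than Hive TypeInfo objects, and columns are compared by 
position on the assumption that a partition may have fewer columns than the 
table.

```java
import java.util.List;

/** Hypothetical, simplified stand-in for the Vectorizer check; not Hive's API. */
final class VectorizationGuard {

    /**
     * Returns true when a partition column type differs from the table column
     * type at the same position. Only the columns present in both schemas are
     * compared (assumption: trailing table columns may be missing in the partition).
     */
    static boolean needsTypeConversion(List<String> partitionTypes,
                                       List<String> tableTypes) {
        int shared = Math.min(partitionTypes.size(), tableTypes.size());
        for (int i = 0; i < shared; i++) {
            if (!partitionTypes.get(i).equalsIgnoreCase(tableTypes.get(i))) {
                return true;
            }
        }
        return false;
    }

    /**
     * Vectorize unless LLAP I/O would hand over VRBs (encoded caching mode)
     * for a partition whose columns still need data type conversion.
     */
    static boolean shouldVectorize(boolean llapEncodedCachingEnabled,
                                   List<String> partitionTypes,
                                   List<String> tableTypes) {
        return !(llapEncodedCachingEnabled
                && needsTypeConversion(partitionTypes, tableTypes));
    }
}
```

With this sketch, a partition with an int column read through a bigint table 
column is not vectorized while encoded caching is on, but is still vectorized 
when it is off.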

---------------------------------------------------------------------------------------------------------------------------------------------------------------

The longer-term solution can be done later in steps:

1) Write new code that can take a VectorizedRowBatch (VRB) and convert 
columns to different data types.  This is needed when LLAP is doing its 
encoding / caching and feeds VRBs to VectorMapOperator instead of rows.  
Similar to what MapOperator does today, VectorMapOperator would need to be 
enhanced to convert partition VRBs into the table schema VRBs that the vector 
operator tree expects.

2) Today, vectorization logic is strictly position-based.  It insists that 
the partition columns have the same names as the table schema.  The MapOperator 
(and ORC) does more general conversion that uses column names instead of column 
position.  We'd need to enhance all 3 classes to handle column-name-based 
conversion.  The 3 classes are: the new VRB-to-VRB conversion class, 
VectorDeserializeRow, and VectorAssignRow.
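
To make the two longer-term steps concrete, here is a rough sketch of what 
name-based column mapping and a single column conversion could look like.  
Everything in it is hypothetical: ColumnMappingSketch, mapByName and 
widenLongToDouble are illustrative names, plain arrays stand in for 
ColumnVector data, and column names are assumed to match case-insensitively.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical illustration of name-based mapping; not an existing Hive class. */
final class ColumnMappingSketch {

    /**
     * For each table column, returns the index of the partition column with
     * the same name, or -1 when the partition has no such column.
     */
    static int[] mapByName(List<String> partitionColumns, List<String> tableColumns) {
        Map<String, Integer> indexByName = new HashMap<>();
        for (int i = 0; i < partitionColumns.size(); i++) {
            indexByName.put(partitionColumns.get(i).toLowerCase(Locale.ROOT), i);
        }
        int[] sourceIndex = new int[tableColumns.size()];
        for (int i = 0; i < tableColumns.size(); i++) {
            String name = tableColumns.get(i).toLowerCase(Locale.ROOT);
            sourceIndex[i] = indexByName.getOrDefault(name, -1);
        }
        return sourceIndex;
    }

    /**
     * Converts the first 'size' values of a long column to doubles, as one
     * example of a per-column conversion a VRB-to-VRB step would perform.
     */
    static double[] widenLongToDouble(long[] source, int size) {
        double[] converted = new double[size];
        for (int i = 0; i < size; i++) {
            converted[i] = source[i];
        }
        return converted;
    }
}
```

A position-based mapping would simply be the identity (0, 1, 2, ...); the 
name-based mapping differs whenever the partition stores columns in another 
order or lacks a column that the table schema has.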

  was:
Disable vectorization for issue in HIVE-18763 until we can do the harder VRB 
conversion code.

The main changes are:

1) In the Vectorizer, detect if data type conversion is needed between the 
partition and the desired table schema.  If so and LLAP I/O is enabled that 
does encoded catching, then do not vectorize.  Why? When LLAP I/O is in encoded 
catching mode, it delivers VectorizedRowBatch (VRB) to the VectorMapOperator 
instead of (object) rows.  We currently do not have logic for converting VRBs.  
So, we either get Wrong Results or more likely ClassCastException on the 
expected vs actual ColumnVector columns.

2) Cleaned up error message logic.that was suppressing the new message from the 
EXPLAIN VECTORIZATION display.

3) NOTE: Some of the SELECT statements int the schema_evol_test*.q are 
commented out because I bumped into a another bug.  I'll file that one soon and 
add comments to the Q files. 

---------------------------------------------------------------------------------------------------------------------------------------------------------------

The longer-term solution can be done later in steps:

1) Write a new code that can take a VectorizedRowBatch (VRB) and convert 
columns to different data types.  This is needed when LLAP is doing its 
encoding / caching and feeds VRBs to VectorMapOperator instead of rows.  
Similar to what MapOperator does today, VectorMapOperator would need to be 
enhanced to convert partition VRBs into the table schema VRBs that the vector 
operator tree expect.

2) Today, vectorization logic is strictly positional based.  It insists that 
the partition columns have the same names as the table schema.  The MapOperator 
(and ORC) does more general conversion that uses column names instead of column 
position.  We'd need to enhance all 3 classes to handle column name based 
conversion.  The 3 classes are: the new VRB-to-VRB conversion class, 
VectorDeserializeRow, and VectorAssignRow.


> Vectorization: Disable vectorization for LLAP I/O when a 
> non-VECTORIZED_INPUT_FILE_FORMAT mode is needed (i.e. rows) and data type 
> conversion is needed
> -------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-19200
>                 URL: https://issues.apache.org/jira/browse/HIVE-19200
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>    Affects Versions: 3.0.0
>            Reporter: Matt McCline
>            Assignee: Matt McCline
>            Priority: Critical
>             Fix For: 3.0.0
>
>         Attachments: HIVE-19200.01.patch
>
>
> Disable vectorization for the issue in HIVE-18763 until we can write the 
> harder VRB conversion code.
> The main changes are:
> 1) In the Vectorizer, detect if data type conversion is needed between the 
> partition and the desired table schema.  If so, and LLAP I/O is enabled in 
> encoded caching mode, then do not vectorize.  Why? When LLAP I/O is in 
> encoded caching mode, it delivers VectorizedRowBatch (VRB) to the 
> VectorMapOperator instead of (object) rows.  We currently do not have logic 
> for converting VRBs.  So, we either get wrong results or, more likely, a 
> ClassCastException on the expected vs actual ColumnVector columns.
> 2) Cleaned up error message logic that was suppressing the new message from 
> the EXPLAIN VECTORIZATION display.
> 3) NOTE: Some of the SELECT statements in the schema_evol_test*.q files are 
> commented out because I bumped into another bug.  I'll file that one soon 
> and add comments to the Q files.
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
> The longer-term solution can be done later in steps:
> 1) Write new code that can take a VectorizedRowBatch (VRB) and convert 
> columns to different data types.  This is needed when LLAP is doing its 
> encoding / caching and feeds VRBs to VectorMapOperator instead of rows.  
> Similar to what MapOperator does today, VectorMapOperator would need to be 
> enhanced to convert partition VRBs into the table schema VRBs that the vector 
> operator tree expects.
> 2) Today, vectorization logic is strictly position-based.  It insists that 
> the partition columns have the same names as the table schema.  The 
> MapOperator (and ORC) does more general conversion that uses column names 
> instead of column position.  We'd need to enhance all 3 classes to handle 
> column-name-based conversion.  The 3 classes are: the new VRB-to-VRB 
> conversion class, VectorDeserializeRow, and VectorAssignRow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
