[jira] [Work logged] (HIVE-26320) Incorrect case evaluation for Parquet based table

ASF GitHub Bot (Jira) Fri, 30 Sep 2022 03:41:35 -0700


     [ 
https://issues.apache.org/jira/browse/HIVE-26320?focusedWorklogId=813654&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-813654
 ]


ASF GitHub Bot logged work on HIVE-26320:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 30/Sep/22 10:40
            Start Date: 30/Sep/22 10:40
    Worklog Time Spent: 10m 
      Work Description: zabetak commented on code in PR #3628:
URL: https://github.com/apache/hive/pull/3628#discussion_r984418472


##########
ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/ETypeConverter.java:
##########
@@ -481,12 +485,36 @@ protected BytesWritable convert(Binary binary) {
   },
   ESTRING_CONVERTER(String.class) {
     @Override
-    PrimitiveConverter getConverter(final PrimitiveType type, final int index, 
final ConverterParent parent, TypeInfo hiveTypeInfo) {
+    PrimitiveConverter getConverter(final PrimitiveType type, final int index, 
final ConverterParent parent,
+        TypeInfo hiveTypeInfo) {
+      // If we have type information, we should return properly typed strings. 
However, there are a variety
+      // of code paths that do not provide the typeInfo in those cases we 
default to Text. This idiom is also
+      // followed by for example the BigDecimal converter in which if there is 
no type information,
+      // it defaults to the widest representation
+      if (hiveTypeInfo != null) {
+        String typeName = hiveTypeInfo.getTypeName().toLowerCase();
+        if (typeName.startsWith(serdeConstants.CHAR_TYPE_NAME)) {
+          return new BinaryConverter<HiveCharWritable>(type, parent, index) {
+            @Override
+              protected HiveCharWritable convert(Binary binary) {
+                return new HiveCharWritable(binary.getBytes(), ((CharTypeInfo) 
hiveTypeInfo).getLength());
+              }
+          };
+        } else if (typeName.startsWith(serdeConstants.VARCHAR_TYPE_NAME)) {
+          return new BinaryConverter<HiveVarcharWritable>(type, parent, index) 
{
+            @Override
+              protected HiveVarcharWritable convert(Binary binary) {
+                return new HiveVarcharWritable(binary.getBytes(), 
((VarcharTypeInfo) hiveTypeInfo).getLength());
+              }
+          };
+        }
+      }

Review Comment:
   I think the most common way to do something based on a primitive type is to 
use its category. It seems robust and more efficient.
   ```java
   if (hiveTypeInfo instanceof PrimitiveTypeInfo) {
           PrimitiveTypeInfo t = (PrimitiveTypeInfo) hiveTypeInfo;
           switch (t.getPrimitiveCategory()) {
           case VARCHAR:
             break;
           case CHAR:
             break;
           }
         }
   ```
   I see that in this class they rely much on the name and serdeConstants so I 
am also OK with the current code as it is.



##########
ql/src/test/org/apache/hadoop/hive/ql/io/parquet/convert/TestETypeConverter.java:
##########
@@ -212,23 +216,59 @@ public void 
testGetInt64NanosAdjustedToUTCTimestampConverter() throws Exception
   }
 
   @Test
-  public void testGetTextConverter() throws Exception {
+  public void testGetTextConverterForString() throws Exception {
     PrimitiveType primitiveType = Types.optional(PrimitiveTypeName.BINARY)
-        .as(LogicalTypeAnnotation.stringType()).named("value");
-    Writable writable = getWritableFromBinaryConverter(new VarcharTypeInfo(), 
primitiveType,
-        Binary.fromString("this_is_a_value"));
-    Text textWritable = (Text) writable;
-    assertEquals("this_is_a_value", textWritable.toString());
+            .as(LogicalTypeAnnotation.stringType()).named("value");
+    String value = "this_is_a_value";
+    Text textWritable = (Text) getWritableFromBinaryConverter(stringTypeInfo, 
primitiveType,
+            Binary.fromString(value));
+    // we should get what we put in
+    assertEquals(value, textWritable.toString());
+  }
+
+  @Test
+  public void testGetTextConverterForChar() throws Exception {
+    PrimitiveType primitiveType = Types.optional(PrimitiveTypeName.BINARY)
+            .as(LogicalTypeAnnotation.stringType()).named("value");
+    String value = "this_is_a_value";
+
+    HiveCharWritable textWritable = (HiveCharWritable) 
getWritableFromBinaryConverter(
+            new CharTypeInfo(value.length() + 2), primitiveType, 
Binary.fromString(value));
+    // check that it enforces the length (it should be padded)
+    assertEquals(value + "  ", textWritable.toString());
+
+    textWritable = (HiveCharWritable) getWritableFromBinaryConverter(
+            new CharTypeInfo(6), primitiveType, Binary.fromString(value));
+    // check that it enforces the length (it should truncate)
+    assertEquals(value.substring(0, 6), textWritable.toString());

Review Comment:
   nit. The test could be easily split in two and a proper name could avoid the 
need for extra comments
   * testGetTextConverterForCharPadsValueWithSpacesTillLen
   * testGetTextConverterForCharTruncatesValueExceedingLen



##########
ql/src/test/queries/clientpositive/pointlookup.q:
##########
@@ -127,3 +127,39 @@ or inOutputOpt.key = null;
 drop table orOutput;
 drop table inOutput;
 drop table inOutputOpt;
+
+

Issue Time Tracking
-------------------

    Worklog Id:     (was: 813654)
    Time Spent: 3h 20m  (was: 3h 10m)

> Incorrect case evaluation for Parquet based table
> -------------------------------------------------
>
>                 Key: HIVE-26320
>                 URL: https://issues.apache.org/jira/browse/HIVE-26320
>             Project: Hive
>          Issue Type: Bug
>          Components: HiveServer2, Query Planning
>    Affects Versions: 4.0.0-alpha-1
>            Reporter: Chiran Ravani
>            Assignee: John Sherman
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> Query involving case statement with two or more conditions leads to incorrect 
> result for tables with parquet format, The problem is not observed with ORC 
> or TextFile.
> *Steps to reproduce*:
> {code:java}
> create external table case_test_parquet(kob varchar(2),enhanced_type_code 
> int) stored as parquet;
> insert into case_test_parquet values('BB',18),('BC',18),('AB',18);
> select case when (
>                    (kob='BB' and enhanced_type_code='18')
>                    or (kob='BC' and enhanced_type_code='18')
>                  )
>             then 1
>             else 0
>         end as logic_check
> from case_test_parquet;
> {code}
> Result:
> {code}
> 0
> 0
> 0
> {code}
> Expected result:
> {code}
> 1
> 1
> 0
> {code}
> The problem does not appear when setting hive.optimize.point.lookup=false.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Work logged] (HIVE-26320) Incorrect case evaluation for Parquet based table

Reply via email to