[ https://issues.apache.org/jira/browse/HIVE-12080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957237#comment-14957237 ]

Sergio Peña commented on HIVE-12080:
------------------------------------

I see that option #1 is the easier one. But if you do that, I'd prefer that you 
create a new class for Parquet only, so these "if" checks do not affect the other 
formats that Hive supports. People who use other formats might object to this 
approach, so we would need to do it only for Parquet, similar to what 
{{ParquetShortInspector}} does.
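
To make the option #1 idea concrete, here is a rough sketch of the per-row check such 
a Parquet-only inspector would carry (the class name below is hypothetical, just in 
the spirit of {{ParquetShortInspector}}):
{code}
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;

// Hypothetical sketch of option #1: a Parquet-only inspector that tolerates the
// narrower writable produced by the reader, at the cost of a branch on every get().
public class ParquetLongInspectorSketch {
  public long get(final Object o) {
    if (o instanceof IntWritable) {
      // The file stored int32 but the table column is bigint: widen per row.
      return ((IntWritable) o).get();
    }
    return ((LongWritable) o).get();
  }
}
{code}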

That said, the "if" by itself does not look like it would add a lot of overhead; but 
if the get() method is called for 1 billion rows, the overhead could start to show 
there. I'd need to run some tests to confirm it, but I am assuming so for now.

Option #2 is the one that could make more sense, because the getConverter() method 
is called only once per query. After that, the specific object inspector is the one 
that gets called as many times as needed.

I had a similar idea to option #2 while looking into the code. The idea was to pass 
the correct Hive column type to getConverter(), and let the getConverter() method 
decide which writable to use. For instance:
{code}
EINT32_CONVERTER(Integer.TYPE) {
    @Override
    PrimitiveConverter getConverter(final PrimitiveType type, final TypeInfo hiveType,
        final int index, final ConverterParent parent) {
      if (hiveType.equals(TypeInfoFactory.longTypeInfo)) {
        // The Hive column is declared bigint, so widen int32 -> long once here,
        // instead of checking per row in the object inspector.
        return new PrimitiveConverter() {
          @Override
          public void addInt(final int value) {
            parent.set(index, new LongWritable((long) value));
          }
        };
      } else {
        return new PrimitiveConverter() {
          @Override
          public void addInt(final int value) {
            parent.set(index, new IntWritable(value));
          }
        };
      }
    }
  }
{code}

That way, Hive will use the WritableLongObjectInspector to read a long value 
all the time. It might need a better design, but it is just an idea. It would 
also work for the int -> short conversion (as ParquetShortInspector does today). If 
you want to investigate this approach further, then take a look at 
{{DataWritableReadSupport}} as well. That is where the Parquet schema (with 
Parquet types) is created based on the Hive types.
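
Just to illustrate the benefit on the read side (this is only an assumed usage, not 
code from a patch): once the converter always emits a LongWritable for a column 
declared as bigint, the stock writable inspector can read it with no per-row branching:
{code}
// Hedged illustration: the converter sketched above already widened the int32 value,
// so the standard inspector's get() needs no instanceof checks.
LongObjectInspector inspector =
    PrimitiveObjectInspectorFactory.writableLongObjectInspector;
LongWritable widened = new LongWritable(42L);  // value produced by the converter
long value = inspector.get(widened);           // plain get(), no "if"
{code}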

Btw, Hive is using Parquet 1.7, so we can use the PARQUET-2 approach as well. 

Let me know your thoughts.

> Support auto type widening for Parquet table
> --------------------------------------------
>
>                 Key: HIVE-12080
>                 URL: https://issues.apache.org/jira/browse/HIVE-12080
>             Project: Hive
>          Issue Type: New Feature
>          Components: File Formats
>            Reporter: Mohammad Kamrul Islam
>            Assignee: Mohammad Kamrul Islam
>
> Currently Hive+Parquet doesn't support it. It should include at least the basic 
> type promotions short -> int -> bigint, float -> double, etc., which are already 
> supported for other file formats.
> There was a similar effort (HIVE-6784), but it was not committed. This JIRA is to 
> address the same in a different way with little (or no) performance impact.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
