[
https://issues.apache.org/jira/browse/HIVE-12080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957237#comment-14957237
]
Sergio Peña commented on HIVE-12080:
------------------------------------
I see that option #1 is the easier one. But if you go that route, I would prefer
that you create a new class for Parquet only, so these "if" checks do not affect
the other formats that Hive supports. People who use other formats might object
to this approach, so we would need to do it only for Parquet, the way
{{ParquetShortInspector}} does it (see the sketch below).
On its own the "if" does not look like it would add much overhead, but when the
get() method is called for 1 billion rows, the overhead could become visible.
I'd need to run some tests to confirm that, but I am assuming it for now.
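For illustration, here is what such a Parquet-only inspector could look like for
the int -> bigint case, mirroring {{ParquetShortInspector}}. This is only a
hypothetical sketch; the class name {{ParquetLongInspector}} and its wiring are
assumptions, not committed code:
{code}
import org.apache.hadoop.hive.serde2.objectinspector.primitive.AbstractPrimitiveJavaObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.SettableLongObjectInspector;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;

// Hypothetical Parquet-only inspector that widens INT32 writables to long,
// keeping the instanceof check out of the inspectors shared by other formats.
public class ParquetLongInspector extends AbstractPrimitiveJavaObjectInspector
    implements SettableLongObjectInspector {

  ParquetLongInspector() {
    super(TypeInfoFactory.longTypeInfo);
  }

  @Override
  public Object getPrimitiveWritableObject(final Object o) {
    return o == null ? null : new LongWritable(get(o));
  }

  @Override
  public Object create(final long val) {
    return new LongWritable(val);
  }

  @Override
  public Object set(final Object o, final long val) {
    ((LongWritable) o).set(val);
    return o;
  }

  @Override
  public long get(final Object o) {
    // Accept the int writables produced by the Parquet converters and widen them.
    if (o instanceof IntWritable) {
      return ((IntWritable) o).get();
    }
    return ((LongWritable) o).get();
  }
}
{code}
The Parquet serde would then have to hand this inspector out whenever the
declared Hive type is bigint, so only the Parquet path pays for the instanceof
check.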
Option #2 is the one that makes more sense to me, because the getConverter()
method is called only once per query. After that, the specific object inspector
is the one called as many times as needed.
I had an idea similar to option 2 while looking into the code: pass the correct
Hive column type to getConverter(), and let getConverter() decide which writable
to use. For instance:
{code}
EINT32_CONVERTER(Integer.TYPE) {
  @Override
  PrimitiveConverter getConverter(final PrimitiveType type, final TypeInfo hiveType,
      final int index, final ConverterParent parent) {
    if (hiveType.equals(TypeInfoFactory.longTypeInfo)) {
      // The Hive column is declared as bigint: widen the INT32 value once,
      // at conversion time, so the long object inspector can read it directly.
      return new PrimitiveConverter() {
        @Override
        public void addInt(final int value) {
          parent.set(index, new LongWritable((long) value));
        }
      };
    } else {
      // Default path: the Hive column type matches the Parquet INT32 type.
      return new PrimitiveConverter() {
        @Override
        public void addInt(final int value) {
          parent.set(index, new IntWritable(value));
        }
      };
    }
  }
}
{code}
That way, Hive will use the WritableLongObjectInspector to read a long value
all the time. It might need a better design, but it is just an idea. It would
also work when converting int -> short (as ParquetShortInspector does). If you
want to investigate this approach further, take a look at
{{DataWritableReadSupport}} as well. That is where the Parquet schema (with
Parquet types) is created based on the Hive types.
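To make that concrete, the declared Hive column types are already available from
the job configuration that {{DataWritableReadSupport}} reads, so they could be
recovered and handed down to getConverter(). A rough sketch, where the helper
class is mine and only {{IOConstants.COLUMNS_TYPES}} and {{TypeInfoUtils}} are
existing Hive API:
{code}
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.ql.io.IOConstants;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils;

// Illustrative helper, not committed code: recover the declared Hive column
// types so they can be passed to getConverter() as proposed above.
public class HiveTypeLookup {
  static List<TypeInfo> getHiveTypeInfos(final Configuration conf) {
    // Hive stores the declared column types (e.g. "bigint:string") in the
    // job configuration; DataWritableReadSupport already reads this key.
    final String columnTypes = conf.get(IOConstants.COLUMNS_TYPES);
    return TypeInfoUtils.getTypeInfosFromTypeString(columnTypes);
  }
}
{code}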
Btw, Hive is using Parquet 1.7, so we can use the PARQUET-2 approach as well.
Let me know your thoughts.
> Support auto type widening for Parquet table
> --------------------------------------------
>
> Key: HIVE-12080
> URL: https://issues.apache.org/jira/browse/HIVE-12080
> Project: Hive
> Issue Type: New Feature
> Components: File Formats
> Reporter: Mohammad Kamrul Islam
> Assignee: Mohammad Kamrul Islam
>
> Currently Hive+Parquet doesn't support this. It should include at least the
> basic type promotions short -> int -> bigint, float -> double, etc., that are
> already supported for other file formats.
> There was a similar effort (HIVE-6784), but it was not committed. This JIRA is
> to address the same in a different way with little (or no) performance impact.
>