[
https://issues.apache.org/jira/browse/IMPALA-10785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17384716#comment-17384716
]
Quanlong Huang commented on IMPALA-10785:
-----------------------------------------
Note that if the hdfs operand is the first operand, it can still be passed
through:
{code}
[localhost:21050] default> explain select max(id), count(string_col) from (select id, string_col from functional.alltypes union all select id, string_col from functional_kudu.alltypes) t;
Query: explain select max(id), count(string_col) from (select id, string_col from functional.alltypes union all select id, string_col from functional_kudu.alltypes) t
+------------------------------------------------------------------------------------+
| Explain String                                                                     |
+------------------------------------------------------------------------------------+
| Max Per-Host Resource Reservation: Memory=4.03MB Threads=3                         |
| Per-Host Resource Estimates: Memory=164MB                                          |
| WARNING: The following tables are missing relevant table and/or column statistics. |
| functional.alltypes, functional_kudu.alltypes                                      |
|                                                                                    |
| PLAN-ROOT SINK                                                                     |
| |                                                                                  |
| 05:AGGREGATE [FINALIZE]                                                            |
| |  output: max:merge(id), count:merge(string_col)                                  |
| |  row-size=12B cardinality=1                                                      |
| |                                                                                  |
| 04:EXCHANGE [UNPARTITIONED]                                                        |
| |                                                                                  |
| 03:AGGREGATE                                                                       |
| |  output: max(id), count(string_col)                                              |
| |  row-size=12B cardinality=1                                                      |
| |                                                                                  |
| 00:UNION                                                                           |
| |  pass-through-operands: 01 <----------------- Pass through the hdfs scan results |
| |  row-size=16B cardinality=6.12K                                                  |
| |                                                                                  |
| |--02:SCAN KUDU [functional_kudu.alltypes]                                         |
| |     row-size=20B cardinality=unavailable                                         |
| |                                                                                  |
| 01:SCAN HDFS [functional.alltypes]                                                 |
|    HDFS partitions=24/24 files=24 size=478.45KB                                    |
|    row-size=16B cardinality=6.12K                                                  |
+------------------------------------------------------------------------------------+
{code}
I'm still not clear on how you resolved the first 2 points. But if you simply
change the Kudu tuple layout in the FE, it will break the BE, where we memcpy
the whole tuple here:
[https://github.com/apache/impala/blob/e11237e29ed3a41dd361dd7a541f9702d0d0b16b/be/src/exec/kudu-scanner.cc#L413].
{code:cpp}
void Tuple::DeepCopy(Tuple* dst, const TupleDescriptor& desc, MemPool* pool) {
  // Bitwise copy of the fixed-length portion: assumes 'dst' uses exactly the
  // same memory layout as 'this', so a changed Kudu tuple layout would break it.
  memcpy(dst, this, desc.byte_size());
  if (desc.HasVarlenSlots()) dst->DeepCopyVarlenData(desc, pool);
}
{code}
So we need to add changes there, which is exactly the materialization done in
UnionNode.
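To illustrate the idea (a simplified sketch only; the struct and function names below are hypothetical, not actual Impala code): materializing string slots one by one, instead of memcpy'ing the whole tuple, lets the copy tolerate a 16B Kudu-style slot on the source side and a 12B Impala-style slot on the destination side:
{code:cpp}
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical 16-byte Kudu-style string slot: 8-byte pointer + 8-byte length.
struct KuduSlice {
  const char* ptr;
  int64_t len;
};

// Hypothetical 12-byte Impala-style string slot: 8-byte pointer + 4-byte length.
#pragma pack(push, 4)
struct StringValue {
  const char* ptr;
  int32_t len;
};
#pragma pack(pop)

// Slot-wise materialization: convert each slot into the destination layout
// rather than memcpy'ing the tuple, which requires identical layouts.
StringValue MaterializeStringSlot(const KuduSlice& src) {
  StringValue dst;
  dst.ptr = src.ptr;
  dst.len = static_cast<int32_t>(src.len);  // narrow the 8-byte length to 4 bytes
  return dst;
}

int main() {
  const char* data = "hello";
  KuduSlice src{data, 5};
  StringValue dst = MaterializeStringSlot(src);
  assert(sizeof(KuduSlice) == 16);
  assert(sizeof(StringValue) == 12);
  assert(dst.len == 5 && std::string(dst.ptr, dst.len) == "hello");
  return 0;
}
{code}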
For the 3rd point, columns from hdfs tables are always nullable. I think what
we need to adjust is marking Kudu's primary key slots as nullable as well (when
the Kudu operand is simple, e.g. just a KuduScanNode), and then invoking
[TupleDescriptor#recomputeMemLayout()|https://github.com/apache/impala/blob/e11237e29ed3a41dd361dd7a541f9702d0d0b16b/fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java#L271]
to add the null-indicator bits.
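The sizing effect of recomputing the layout can be sketched like this (simplified arithmetic only, not the actual recomputeMemLayout code; the function name is made up): each nullable slot needs one null-indicator bit, and the bits are packed into whole bytes of the tuple.
{code:cpp}
#include <cassert>

// Simplified sketch: one null-indicator bit per nullable slot,
// rounded up to whole bytes in the tuple's memory layout.
int NullIndicatorBytes(int num_nullable_slots) {
  return (num_nullable_slots + 7) / 8;
}

int main() {
  // Marking a previously non-nullable Kudu primary key as nullable can
  // grow the tuple: going from 0 to 1 nullable slots adds a byte.
  assert(NullIndicatorBytes(0) == 0);
  assert(NullIndicatorBytes(1) == 1);
  assert(NullIndicatorBytes(8) == 1);
  assert(NullIndicatorBytes(9) == 2);
  return 0;
}
{code}
This is why both sides of the union must agree on nullability: a tuple with null-indicator bytes has a different byte size and slot offsets than one without.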
In general, this acts like pushing down the computation (here, the
materialization) from the UnionNode into the operands. Maybe a more general
solution is introducing a transform operator on top of each non-pass-through
operand to make their output tuples align. This can be discussed in a separate
JIRA.
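A rough sketch of that transform-operator idea (hypothetical types and names, not a proposed Impala API): wrap each non-pass-through operand with a per-row conversion into the union's output layout, so the union itself can pass every operand through unchanged.
{code:cpp}
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical operand output layout (wider than the union needs).
struct Row { int64_t id; int64_t extra; };
// Hypothetical union output layout.
struct OutRow { int32_t id; };

// Transform operator: applies a per-row conversion so the downstream
// union sees rows already in its own tuple layout.
std::vector<OutRow> Transform(const std::vector<Row>& in,
                              const std::function<OutRow(const Row&)>& fn) {
  std::vector<OutRow> out;
  out.reserve(in.size());
  for (const Row& r : in) out.push_back(fn(r));
  return out;
}

int main() {
  std::vector<Row> rows = {{1, 100}, {2, 200}};
  auto out = Transform(rows, [](const Row& r) {
    return OutRow{static_cast<int32_t>(r.id)};
  });
  assert(out.size() == 2 && out[0].id == 1 && out[1].id == 2);
  return 0;
}
{code}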
> when union kudu table and hdfs table, union passthrough does not take effect
> ----------------------------------------------------------------------------
>
> Key: IMPALA-10785
> URL: https://issues.apache.org/jira/browse/IMPALA-10785
> Project: IMPALA
> Issue Type: Improvement
> Reporter: pengdou1990
> Priority: Major
>
> IMPALA-3586 already supports union passthrough and brings great performance
> improvements to unions, but there are still some problems when unioning an
> hdfs table with a kudu table. Several points cause the problem:
> # in the kudu scan node's output TupleDescriptor, a string slot is 16B, while
> in the hdfs scan node's output TupleDescriptor, a string slot is 12B, causing
> a tuple memory layout mismatch
> # in the kudu scan node's output TupleDescriptor, a string slot is 16B, while
> in the Union's output TupleDescriptor, a string slot is 12B, causing a tuple
> memory layout mismatch
> # in the kudu scan node, the row key slot is not null, while for the hdfs
> node, not-null slots can't be obtained from the metadata, causing a tuple
> memory layout mismatch
> I have resolved the 1st and 2nd points; how should I handle the 3rd point?
--
This message was sent by Atlassian Jira
(v8.3.4#803005)