[
https://issues.apache.org/jira/browse/DRILL-6606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16549949#comment-16549949
]
ASF GitHub Bot commented on DRILL-6606:
---------------------------------------
ilooner commented on issue #1384: DRILL-6606: Fixed bug in HashJoin that caused
it not to return OK_NEW_SCHEMA in some cases.
URL: https://github.com/apache/drill/pull/1384#issuecomment-406432426
Thanks for the +1.
With respect to your comment, calling prefetchFirstBatchFromBothSides from
buildSchema was actually the source of the problem. Doing so left the
operator in BatchState.FIRST after buildSchema returned, which caused
**OK_NEW_SCHEMA** to NOT be sent. Downstream operators then never built a
correct schema and returned incorrect data types in some cases.
That was the crux of the issue.
This change fixes that issue by separating data prefetching into two phases:
- Schema sniffing
- Data sniffing
The schemas need to be sniffed in the buildSchema call so that the operator
has its schema. After sniffing schemas, the state of the operator is still
BUILD_SCHEMA and OK_NEW_SCHEMA is emitted. Data sniffing then needs to happen
in the call to innerNext(), after the operator has emitted OK_NEW_SCHEMA.
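The difference between the two orderings can be illustrated with a toy model.
This is only a sketch under stated assumptions, not Drill's actual code:
ToyHashJoin, sniffSchemas, sniffData, drain, the reduced enums, and the
simplified next() driver are hypothetical. Only the names buildSchema,
innerNext, BatchState and OK_NEW_SCHEMA are taken from the discussion above.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: reduced stand-ins, not Drill's real enums or classes.
enum BatchState { BUILD_SCHEMA, FIRST, NOT_FIRST }
enum IterOutcome { OK_NEW_SCHEMA, NONE }

class ToyHashJoin {
    // true models the behaviour before the fix (data prefetched in buildSchema).
    private final boolean prefetchDataInBuildSchema;
    private BatchState state = BatchState.BUILD_SCHEMA;
    private boolean dataSniffed = false;

    ToyHashJoin(boolean prefetchDataInBuildSchema) {
        this.prefetchDataInBuildSchema = prefetchDataInBuildSchema;
    }

    // Phase 1: learn the schemas of both inputs. Leaves the state untouched.
    private void sniffSchemas() { }

    // Phase 2: pull the first data batches to gather statistics.
    // Assumption of this model: doing so moves the operator past BUILD_SCHEMA.
    private void sniffData() {
        dataSniffed = true;
        state = BatchState.FIRST;
    }

    private void buildSchema() {
        sniffSchemas();
        if (prefetchDataInBuildSchema) {
            sniffData();
        }
    }

    private IterOutcome innerNext() {
        if (!dataSniffed) {
            sniffData();
        }
        state = BatchState.NOT_FIRST;
        return IterOutcome.NONE; // both inputs are LIMIT 0, so no rows to join
    }

    // Simplified driver (assumed): the schema is announced only if the
    // operator is still in BUILD_SCHEMA once buildSchema() has returned.
    IterOutcome next() {
        if (state == BatchState.BUILD_SCHEMA) {
            buildSchema();
            if (state == BatchState.BUILD_SCHEMA) {
                state = BatchState.FIRST;
                return IterOutcome.OK_NEW_SCHEMA;
            }
            // State already advanced, so the schema announcement is skipped.
        }
        return innerNext();
    }

    // Collects every outcome the operator produces until it reports NONE.
    static List<IterOutcome> drain(ToyHashJoin op) {
        List<IterOutcome> seen = new ArrayList<>();
        IterOutcome outcome;
        do {
            outcome = op.next();
            seen.add(outcome);
        } while (outcome != IterOutcome.NONE);
        return seen;
    }
}
```

Under these assumptions, ToyHashJoin.drain(new ToyHashJoin(true)) yields only
NONE, so a downstream consumer never sees a schema, while
ToyHashJoin.drain(new ToyHashJoin(false)) yields OK_NEW_SCHEMA followed by
NONE.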
Other binary operators don't have this issue because they do not try to stay
within a memory limit and, as a consequence, do not need to collect
statistics about the data through sniffing.
Furthermore, doing the sniffing in two stages is not a hack. It is required
for functional correctness for queries like the one added in the unit test and
for the reasons described above.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
> Hash Join returns incorrect data types when joining subqueries with limit 0
> ---------------------------------------------------------------------------
>
> Key: DRILL-6606
> URL: https://issues.apache.org/jira/browse/DRILL-6606
> Project: Apache Drill
> Issue Type: Bug
> Reporter: Bohdan Kazydub
> Assignee: Timothy Farkas
> Priority: Blocker
> Fix For: 1.14.0
>
>
> PreparedStatement for query
> {code:sql}
> SELECT l.l_quantity, l.l_shipdate, o.o_custkey
> FROM (SELECT * FROM cp.`tpch/lineitem.parquet` LIMIT 0) l
> JOIN (SELECT * FROM cp.`tpch/orders.parquet` LIMIT 0) o
> ON l.l_orderkey = o.o_orderkey
> LIMIT 0
> {code}
> is created with wrong types (nullable INTEGER) for all selected columns, no
> matter what their actual type is. This behavior reproduces with hash join
> only and is very likely caused by DRILL-6027, as the query worked fine
> before that feature was implemented.
> To reproduce the problem you can put the aforementioned query into
> TestPreparedStatementProvider#joinOrderByQuery() test method.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)