[ https://issues.apache.org/jira/browse/HIVE-18908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matt McCline updated HIVE-18908:
--------------------------------
Status: In Progress (was: Patch Available)
> FULL OUTER JOIN to MapJoin
> --------------------------
>
> Key: HIVE-18908
> URL: https://issues.apache.org/jira/browse/HIVE-18908
> Project: Hive
> Issue Type: Improvement
> Components: Hive
> Reporter: Matt McCline
> Assignee: Matt McCline
> Priority: Critical
> Attachments: FULL OUTER MapJoin Code Changes.docx,
> HIVE-18908.01.patch, HIVE-18908.02.patch, HIVE-18908.03.patch,
> HIVE-18908.04.patch, HIVE-18908.05.patch, HIVE-18908.06.patch,
> HIVE-18908.08.patch, HIVE-18908.09.patch, HIVE-18908.091.patch,
> HIVE-18908.092.patch, HIVE-18908.093.patch, HIVE-18908.096.patch,
> HIVE-18908.097.patch, HIVE-18908.098.patch, HIVE-18908.099.patch,
> HIVE-18908.0991.patch, HIVE-18908.0992.patch, HIVE-18908.0993.patch,
> HIVE-18908.0994.patch, HIVE-18908.0995.patch, HIVE-18908.0996.patch,
> HIVE-18908.0997.patch, HIVE-18908.0998.patch, HIVE-18908.0999.patch,
> HIVE-18908.09991.patch, HIVE-18908.09992.patch, HIVE-18908.09993.patch,
> HIVE-18908.09994.patch, HIVE-18908.09995.patch, HIVE-18908.09996.patch,
> JOIN to MAPJOIN Transformation.pdf, SHARED-MEMORY FULL OUTER MapJoin.pdf
>
>
> Currently, we do not support FULL OUTER JOIN in MapJoin.
> Rough TPC-DS timings, run on a laptop:
> (NOTE: In Query 51, PTF makes up a larger serial portion of the work, so
> Amdahl's law limits the overall speedup.)
> With FULL OUTER MapJoin OFF, the join runs as a MergeJoin.
> Query 51:
>   o Vectorization OFF
>     • FULL OUTER MapJoin OFF: 4:30 minutes
>     • FULL OUTER MapJoin ON: 4:37 minutes
>   o Vectorization ON
>     • FULL OUTER MapJoin OFF: 2:35 minutes
>     • FULL OUTER MapJoin ON: 1:47 minutes
> Query 97:
>   o Vectorization OFF
>     • FULL OUTER MapJoin OFF: 2:37 minutes
>     • FULL OUTER MapJoin ON: 2:42 minutes
>   o Vectorization ON
>     • FULL OUTER MapJoin OFF: 1:17 minutes
>     • FULL OUTER MapJoin ON: 0:06 minutes
> FULL OUTER Join 10,000,000 rows against 323,910 small table keys:
>   o Vectorization ON
>     • FULL OUTER MapJoin OFF: 14:56 minutes
>     • FULL OUTER MapJoin ON: 1:45 minutes
> FULL OUTER Join 10,000,000 rows against 1,000 small table keys:
>   o Vectorization ON
>     • FULL OUTER MapJoin OFF: 12:37 minutes
>     • FULL OUTER MapJoin ON: 1:38 minutes
> These are laptop numbers only. Hopefully, someone will do large-scale
> cluster testing. On a cluster, [DynamicPartitionedHashJoin] MapJoin should
> scale dramatically better than the reduce-shuffle [Sort] MergeJoin.
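
For readers unfamiliar with why FULL OUTER needs extra bookkeeping in a hash-based join, the following is a minimal Python sketch of the general technique, not the code in the attached patches. It builds a hash table on the small table, probes it with big-table rows, records which small-table keys were matched, and finally emits the unmatched small-table rows padded with NULLs. The function name full_outer_hash_join and the (key, value) row shape are hypothetical, None stands in for SQL NULL, and the sketch assumes a single task sees every row for a given key.

```python
from collections import defaultdict


def full_outer_hash_join(big_rows, small_rows):
    """Join two lists of (key, value) rows; None stands in for SQL NULL.

    Hypothetical helper for illustration only. Assumes one task sees
    all rows for every key.
    """
    # Build phase: hash the small table by key.
    table = defaultdict(list)
    for key, value in small_rows:
        table[key].append(value)

    matched = set()  # small-table keys hit at least once during the probe
    out = []

    # Probe phase: stream the big table through the hash table.
    # A NULL key never matches anything, as in SQL join semantics.
    for key, big_value in big_rows:
        if key is not None and key in table:
            matched.add(key)
            for small_value in table[key]:
                out.append((key, big_value, small_value))
        else:
            # No match: keep the big-table row, NULL on the small side.
            out.append((key, big_value, None))

    # Final phase: emit small-table rows whose key was never matched,
    # with NULL on the big side. An inner or left outer join skips this.
    for key, values in table.items():
        if key not in matched:
            for small_value in values:
                out.append((key, None, small_value))
    return out


if __name__ == "__main__":
    big = [(1, "a"), (2, "b"), (2, "c")]
    small = [(2, "x"), (3, "y")]
    for row in full_outer_hash_join(big, small):
        print(row)
```

The final phase is what distinguishes FULL OUTER from the other hash join types: it can only run after the whole big-table input for those keys has been probed, which is why the match tracking must cover every row that could hit a given key.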