[ 
https://issues.apache.org/jira/browse/HIVEMALL-185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453705#comment-16453705
 ] 

Takeshi Yamamuro commented on HIVEMALL-185:
-------------------------------------------

I found a novel approach [1] to sampling over multi-way joins, and ISTM this 
approach would fit well with Spark Catalyst.
[1] Z. Zhao, R. Christensen, F. Li, X. Hu, K. Yi, Random Sampling over Joins 
Revisited, Proceedings of SIGMOD, 2018.

> Add an optimizer rule to push down a Sample plan node into fact tables
> ----------------------------------------------------------------------
>
>                 Key: HIVEMALL-185
>                 URL: https://issues.apache.org/jira/browse/HIVEMALL-185
>             Project: Hivemall
>          Issue Type: Sub-task
>            Reporter: Takeshi Yamamuro
>            Assignee: Takeshi Yamamuro
>            Priority: Major
>
> Sampling is a common technique for extracting part of the data from joined 
> relations (fact tables and dimension tables) to use as training data. The 
> optimizer in Spark cannot push down a Sample plan node into the larger fact 
> tables because this node is non-deterministic. However, by using referential 
> integrity (RI) constraints, we could push this node down into fact tables in 
> some cases.
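
To illustrate why an RI constraint makes this pushdown safe, here is a minimal 
sketch in plain Python. It is not Spark or Hivemall code. The tables, the 
function names, and the per-row seeded Bernoulli decision are all assumptions 
made for illustration. The per-row decision is used only so that both plans can 
be compared row for row; with an ordinary non-deterministic sampler the two 
plans would agree in distribution rather than row for row.

```python
import random

# Hypothetical toy data: every fact row references an existing dimension row
# (the RI constraint), and dim_id is unique on the dimension side.
dim = {1: "red", 2: "green", 3: "blue"}          # dim_id -> attribute
fact = [(i, (i % 3) + 1) for i in range(1000)]   # (fact_id, dim_id)


def keep(fact_id, fraction, seed):
    """Bernoulli decision for one fact row, fixed by (seed, fact_id)."""
    return random.Random(seed * 1_000_003 + fact_id).random() < fraction


def join(fact_rows, dim_table):
    """Inner join on dim_id; under RI each fact row matches exactly one dim row."""
    return [(fid, did, dim_table[did]) for fid, did in fact_rows if did in dim_table]


def sample_after_join(fraction, seed):
    """Original plan: join first, then sample the joined rows."""
    return [row for row in join(fact, dim) if keep(row[0], fraction, seed)]


def sample_before_join(fraction, seed):
    """Pushed-down plan: sample the fact table first, then join."""
    sampled_fact = [row for row in fact if keep(row[0], fraction, seed)]
    return join(sampled_fact, dim)


if __name__ == "__main__":
    a = sample_after_join(0.1, seed=42)
    b = sample_before_join(0.1, seed=42)
    print(a == b, len(a))
```

Because each fact row joins with exactly one dimension row, a joined row 
survives sampling exactly when its fact row does, so the two plans return the 
same rows while the pushed-down plan joins far fewer fact rows. If a fact row 
could match zero or several dimension rows, this one-to-one correspondence 
would no longer hold.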


