[ https://issues.apache.org/jira/browse/HIVEMALL-185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453705#comment-16453705 ]
Takeshi Yamamuro commented on HIVEMALL-185:
-------------------------------------------

I found a novel approach [1] to sampling over multi-way joins, and ISTM it would fit well with Spark Catalyst.

[1] Z. Zhao, R. Christensen, F. Li, X. Hu, K. Yi, Random Sampling over Joins Revisited, Proceedings of SIGMOD, 2018.

> Add an optimizer rule to push down a Sample plan node into fact tables
> ----------------------------------------------------------------------
>
>                 Key: HIVEMALL-185
>                 URL: https://issues.apache.org/jira/browse/HIVEMALL-185
>             Project: Hivemall
>          Issue Type: Sub-task
>            Reporter: Takeshi Yamamuro
>            Assignee: Takeshi Yamamuro
>            Priority: Major
>
> Sampling is a common technique for extracting part of the data from joined
> relations (fact tables and dimension tables) to use as training data. The
> optimizer in Spark cannot push a Sample plan node down into the larger fact
> tables because this node is non-deterministic. However, by using referential
> integrity (RI) constraints, we could push this node down into fact tables in
> some cases.
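
As a rough illustration of why RI constraints make the push-down in the description safe, here is a minimal sketch in plain Python. It is not Hivemall or Catalyst code. The helper names (`bernoulli_sample`, `fk_join`) and the toy tables are hypothetical, and it assumes every fact row references exactly one existing dimension row, so the join neither drops nor duplicates fact rows.

```python
import random


def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


def fk_join(fact_rows, dim_by_key):
    """Inner-join fact rows to a dimension table keyed by its primary key."""
    return [
        {**fact, **dim_by_key[fact["dim_id"]]}
        for fact in fact_rows
        if fact["dim_id"] in dim_by_key
    ]


# Hypothetical toy data: a small dimension table and a larger fact table
# whose foreign key `dim_id` always points at an existing dimension row.
dim = {k: {"dim_id": k, "label": f"d{k}"} for k in range(5)}
fact = [{"fact_id": i, "dim_id": i % 5} for i in range(1000)]

# Plan A: join first, then sample the joined rows.
sample_after_join = bernoulli_sample(fk_join(fact, dim), 0.1, seed=42)

# Plan B: sample the fact table first, then join the survivors.
join_after_sample = fk_join(bernoulli_sample(fact, 0.1, seed=42), dim)

# Under the RI assumption the join keeps row count and order, so the same
# seed selects the same rows in both plans.
assert sample_after_join == join_after_sample
```

The exact row-for-row equality here comes from reusing the seed on inputs of equal length and order. For an actual optimizer rule the relevant property would be that both plans produce the same output distribution, while Plan B joins far fewer rows.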