[ https://issues.apache.org/jira/browse/HIVEMALL-184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Makoto Yui closed HIVEMALL-184.
-------------------------------
    Resolution: Abandoned

> Add an optimizer rule to filter out columns by using Mutual Information
> -----------------------------------------------------------------------
>
>                 Key: HIVEMALL-184
>                 URL: https://issues.apache.org/jira/browse/HIVEMALL-184
>             Project: Hivemall
>          Issue Type: Sub-task
>            Reporter: Takeshi Yamamuro
>            Assignee: Takeshi Yamamuro
>            Priority: Major
>              Labels: spark
>
> Mutual Information (MI) is an indicator that detects and quantifies
> dependencies between variables, so it is useful for filtering out columns
> during feature selection. Nearest-neighbor distances are frequently used to
> estimate MI [1], so we could use these distances to compute MI between the
> columns of each relation when running an ANALYZE command. Then, we could
> filter out "similar" columns in the optimizer phase by referring to a new
> threshold (e.g. `spark.sql.optimizer.featureSelection.mutualInfoThreshold`).
> Separately, we need to consider a lightweight way to update MI when
> re-running an ANALYZE command. A recent study [2] proposed a sophisticated
> technique to compute MI for dynamic data.
>
> [1] Dafydd Evans, A computationally efficient estimator for mutual
> information. In Proceedings of the Royal Society of London A: Mathematical,
> Physical and Engineering Sciences, Vol. 464. The Royal Society, 1203–1215,
> 2008.
> [2] Michael Vollmer et al., On Complexity and Efficiency of Mutual
> Information Estimation on Static and Dynamic Data, Proceedings of EDBT, 2018.
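
For illustration only, the filtering idea described in the issue could be sketched roughly as follows. This is not Hivemall or Spark code. It assumes discrete columns and swaps in a simple plug-in (frequency-count) MI estimate rather than the nearest-neighbor estimator of [1]. The function names `mutual_information` and `select_columns` are hypothetical, and the `threshold` argument only stands in for the proposed `spark.sql.optimizer.featureSelection.mutualInfoThreshold` setting.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in MI estimate (in nats) between two discrete columns.

    Assumption: a frequency-count estimator, not the nearest-neighbor
    estimator cited in the issue.
    """
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("columns must be non-empty and of equal length")
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log(c * n / (count_x[x] * count_y[y]))
    return mi


def select_columns(columns, threshold):
    """Greedily keep a column unless its MI with an already-kept column
    exceeds the threshold (hypothetical stand-in for the optimizer rule)."""
    kept = []
    for name, values in columns.items():
        if all(mutual_information(values, columns[k]) <= threshold for k in kept):
            kept.append(name)
    return kept


# Toy relation: "b" duplicates "a"; "c" is independent of "a".
columns = {
    "a": [0, 0, 1, 1, 0, 1, 0, 1],
    "b": [0, 0, 1, 1, 0, 1, 0, 1],
    "c": [0, 1, 0, 1, 1, 0, 0, 1],
}
selected = select_columns(columns, threshold=0.5)
print(selected)
```

In this toy relation, column "b" is dropped because its MI with "a" exceeds the threshold, while "c" is kept. A real rule would presumably read precomputed MI values from the statistics gathered by ANALYZE instead of recomputing them at optimization time.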