Takeshi Yamamuro created HIVEMALL-184:
-----------------------------------------

             Summary: Add an optimizer rule to filter out columns by using Mutual Information
                 Key: HIVEMALL-184
                 URL: https://issues.apache.org/jira/browse/HIVEMALL-184
             Project: Hivemall
          Issue Type: Sub-task
            Reporter: Takeshi Yamamuro
            Assignee: Takeshi Yamamuro


Mutual Information (MI) is a measure that detects and quantifies dependencies 
between variables, which makes it useful for filtering out columns in feature 
selection. Nearest-neighbor distances are frequently used to estimate MI [1], 
so we could use these distances to compute the MI between the columns of each 
relation when running an ANALYZE command. Then, we could filter out "similar" 
columns in the optimizer phase by referring to a new threshold (e.g. 
`spark.sql.optimizer.featureSelection.mutualInfoThreshold`).
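To make the idea concrete, here is a minimal Python sketch of threshold-based column filtering. It is only an illustration under several assumptions: it uses a simple plug-in (count-based) MI estimate for discrete columns instead of the nearest-neighbor estimator in [1], it treats "similar" as "MI with an already kept column exceeds the threshold", and the names `mutual_information` and `select_columns` are hypothetical, not part of Hivemall or Spark.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in MI estimate (in nats) for two discrete columns of equal length."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("columns must be non-empty and of equal length")
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), expressed with raw counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


def select_columns(columns, threshold):
    """Greedy filter (an assumption): drop a column if its MI with any
    already kept column exceeds the threshold."""
    kept = []
    for name, values in columns.items():
        if all(mutual_information(values, columns[k]) <= threshold for k in kept):
            kept.append(name)
    return kept


columns = {
    "a": [0, 0, 1, 1, 0, 1, 0, 1],
    "b": [0, 0, 1, 1, 0, 1, 0, 1],  # exact copy of "a"
    "c": [0, 1, 0, 1, 1, 0, 0, 1],  # independent of "a" in this sample
}
kept = select_columns(columns, threshold=0.5)
```

In this toy input, "b" duplicates "a", so their MI equals the entropy of "a" and "b" is dropped, while "c" is kept. A real rule would read the precomputed MI values from the statistics collected by ANALYZE rather than recompute them in the optimizer.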

As a separate issue, we need to consider a lightweight way to update MI when 
re-running an ANALYZE command. A recent study [2] proposed a sophisticated 
technique for computing MI on dynamic data.
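One simple way to picture a lightweight update is to keep the counts that MI is derived from and fold new rows into them. The sketch below does this for discrete columns. It is not the technique proposed in [2], which targets nearest-neighbor estimation, and the class name `IncrementalMI` is hypothetical.

```python
import math
from collections import Counter


class IncrementalMI:
    """Keeps joint and marginal counts so MI can be refreshed without a rescan."""

    def __init__(self):
        self.n = 0
        self.joint = Counter()
        self.px = Counter()
        self.py = Counter()

    def update(self, rows):
        # Fold newly arrived (x, y) pairs into the stored counts.
        for x, y in rows:
            self.joint[(x, y)] += 1
            self.px[x] += 1
            self.py[y] += 1
            self.n += 1

    def value(self):
        # Plug-in MI estimate (in nats) from the current counts.
        if self.n == 0:
            return 0.0
        return sum(
            (c / self.n) * math.log(c * self.n / (self.px[x] * self.py[y]))
            for (x, y), c in self.joint.items()
        )


stats = IncrementalMI()
stats.update([(0, 0), (1, 1), (0, 0), (1, 1)])
first = stats.value()   # columns are identical so far
stats.update([(0, 1), (1, 0), (0, 1), (1, 0)])
second = stats.value()  # new rows remove the dependence
```

Each update costs time proportional to the number of new rows, and the stored state grows with the number of distinct value pairs, so this only suits low-cardinality columns.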

[1] Dafydd Evans, A computationally efficient estimator for mutual information.
In Proceedings of the Royal Society of London A: Mathematical, Physical
and Engineering Sciences, Vol. 464. The Royal Society, 1203–1215, 2008.
[2] Michael Vollmer et al., On Complexity and Efficiency of Mutual Information
Estimation on Static and Dynamic Data, Proceedings of EDBT, 2018.



