[ https://issues.apache.org/jira/browse/SPARK-8794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Armbrust updated SPARK-8794:
------------------------------------
Shepherd: Michael Armbrust
Assignee: Liang-Chi Hsieh
> Column pruning isn't applied beneath sample
> -------------------------------------------
>
> Key: SPARK-8794
> URL: https://issues.apache.org/jira/browse/SPARK-8794
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.4.0
> Reporter: Eron Wright
> Assignee: Liang-Chi Hsieh
>
> I observe that certain transformations (e.g. sample) on a DataFrame cause the
> underlying relation's support for column pruning to be disregarded in
> subsequent queries.
> I encountered this issue while using an ML pipeline with a typical dataset of
> (label, features). For my particular data source (which implements
> PrunedScan), the 'features' column is expensive to compute while the 'label'
> column is cheap. The first stage of the pipeline - StringIndexer - operates
> only on the label and so should be quick. Yet I found that the 'features'
> column was still materialized. Upon investigation, I found that the issue
> occurs when the dataset is split into train/test with sampling. The sampling
> transformation causes the pruning optimization to be lost.
> See this gist for a sample program demonstrating the issue:
> [https://gist.github.com/EronWright/cb5fb9af46fd810194f8]
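
The behaviour described above can be illustrated with a small toy model. This is only a sketch in plain Python, not Spark's Catalyst code and not the program from the gist: the `Relation`, `Sample` and `Project` classes, the `prune` function and the `push_through_sample` flag are all hypothetical names invented for illustration. It assumes that sampling selects rows independently of which columns are kept, so a projection can be moved beneath it.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Relation:
    """Leaf scan; `columns` is what the data source is asked to produce."""
    columns: Tuple[str, ...]


@dataclass(frozen=True)
class Sample:
    """Row sampling over a child plan (hypothetical stand-in for sample)."""
    fraction: float
    child: "Plan"


@dataclass(frozen=True)
class Project:
    """Column selection over a child plan."""
    columns: Tuple[str, ...]
    child: "Plan"


Plan = Union[Relation, Sample, Project]


def prune(plan: Plan, push_through_sample: bool) -> Plan:
    """Push projections towards the leaf relation (toy pruning rule)."""
    if isinstance(plan, Project):
        child = plan.child
        if isinstance(child, Relation):
            # Narrow the scan itself to the requested columns.
            kept = tuple(c for c in child.columns if c in plan.columns)
            return Relation(kept)
        if isinstance(child, Sample) and push_through_sample:
            # Assumption of this model: sampling picks rows, not columns,
            # so the projection can be applied beneath it.
            pushed = prune(Project(plan.columns, child.child), push_through_sample)
            return Sample(child.fraction, pushed)
        # Otherwise the projection stays where it is.
        return Project(plan.columns, prune(child, push_through_sample))
    if isinstance(plan, Sample):
        return Sample(plan.fraction, prune(plan.child, push_through_sample))
    return plan


def scanned_columns(plan: Plan) -> Tuple[str, ...]:
    """Walk down to the leaf and report which columns the source must produce."""
    while not isinstance(plan, Relation):
        plan = plan.child
    return plan.columns


# Select only 'label' from a sampled (label, features) dataset.
query = Project(("label",), Sample(0.8, Relation(("label", "features"))))

without_fix = scanned_columns(prune(query, push_through_sample=False))
with_fix = scanned_columns(prune(query, push_through_sample=True))

print(without_fix)  # ('label', 'features')
print(with_fix)     # ('label',)
```

With `push_through_sample=False` the projection stops at the sample node and the leaf still produces the expensive 'features' column, which corresponds to the symptom reported here. With the flag set, only 'label' reaches the scan.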