[ https://issues.apache.org/jira/browse/SPARK-8794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-8794:
------------------------------------
    Shepherd: Michael Armbrust
    Assignee: Liang-Chi Hsieh

> Column pruning isn't applied beneath sample
> -------------------------------------------
>
>                 Key: SPARK-8794
>                 URL: https://issues.apache.org/jira/browse/SPARK-8794
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.0
>            Reporter: Eron Wright 
>            Assignee: Liang-Chi Hsieh
>
> I observe that certain DataFrame transformations (e.g. sample) cause the
> underlying relation's support for column pruning to be disregarded in
> subsequent queries.
> I encountered this issue while using an ML pipeline with a typical dataset of
> (label, features). For my particular data source (which implements
> PrunedScan), the 'features' column is expensive to compute while the 'label'
> column is cheap. The first stage of the pipeline, StringIndexer, operates
> only on the label and so should be quick. Yet I found that the 'features'
> column was materialized anyway. Upon investigation, the issue occurs when
> the dataset is split into train/test sets by sampling: the sampling
> transformation causes the pruning optimization to be lost.
> See this gist for a sample program demonstrating the issue:
> [https://gist.github.com/EronWright/cb5fb9af46fd810194f8]
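
The behavior described above can be illustrated with a toy logical plan. This is only a sketch under stated assumptions: it is plain Python, not Spark or Catalyst code, and the names Relation, Sample, Project, push_project_below_sample and scanned_columns are hypothetical, introduced just for this illustration. It assumes that sampling keeps or drops whole rows without reading column values, so a projection can be moved beneath it without changing the result.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Relation:
    """Leaf node: a data source able to scan only the requested columns."""
    columns: Tuple[str, ...]


@dataclass(frozen=True)
class Sample:
    """Row-level sampling; assumed never to inspect column values."""
    fraction: float
    child: "Plan"


@dataclass(frozen=True)
class Project:
    """Column selection."""
    columns: Tuple[str, ...]
    child: "Plan"


Plan = Union[Relation, Sample, Project]


def push_project_below_sample(plan):
    """Rewrite Project(Sample(child)) into Sample(Project(child))."""
    if isinstance(plan, Project):
        child = push_project_below_sample(plan.child)
        if isinstance(child, Sample):
            pushed = Project(plan.columns, child.child)
            return Sample(child.fraction, push_project_below_sample(pushed))
        return Project(plan.columns, child)
    if isinstance(plan, Sample):
        return Sample(plan.fraction, push_project_below_sample(plan.child))
    return plan


def scanned_columns(plan):
    """Columns the relation has to materialize for this plan.

    Only a Project sitting directly on top of the Relation prunes the scan.
    """
    if isinstance(plan, Project) and isinstance(plan.child, Relation):
        return plan.columns
    if isinstance(plan, Relation):
        return plan.columns
    return scanned_columns(plan.child)


relation = Relation(("label", "features"))
# Selecting only 'label' from a sampled dataset, as in the reported pipeline.
query = Project(("label",), Sample(0.8, relation))

before = scanned_columns(query)
after = scanned_columns(push_project_below_sample(query))
print(before)  # the Sample node hides the projection from the relation
print(after)   # the projection now sits directly above the relation
```

In this model the unrewritten plan scans both columns because the Sample node sits between the projection and the relation; after the rewrite only 'label' is requested from the source.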


