[ 
https://issues.apache.org/jira/browse/SPARK-26166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinyong Tian updated SPARK-26166:
---------------------------------
    Description: 
In pyspark.ml.tuning.CrossValidator.fit(), after the random column is added:

df = dataset.select("*", rand(seed).alias(randCol))

the following call should be added:

df.checkpoint()

If df is not checkpointed, it is recomputed each time the train and validation DataFrames are created. The order of rows in df, on which rand(seed) depends, is not deterministic, so the random column value for a given row can differ across recomputations even with a fixed seed. Note that checkpoint() cannot be replaced with cache(): if a node fails, a cached table may still be recomputed, and the random numbers could change.

This can especially be a problem when the input 'dataset' DataFrame results from a query that includes a 'where' clause. See below.

[https://dzone.com/articles/non-deterministic-order-for-select-with-limit]
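The failure mode can be illustrated without Spark. The sketch below is a minimal, hypothetical analogue: the assign_folds helper stands in for rand(seed)-based fold assignment, where the random value a row receives depends on the row's position in the scan order rather than on the row itself. If a recomputation returns the same rows in a different order, the same seed yields different fold assignments per row.

```python
import random

def assign_folds(rows, seed, k=3):
    # Hypothetical analogue of rand(seed): the value a row receives
    # depends on its position in the iteration order, not its content.
    rng = random.Random(seed)
    return {row: int(rng.random() * k) for row in rows}

rows = ["a", "b", "c", "d"]
first = assign_folds(rows, seed=42)
# A recomputation that yields the same rows in a different order:
second = assign_folds(list(reversed(rows)), seed=42)
# Same seed, same rows, yet per-row fold assignments differ,
# so a row can land in training on one pass and validation on another.
```

This is why caching alone is insufficient: any recomputation (e.g. after node failure) can reorder rows and reshuffle the assignments, whereas checkpoint() materializes df so the random column is fixed.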


  was:
In pyspark.ml.tuning.CrossValidator.fit(), after the random column is added:

df = dataset.select("*", rand(seed).alias(randCol))

the following call should be added:

df.cache()

If df is not cached, it is recomputed each time the train and validation DataFrames are created. The order of rows in df, on which rand(seed) depends, is not deterministic, so the random column value for a given row can differ even with a fixed seed.

This can especially be a problem when the input 'dataset' DataFrame results from a query that includes a 'where' clause. See below.

https://dzone.com/articles/non-deterministic-order-for-select-with-limit

 


> CrossValidator.fit() bug,training and validation dataset may overlap
> --------------------------------------------------------------------
>
>                 Key: SPARK-26166
>                 URL: https://issues.apache.org/jira/browse/SPARK-26166
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.3.0
>            Reporter: Xinyong Tian
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
