[ https://issues.apache.org/jira/browse/SPARK-24078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
zhangsongcheng updated SPARK-24078:
-----------------------------------
    Description:
I am trying to sample the training set for each category and then union all of the samples together. This is my code:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.col

    def balanceCategory(dataSet: DataFrame): DataFrame = {
      // Build one under-sampled DataFrame per category.
      val samples = LabelConf.categorys.map { category =>
        val tmpDataSet = dataSet.filter(col("category_id") === category)
        val sample = underSample(tmpDataSet, category)
        sample
      }
      // Union all per-category samples into a single DataFrame.
      samples.reduce((x, y) => x.unionAll(y))
    }

    def underSample(dataSet: DataFrame, cardID: String): DataFrame = {
      // Sample 10% of the positive and negative rows, without replacement.
      val positiveSample = dataSet.filter(col("label") > 0.5).sample(false, 0.1)
      val negativeSample = dataSet.filter(col("label") < 0.5).sample(false, 0.1)
      positiveSample.unionAll(negativeSample)
    }

However, the code stalls at {{samples.reduce((x, y) => x.unionAll(y))}}: it runs more and more slowly and eventually stops making progress. This has confused me for a long time. Can anyone help? Thank you!
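A left-to-right {{reduce}} nests each {{unionAll}} inside the previous one, so the depth of the combined plan grows with the number of categories. If that nesting contributes to the slowdown (an assumption, not something confirmed in this report), one thing to try is combining the samples pairwise so the nesting depth grows only logarithmically. The sketch below is plain Scala with no Spark dependency; {{BalancedReduce}} and {{balancedReduce}} are hypothetical names, not part of Spark.

```scala
import scala.annotation.tailrec

// Hypothetical helper (not part of Spark): combines adjacent items pairwise,
// level by level, so the combine calls nest to a depth of about log2(n)
// instead of n - 1 as with a plain left-to-right reduce.
object BalancedReduce {
  @tailrec
  def balancedReduce[A](items: Seq[A])(combine: (A, A) => A): A = {
    require(items.nonEmpty, "balancedReduce needs at least one item")
    if (items.size == 1) {
      items.head
    } else {
      // Combine neighbours; an unpaired last item is carried up unchanged.
      val nextLevel = items.grouped(2).map {
        case Seq(left, right) => combine(left, right)
        case Seq(single)      => single
      }.toVector
      balancedReduce(nextLevel)(combine)
    }
  }
}
```

With the code from the description, the call would be {{BalancedReduce.balancedReduce(samples)(_ unionAll _)}} in place of {{samples.reduce(...)}}. Items are combined in their original order, so for an associative operation the result matches a plain {{reduce}}.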
> reduce with unionAll takes a long time
> --------------------------------------
>
>                 Key: SPARK-24078
>                 URL: https://issues.apache.org/jira/browse/SPARK-24078
>             Project: Spark
>          Issue Type: Bug
>          Components: Build
>    Affects Versions: 1.6.3
>            Reporter: zhangsongcheng
>            Priority: Major