[ https://issues.apache.org/jira/browse/SPARK-24078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451679#comment-16451679 ]
Hyukjin Kwon commented on SPARK-24078:
--------------------------------------

Would you be able to test this in a higher version of Spark?

> reduce with unionAll takes a long time
> --------------------------------------
>
>                 Key: SPARK-24078
>                 URL: https://issues.apache.org/jira/browse/SPARK-24078
>             Project: Spark
>          Issue Type: Bug
>          Components: Build
>    Affects Versions: 1.6.3
>            Reporter: zhangsongcheng
>            Priority: Major
>
> I am trying to sample the training set for each category and then union all
> the samples together. This is my code:
>
> def balance4Single(dataSet: DataFrame): DataFrame = {
>   val samples = LabelConf.cardIDList.map { cardID =>
>     val tmpDataSet = dataSet.filter(col("card_id") === cardID)
>     val sample = underSample(tmpDataSet, cardID)
>     sample
>   }
>   samples.reduce((x, y) => x.unionAll(y))
> }
>
> def underSample(dataSet: DataFrame, cardID: String): DataFrame = {
>   val positiveSample = dataSet.filter(col("label") > 0.5).sample(false, 0.1)
>   val negativeSample = dataSet.filter(col("label") < 0.5).sample(false, 0.1)
>   positiveSample.unionAll(negativeSample).distinct()
> }
>
> But the code blocks at {{samples.reduce((x, y) => x.unionAll(y))}}: it runs
> more and more slowly, and eventually stops making progress. This has confused
> me for a long time. Can anyone help me? Thank you!
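
One plausible explanation, which this thread does not confirm, is that `samples.reduce((x, y) => x.unionAll(y))` builds a left-deep chain of nested unions, so the plan depth grows linearly with the number of card IDs. The sketch below uses no Spark at all: `Plan`, `Leaf`, `Union`, `depth` and `treeReduce` are hypothetical names introduced only to compare the nesting depth of a plain `reduce` with a pairwise (balanced) reduction.

```scala
object UnionDepthDemo {
  // Toy stand-in for a query plan; these types are illustrative, not Spark classes.
  sealed trait Plan
  case class Leaf(name: String) extends Plan
  case class Union(left: Plan, right: Plan) extends Plan

  // Number of nested Union nodes on the longest path to a leaf.
  def depth(p: Plan): Int = p match {
    case Leaf(_)     => 0
    case Union(l, r) => 1 + math.max(depth(l), depth(r))
  }

  // Hypothetical helper: combine neighbours level by level, so the result
  // has depth of about log2(n) instead of n - 1.
  def treeReduce[A](items: Seq[A])(combine: (A, A) => A): A = {
    require(items.nonEmpty, "treeReduce needs at least one element")
    var level: Seq[A] = items
    while (level.size > 1) {
      // An odd element at the end of a level is carried up unchanged.
      level = level.grouped(2).map {
        case Seq(a, b) => combine(a, b)
        case Seq(a)    => a
      }.toVector
    }
    level.head
  }

  def main(args: Array[String]): Unit = {
    val leaves: Seq[Plan] = (1 to 16).map(i => Leaf("sample" + i))
    val chained  = leaves.reduce((x, y) => Union(x, y))
    val balanced = treeReduce(leaves)((x, y) => Union(x, y))
    println(depth(chained))  // 15
    println(depth(balanced)) // 4
  }
}
```

Because `treeReduce` is generic, the same helper could in principle be applied to the DataFrames in the report as `treeReduce(samples)(_ unionAll _)`. Whether that actually removes the slowdown on 1.6.3 is an assumption that would need to be measured.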