[ https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16442176#comment-16442176 ]
liuxian edited comment on SPARK-23989 at 4/18/18 9:21 AM:
----------------------------------------------------------

    test("groupBy") {
      spark.conf.set("spark.sql.shuffle.partitions", 16777217)
      val df1 = Seq(("a", 1, 0, "b"), ("b", 2, 4, "c"), ("a", 2, 3, "d"))
        .toDF("key", "value1", "value2", "rest")
      checkAnswer(
        df1.groupBy("key").min("value2"),
        Seq(Row("a", 0), Row("b", 4))
      )
    }

Because the number of partitions is so large, this test runs for a long time. The partition count is set this high on purpose, to force the shuffle down the `SortShuffleWriter` path.
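The choice of 16777217 is not arbitrary: it is 2^24 + 1. Spark's serialized ("unsafe") shuffle packs the partition id into 24 bits of a record pointer (`PackedRecordPointer`), so it can address at most 2^24 partitions; requesting one more disqualifies the serialized path and Spark falls back to `SortShuffleWriter`. A quick sanity check of the arithmetic:

```scala
// Spark's serialized shuffle encodes the partition id in 24 bits, so the
// maximum number of shuffle output partitions it supports is 2^24.
val maxSerializedShufflePartitions = 1 << 24   // 16777216

// The value used in the test above: one past the limit.
val partitionsInTest = 16777217

// Exceeding the limit forces the fallback to SortShuffleWriter.
val forcesSortShuffleWriter = partitionsInTest > maxSerializedShufflePartitions
```

This is why the test needs such an extreme setting: any value at or below 2^24 would still be handled by the serialized shuffle writer and never exercise the code path under discussion.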
> When using `SortShuffleWriter`, the data will be overwritten
> ------------------------------------------------------------
>
>                 Key: SPARK-23989
>                 URL: https://issues.apache.org/jira/browse/SPARK-23989
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.0
>            Reporter: liuxian
>            Priority: Critical
>
> When using `SortShuffleWriter`, we only insert references (`AnyRef`) into
> `PartitionedAppendOnlyMap` or `PartitionedPairBuffer`.
> For this function:
>
>     override def write(records: Iterator[Product2[K, V]])
>
> the values yielded by `records` are `UnsafeRow` instances, which the iterator
> reuses across elements, so values stored by reference will be overwritten.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
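The overwrite mechanism described in the report can be sketched without Spark at all. In this minimal model, the hypothetical `MutableRow` class plays the role of `UnsafeRow`, and the reusing iterator stands in for `records`; the class and helper names are made up for illustration, not Spark APIs:

```scala
// Hypothetical stand-in for UnsafeRow: a mutable object the producer reuses.
final class MutableRow(var value: Int)

// An iterator that, like Spark's row iterators, returns the SAME mutable
// instance on every next() call, mutating it in place each time.
def reusingIterator(values: Seq[Int]): Iterator[MutableRow] = {
  val row = new MutableRow(0)
  values.iterator.map { v => row.value = v; row }
}

// Buffering references (what inserting the raw AnyRef does): every slot of
// the buffer ends up pointing at the one shared object, so after the
// iterator is exhausted all slots show the LAST value.
val byReference = reusingIterator(Seq(1, 2, 3)).toArray.map(_.value)

// Buffering copies (the fix): snapshot each element before storing it, so
// earlier values survive.
val byCopy = reusingIterator(Seq(1, 2, 3))
  .map(r => new MutableRow(r.value)) // defensive copy before buffering
  .toArray
  .map(_.value)
```

Here `byReference` comes out as `Array(3, 3, 3)` while `byCopy` preserves `Array(1, 2, 3)`, which is exactly the corruption pattern the issue describes for `PartitionedAppendOnlyMap` and `PartitionedPairBuffer` fed by reused `UnsafeRow`s.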