[jira] [Assigned] (SPARK-17637) Packed scheduling for Spark tasks across executors
[ https://issues.apache.org/jira/browse/SPARK-17637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17637: Assignee: Apache Spark (was: Zhan Zhang)

> Packed scheduling for Spark tasks across executors
> --
>
> Key: SPARK-17637
> URL: https://issues.apache.org/jira/browse/SPARK-17637
> Project: Spark
> Issue Type: Improvement
> Components: Scheduler
> Affects Versions: 2.1.0
> Reporter: Zhan Zhang
> Assignee: Apache Spark
> Priority: Minor
> Fix For: 2.1.0
>
> Currently the Spark scheduler assigns tasks to executors round-robin. This distributes load evenly across the cluster, but it can lead to significant resource waste in some cases, especially when dynamic allocation is enabled.

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
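The difference between the current behavior and the proposal can be illustrated with a small assignment sketch. This is a hypothetical model, not Spark's actual `TaskSchedulerImpl`; the function names and the `slots_per_executor` parameter are illustrative. Round-robin spreads tasks across all executors; packed scheduling fills one executor's slots before moving to the next, leaving whole executors idle so dynamic allocation can release them.

```python
# Hypothetical sketch contrasting round-robin vs. "packed" task assignment.
# Names and parameters are illustrative, not Spark's real scheduler API.

def assign_round_robin(num_tasks, executors, slots_per_executor):
    """Spread tasks evenly: task i goes to executor i % len(executors)."""
    assignment = {e: [] for e in executors}
    for i in range(num_tasks):
        assignment[executors[i % len(executors)]].append(i)
    return assignment

def assign_packed(num_tasks, executors, slots_per_executor):
    """Fill each executor's slots completely before using the next one."""
    assignment = {e: [] for e in executors}
    tasks = iter(range(num_tasks))
    for e in executors:
        for _ in range(slots_per_executor):
            try:
                assignment[e].append(next(tasks))
            except StopIteration:
                return assignment
    return assignment

executors = ["exec-1", "exec-2", "exec-3"]
rr = assign_round_robin(4, executors, slots_per_executor=4)
packed = assign_packed(4, executors, slots_per_executor=4)
# Round-robin touches all three executors; packed leaves exec-2 and exec-3
# completely idle, so dynamic allocation could release them.
```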
[jira] [Assigned] (SPARK-17637) Packed scheduling for Spark tasks across executors
[ https://issues.apache.org/jira/browse/SPARK-17637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17637: Assignee: Zhan Zhang (was: Apache Spark)
[jira] [Reopened] (SPARK-17637) Packed scheduling for Spark tasks across executors
[ https://issues.apache.org/jira/browse/SPARK-17637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin reopened SPARK-17637:
[jira] [Commented] (SPARK-17637) Packed scheduling for Spark tasks across executors
[ https://issues.apache.org/jira/browse/SPARK-17637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15579382#comment-15579382 ] Reynold Xin commented on SPARK-17637: - Note: I reverted the commit due to quality issues. We can improve it and merge it again.
[jira] [Created] (SPARK-17958) Why I ran into issue "accumulator, copyAndReset must be zero error"
dianwei Han created SPARK-17958:
---
Summary: Why I ran into issue "accumulator, copyAndReset must be zero error"
Key: SPARK-17958
URL: https://issues.apache.org/jira/browse/SPARK-17958
Project: Spark
Issue Type: Question
Components: Java API
Affects Versions: 2.0.1
Environment: linux
Reporter: dianwei Han
Priority: Blocker

I used to run this Spark code under Spark 1.5 with no problem at all. Now, running the same code on Spark 2.0, I see this error:

{noformat}
Exception in thread "main" java.lang.AssertionError: assertion failed: copyAndReset must return a zero value copy
	at scala.Predef$.assert(Predef.scala:170)
	at org.apache.spark.util.AccumulatorV2.writeReplace(AccumulatorV2.scala:157)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
{noformat}

Please help or give a suggestion on how to fix this issue? Really appreciate it, Dianwei
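The assertion in the stack trace comes from the `AccumulatorV2` contract: `copyAndReset` must return a copy for which `isZero` is true, and the check fires at serialization time in `writeReplace`. A common trigger is a custom accumulator whose `copyAndReset` carries the current value into the "reset" copy. A minimal plain-Python model of that contract (class and method names here are illustrative; the real check is the Scala code at `AccumulatorV2.scala:157`):

```python
# Plain-Python model of the AccumulatorV2 contract behind
# "copyAndReset must return a zero value copy". Illustrative only;
# the real assertion lives in Scala's AccumulatorV2.writeReplace.

class LongAccumulator:
    def __init__(self, value=0):
        self.value = value

    def is_zero(self):
        return self.value == 0

    def copy_and_reset(self):
        # Correct: return a fresh zero-valued copy.
        return LongAccumulator(0)

class BrokenAccumulator(LongAccumulator):
    def copy_and_reset(self):
        # Bug: carries the current value into the "reset" copy,
        # so the serialization-time assertion fails.
        return BrokenAccumulator(self.value)

def write_replace(acc):
    """Mimics the serialization-time check in AccumulatorV2."""
    copy = acc.copy_and_reset()
    assert copy.is_zero(), "copyAndReset must return a zero value copy"
    return copy

write_replace(LongAccumulator(42))        # fine
try:
    write_replace(BrokenAccumulator(42))  # raises AssertionError
except AssertionError as e:
    print(e)
```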
[jira] [Commented] (SPARK-17957) Calling outer join and na.fill(0) and then inner join will miss rows
[ https://issues.apache.org/jira/browse/SPARK-17957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15579294#comment-15579294 ] Linbo commented on SPARK-17957: --- Thank you!

> Calling outer join and na.fill(0) and then inner join will miss rows
> --
>
> Key: SPARK-17957
> URL: https://issues.apache.org/jira/browse/SPARK-17957
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.1
> Environment: Spark 2.0.1, Mac, Local
> Reporter: Linbo
> Priority: Critical
> Labels: correctness
>
> I reported a similar bug two months ago and it was fixed in Spark 2.0.1: https://issues.apache.org/jira/browse/SPARK-17060. But I have found a new bug: when I insert an na.fill(0) call between the outer join and the inner join in the same workflow as SPARK-17060, I get a wrong result.
> {code:title=spark-shell|borderStyle=solid}
> scala> val a = Seq((1, 2), (2, 3)).toDF("a", "b")
> a: org.apache.spark.sql.DataFrame = [a: int, b: int]
> scala> val b = Seq((2, 5), (3, 4)).toDF("a", "c")
> b: org.apache.spark.sql.DataFrame = [a: int, c: int]
> scala> val ab = a.join(b, Seq("a"), "fullouter").na.fill(0)
> ab: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
> scala> ab.show
> +---+---+---+
> |  a|  b|  c|
> +---+---+---+
> |  1|  2|  0|
> |  3|  0|  4|
> |  2|  3|  5|
> +---+---+---+
> scala> val c = Seq((3, 1)).toDF("a", "d")
> c: org.apache.spark.sql.DataFrame = [a: int, d: int]
> scala> c.show
> +---+---+
> |  a|  d|
> +---+---+
> |  3|  1|
> +---+---+
> scala> ab.join(c, "a").show
> +---+---+---+---+
> |  a|  b|  c|  d|
> +---+---+---+---+
> +---+---+---+---+
> {code}
> And again, if I use persist, the result is correct. I think the problem is in the join optimizer, similar to this PR: https://github.com/apache/spark/pull/14661
> {code:title=spark-shell|borderStyle=solid}
> scala> val ab = a.join(b, Seq("a"), "outer").na.fill(0).persist
> ab: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int ... 1 more field]
> scala> ab.show
> +---+---+---+
> |  a|  b|  c|
> +---+---+---+
> |  1|  2|  0|
> |  3|  0|  4|
> |  2|  3|  5|
> +---+---+---+
> scala> ab.join(c, "a").show
> +---+---+---+---+
> |  a|  b|  c|  d|
> +---+---+---+---+
> |  3|  0|  4|  1|
> +---+---+---+---+
> {code}
[jira] [Updated] (SPARK-17957) Calling outer join and na.fill(0) and then inner join will miss rows
[ https://issues.apache.org/jira/browse/SPARK-17957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-17957: Priority: Critical (was: Major)
[jira] [Updated] (SPARK-17957) Calling outer join and na.fill(0) and then inner join will miss rows
[ https://issues.apache.org/jira/browse/SPARK-17957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-17957: Target Version/s: 2.0.2, 2.1.0 (was: 2.0.2)
[jira] [Commented] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool
[ https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15579284#comment-15579284 ] Shixiong Zhu commented on SPARK-13747: -- [~chinwei] could you post the stack trace here?

> Concurrent execution in SQL doesn't work with Scala ForkJoinPool
> --
>
> Key: SPARK-13747
> URL: https://issues.apache.org/jira/browse/SPARK-13747
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Shixiong Zhu
> Assignee: Andrew Or
> Fix For: 2.0.0
>
> Running the following code may fail:
> {code}
> (1 to 100).par.foreach { _ =>
>   println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count())
> }
> java.lang.IllegalArgumentException: spark.sql.execution.id is already set
>   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>   at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904)
>   at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385)
> {code}
> This is because SparkContext.runJob can be suspended when using a ForkJoinPool (e.g., scala.concurrent.ExecutionContext.Implicits.global), as it calls Await.ready (introduced by https://github.com/apache/spark/pull/9264). So when SparkContext.runJob is suspended, the ForkJoinPool will run another task in the same thread; however, the local properties have been polluted.
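The failure mode described above can be modeled without Spark or a real ForkJoinPool. In this deliberately simplified Python sketch (all names are illustrative, not Spark code), the "stolen" task is invoked directly from inside the suspended task to simulate, deterministically, a work-stealing pool running a second task on the same thread whose thread-local properties are still set:

```python
# Simplified model of SPARK-13747 (illustrative, not Spark code): a guard
# like "spark.sql.execution.id is already set" breaks when a work-stealing
# pool runs a second task on a thread whose first task is suspended mid-job.
import threading

local_props = threading.local()

def with_new_execution_id(body):
    """Set a per-thread execution id around body(), refusing re-entry."""
    if getattr(local_props, "execution_id", None) is not None:
        raise RuntimeError("spark.sql.execution.id is already set")
    local_props.execution_id = object()
    try:
        return body()
    finally:
        local_props.execution_id = None

def run_query(steal=None):
    def body():
        # While this "job" is suspended (think Await.ready), a ForkJoinPool
        # may steal another task onto the SAME thread; we simulate that by
        # calling it directly here.
        if steal is not None:
            steal()
        return "ok"
    return with_new_execution_id(body)

assert run_query() == "ok"                # sequential case is fine
try:
    run_query(steal=lambda: run_query())  # stolen task on the same thread
except RuntimeError as e:
    print(e)                              # the "already set" guard fires
```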
[jira] [Updated] (SPARK-17957) Calling outer join and na.fill(0) and then inner join will miss rows
[ https://issues.apache.org/jira/browse/SPARK-17957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-17957: Labels: correctness (was: joins na.fill)
[jira] [Commented] (SPARK-17957) Calling outer join and na.fill(0) and then inner join will miss rows
[ https://issues.apache.org/jira/browse/SPARK-17957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15579282#comment-15579282 ] Xiao Li commented on SPARK-17957: - Found the bug.

{noformat}
Project [a#29, b#30, c#31, d#48]
+- Join Inner, (a#29 = a#47)
   :- Project [cast(coalesce(cast(coalesce(a#5, a#15) as double), 0.0) as int) AS a#29, cast(coalesce(cast(b#6 as double), 0.0) as int) AS b#30, cast(coalesce(cast(c#16 as double), 0.0) as int) AS c#31]
   :  +- Filter isnotnull(cast(coalesce(cast(coalesce(a#5, a#15) as double), 0.0) as int))
   :     +- Join FullOuter, (a#5 = a#15)
   :        :- LocalRelation [a#5, b#6]
   :        +- LocalRelation [a#15, c#16]
   +- LocalRelation [a#47, d#48]
{noformat}

Will fix it soon.
[jira] [Commented] (SPARK-17954) FetchFailedException executor cannot connect to another worker executor
[ https://issues.apache.org/jira/browse/SPARK-17954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15579262#comment-15579262 ] Tejas Patil commented on SPARK-17954: - I agree with [~srowen]'s comment about this being more of a networking or IP/hostname problem. Of course, this is based on the limited information that we have at hand. Have you investigated this issue? - Were you able to ping / telnet from one node to another while this exception happened? If you are running in docker / other container stuff, you will have to do this check at the container level. - You mentioned that things worked fine with 1.6. Was it running side by side in the same cluster, in the same time window when 2.0 had fetch failures? > FetchFailedException executor cannot connect to another worker executor > --- > > Key: SPARK-17954 > URL: https://issues.apache.org/jira/browse/SPARK-17954 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0, 2.0.1 >Reporter: Vitaly Gerasimov > > I have a standalone mode spark cluster which has three nodes: > master.test > worker1.test > worker2.test > I am trying to run the following code in spark shell: > {code} > val json = spark.read.json("hdfs://master.test/json/a.js.gz", > "hdfs://master.test/json/b.js.gz") > json.createOrReplaceTempView("messages") > spark.sql("select count(*) from messages").show() > {code} > and I am getting the following exception: > {code} > org.apache.spark.shuffle.FetchFailedException: Failed to connect to > worker1.test/x.x.x.x:51029 > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at 
scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Failed to connect to > worker1.test/x.x.x.x:51029 > at > 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228) > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179) > at > org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:96) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:
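The ping/telnet check suggested in the comment above can be scripted so it is easy to run from every node (or inside every container). This is an illustrative helper, equivalent to `telnet worker1.test 51029`; the host and port in the trailing comment come from the reported exception:

```python
# Minimal TCP reachability probe (illustrative): the same check as
# "telnet <host> <port>", runnable from any node or container.
import socket

def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. run can_connect("worker1.test", 51029) from the node whose executor
# reported "Failed to connect to worker1.test/x.x.x.x:51029". Note that
# shuffle server ports are ephemeral by default, so probe while the job
# (and thus the executor) is still alive.
```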
[jira] [Commented] (SPARK-17957) Calling outer join and na.fill(0) and then inner join will miss rows
[ https://issues.apache.org/jira/browse/SPARK-17957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15579260#comment-15579260 ] Xiao Li commented on SPARK-17957: - You can see the plan. The optimized plan still contains the full outer join. : ) Thus, it should not be caused by outer-join elimination.

{noformat}
val a = Seq((1, 2), (2, 3)).toDF("a", "b")
val b = Seq((2, 5), (3, 4)).toDF("a", "c")
val ab = a.join(b, Seq("a"), "fullouter").na.fill(0)
val c = Seq((3, 1)).toDF("a", "d")
ab.join(c, "a").explain(true)
{noformat}

{noformat}
== Optimized Logical Plan ==
Project [a#29, b#30, c#31, d#42]
+- Join Inner, (a#29 = a#41)
   :- Project [cast(coalesce(cast(coalesce(a#5, a#15) as double), 0.0) as int) AS a#29, cast(coalesce(cast(b#6 as double), 0.0) as int) AS b#30, cast(coalesce(cast(c#16 as double), 0.0) as int) AS c#31]
   :  +- Filter isnotnull(cast(coalesce(cast(coalesce(a#5, a#15) as double), 0.0) as int))
   :     +- Join FullOuter, (a#5 = a#15)
   :        :- LocalRelation [a#5, b#6]
   :        +- LocalRelation [a#15, c#16]
   +- LocalRelation [a#41, d#42]
{noformat}

Let me find the cause.
[jira] [Commented] (SPARK-17957) Calling outer join and na.fill(0) and then inner join will miss rows
[ https://issues.apache.org/jira/browse/SPARK-17957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15579252#comment-15579252 ] Xiao Li commented on SPARK-17957: - Thank you for reporting it. Let me do a quick check.
[jira] [Assigned] (SPARK-17812) More granular control of starting offsets (assign)
[ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17812: Assignee: Apache Spark (was: Cody Koeninger)

> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Michael Armbrust
> Assignee: Apache Spark
>
> Right now you can only run a Streaming Query starting from either the earliest or latest offsets available at the moment the query is started. Sometimes this is a lot of data. It would be nice to be able to do the following:
> - seek to user-specified offsets for manually specified topicpartitions
> Currently agreed-on plan:
> Mutually exclusive subscription options (only assign is new to this ticket):
> {noformat}
> .option("subscribe","topicFoo,topicBar")
> .option("subscribePattern","topic.*")
> .option("assign","""{"topicfoo": [0, 1],"topicbar": [0, 1]}""")
> {noformat}
> where assign can only be specified that way, no inline offsets.
> A single starting-position option with three mutually exclusive types of value:
> {noformat}
> .option("startingOffsets", "earliest" | "latest" | """{"topicFoo": {"0": 1234, "1": -2}, "topicBar": {"0": -1}}""")
> {noformat}
> startingOffsets with JSON fails if any topicpartition in the assignments doesn't have an offset.
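The constraint in the last sentence of the plan (with JSON startingOffsets, every assigned topicpartition must have an explicit offset) can be sketched as a small validator. This is a hypothetical helper, not part of Spark; the `-1`/`-2` sentinel values follow the `"1": -2` example in the plan above, where they are assumed to mean latest/earliest:

```python
# Hypothetical validator for the option scheme above: with a JSON "assign"
# like {"topicfoo": [0, 1]} and a JSON "startingOffsets", every assigned
# partition must carry an explicit offset (assumed: -1 = latest, -2 = earliest).
import json

def validate_starting_offsets(assign_json, starting_offsets_json):
    """Raise ValueError if any assigned topicpartition lacks an offset."""
    assign = json.loads(assign_json)
    offsets = json.loads(starting_offsets_json)
    missing = [
        (topic, partition)
        for topic, partitions in assign.items()
        for partition in partitions
        if str(partition) not in offsets.get(topic, {})
    ]
    if missing:
        raise ValueError(f"no starting offset for topicpartitions: {missing}")
    return offsets

# Every assigned partition of topicfoo has an offset, so this passes:
validate_starting_offsets(
    '{"topicfoo": [0, 1]}',
    '{"topicfoo": {"0": 1234, "1": -2}}',
)
```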
[jira] [Assigned] (SPARK-17812) More granular control of starting offsets (assign)
[ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17812: Assignee: Cody Koeninger (was: Apache Spark) > More granular control of starting offsets (assign) > -- > > Key: SPARK-17812 > URL: https://issues.apache.org/jira/browse/SPARK-17812 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Assignee: Cody Koeninger > > Right now you can only run a Streaming Query starting from either the > earliest or latest offsets available at the moment the query is started. > Sometimes this is a lot of data. It would be nice to be able to do the > following: > - seek to user-specified offsets for manually specified topic-partitions > The currently agreed-on plan: > Mutually exclusive subscription options (only assign is new to this ticket): > {noformat} > .option("subscribe","topicFoo,topicBar") > .option("subscribePattern","topic.*") > .option("assign","""{"topicfoo": [0, 1],"topicbar": [0, 1]}""") > {noformat} > where assign can only be specified that way, with no inline offsets. > A single starting-position option with three mutually exclusive types of value: > {noformat} > .option("startingOffsets", "earliest" | "latest" | """{"topicFoo": {"0": > 1234, "1": -2}, "topicBar":{"0": -1}}""") > {noformat} > startingOffsets with JSON fails if any topic-partition in the assignments > doesn't have an offset.
[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)
[ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15579206#comment-15579206 ] Apache Spark commented on SPARK-17812: -- User 'koeninger' has created a pull request for this issue: https://github.com/apache/spark/pull/15504 > More granular control of starting offsets (assign) > -- > > Key: SPARK-17812 > URL: https://issues.apache.org/jira/browse/SPARK-17812 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Assignee: Cody Koeninger > > Right now you can only run a Streaming Query starting from either the > earliest or latest offsets available at the moment the query is started. > Sometimes this is a lot of data. It would be nice to be able to do the > following: > - seek to user-specified offsets for manually specified topic-partitions > The currently agreed-on plan: > Mutually exclusive subscription options (only assign is new to this ticket): > {noformat} > .option("subscribe","topicFoo,topicBar") > .option("subscribePattern","topic.*") > .option("assign","""{"topicfoo": [0, 1],"topicbar": [0, 1]}""") > {noformat} > where assign can only be specified that way, with no inline offsets. > A single starting-position option with three mutually exclusive types of value: > {noformat} > .option("startingOffsets", "earliest" | "latest" | """{"topicFoo": {"0": > 1234, "1": -2}, "topicBar":{"0": -1}}""") > {noformat} > startingOffsets with JSON fails if any topic-partition in the assignments > doesn't have an offset.
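The validation rule stated at the end of the ticket — startingOffsets with JSON fails if any topic-partition in the assignments lacks an offset — can be sketched in a few lines. This is an illustration, not Spark's implementation; the reading of the sentinel values (-2 = earliest, -1 = latest, the usual Kafka convention) is an assumption here, and `validate_starting_offsets` is a hypothetical helper name.

```python
import json

def validate_starting_offsets(assign_json: str, offsets_json: str) -> None:
    """Check that every topic-partition in an "assign" JSON has a starting
    offset in a "startingOffsets" JSON, per the rule in the ticket."""
    assignments = json.loads(assign_json)   # {"topic": [partition, ...]}
    offsets = json.loads(offsets_json)      # {"topic": {"partition": offset}}
    for topic, partitions in assignments.items():
        for p in partitions:
            if str(p) not in offsets.get(topic, {}):
                raise ValueError(f"no starting offset for {topic}-{p}")

assign = '{"topicfoo": [0, 1], "topicbar": [0, 1]}'
# -2 / -1 are taken to mean "earliest" / "latest" for that one partition.
good = '{"topicfoo": {"0": 1234, "1": -2}, "topicbar": {"0": -1, "1": -1}}'
bad = '{"topicfoo": {"0": 1234}, "topicbar": {"0": -1}}'

validate_starting_offsets(assign, good)   # passes: every partition covered
try:
    validate_starting_offsets(assign, bad)
except ValueError as e:
    print(e)   # a partition in the assignment has no offset
```

The per-partition sentinels are what make this option more granular than a global "earliest"/"latest": one partition can resume from a concrete offset while another starts from the beginning.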
[jira] [Updated] (SPARK-17952) Java SparkSession createDataFrame method throws exception for nested JavaBeans
[ https://issues.apache.org/jira/browse/SPARK-17952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amit Baghel updated SPARK-17952: Description: As per the latest Spark documentation for Java at http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection, {quote} Nested JavaBeans and List or Array fields are supported though. {quote} However, nested JavaBeans do not work. Please see the code below. SubCategory class {code} public class SubCategory implements Serializable { private String id; private String name; public String getId() { return id; } public void setId(String id) { this.id = id; } public String getName() { return name; } public void setName(String name) { this.name = name; } } {code} Category class {code} public class Category implements Serializable { private String id; private SubCategory subCategory; public String getId() { return id; } public void setId(String id) { this.id = id; } public SubCategory getSubCategory() { return subCategory; } public void setSubCategory(SubCategory subCategory) { this.subCategory = subCategory; } } {code} SparkSample class {code} public class SparkSample { public static void main(String[] args) throws IOException { SparkSession spark = SparkSession .builder() .appName("SparkSample") .master("local") .getOrCreate(); //SubCategory SubCategory sub = new SubCategory(); sub.setId("sc-111"); sub.setName("Sub-1"); //Category Category category = new Category(); category.setId("s-111"); category.setSubCategory(sub); //categoryList List<Category> categoryList = new ArrayList<>(); categoryList.add(category); //DF Dataset<Row> dframe = spark.createDataFrame(categoryList, Category.class); dframe.show(); } } {code} The above code throws the error below. 
{code} Exception in thread "main" scala.MatchError: com.sample.SubCategory@e7391d (of class com.sample.SubCategory) at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:256) at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:251) at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103) at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:403) at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106) at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1106) at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1104) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$class.toStream(Iterator.scala:1322) at scala.collection.AbstractIterator.toStream(Iterator.scala:1336) at scala.collection.TraversableOnce$class.toSeq(TraversableOnce.scala:298) at scala.collection.AbstractIterator.toSeq(Iterator.scala:1336) at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:373) at com.sample.SparkSample.main(SparkSample.java:33) {code} createDataFrame method throws above exception. 
However, I observed that the createDataset method works fine with the code below. {code} Encoder<Category> encoder = Encoders.bean(Category.class); Dataset<Category> dframe = spark.createDataset(categoryList, encoder); dframe.show(); {code} was: As per latest spark documentation for Java at http://spark.apac
[jira] [Updated] (SPARK-17952) Java SparkSession createDataFrame method throws exception for nested JavaBeans
[ https://issues.apache.org/jira/browse/SPARK-17952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amit Baghel updated SPARK-17952: Description: As per the latest Spark documentation for Java at http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection, {quote} "Nested JavaBeans and List or Array fields are supported though". {quote} However, nested JavaBeans do not work. Please see the code below. SubCategory class {code} public class SubCategory implements Serializable { private String id; private String name; public String getId() { return id; } public void setId(String id) { this.id = id; } public String getName() { return name; } public void setName(String name) { this.name = name; } } {code} Category class {code} public class Category implements Serializable { private String id; private SubCategory subCategory; public String getId() { return id; } public void setId(String id) { this.id = id; } public SubCategory getSubCategory() { return subCategory; } public void setSubCategory(SubCategory subCategory) { this.subCategory = subCategory; } } {code} SparkSample class {code} public class SparkSample { public static void main(String[] args) throws IOException { SparkSession spark = SparkSession .builder() .appName("SparkSample") .master("local") .getOrCreate(); //SubCategory SubCategory sub = new SubCategory(); sub.setId("sc-111"); sub.setName("Sub-1"); //Category Category category = new Category(); category.setId("s-111"); category.setSubCategory(sub); //categoryList List<Category> categoryList = new ArrayList<>(); categoryList.add(category); //DF Dataset<Row> dframe = spark.createDataFrame(categoryList, Category.class); dframe.show(); } } {code} The above code throws the error below. 
{code} Exception in thread "main" scala.MatchError: com.sample.SubCategory@e7391d (of class com.sample.SubCategory) at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:256) at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:251) at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103) at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:403) at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106) at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1106) at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1104) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$class.toStream(Iterator.scala:1322) at scala.collection.AbstractIterator.toStream(Iterator.scala:1336) at scala.collection.TraversableOnce$class.toSeq(TraversableOnce.scala:298) at scala.collection.AbstractIterator.toSeq(Iterator.scala:1336) at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:373) at com.sample.SparkSample.main(SparkSample.java:33) {code} createDataFrame method throws above exception. 
However, I observed that the createDataset method works fine with the code below. {code} Encoder<Category> encoder = Encoders.bean(Category.class); Dataset<Category> dframe = spark.createDataset(categoryList, encoder); dframe.show(); {code} was: As per latest spark documentation for Java at http://spark.a
[jira] [Updated] (SPARK-17952) Java SparkSession createDataFrame method throws exception for nested JavaBeans
[ https://issues.apache.org/jira/browse/SPARK-17952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amit Baghel updated SPARK-17952: Description: As per the latest Spark documentation for Java at http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection, "Nested JavaBeans and List or Array fields are supported though". However, nested JavaBeans do not work. Please see the code below. SubCategory class {code} public class SubCategory implements Serializable { private String id; private String name; public String getId() { return id; } public void setId(String id) { this.id = id; } public String getName() { return name; } public void setName(String name) { this.name = name; } } {code} Category class {code} public class Category implements Serializable { private String id; private SubCategory subCategory; public String getId() { return id; } public void setId(String id) { this.id = id; } public SubCategory getSubCategory() { return subCategory; } public void setSubCategory(SubCategory subCategory) { this.subCategory = subCategory; } } {code} SparkSample class {code} public class SparkSample { public static void main(String[] args) throws IOException { SparkSession spark = SparkSession .builder() .appName("SparkSample") .master("local") .getOrCreate(); //SubCategory SubCategory sub = new SubCategory(); sub.setId("sc-111"); sub.setName("Sub-1"); //Category Category category = new Category(); category.setId("s-111"); category.setSubCategory(sub); //categoryList List<Category> categoryList = new ArrayList<>(); categoryList.add(category); //DF Dataset<Row> dframe = spark.createDataFrame(categoryList, Category.class); dframe.show(); } } {code} The above code throws the error below. 
{code} Exception in thread "main" scala.MatchError: com.sample.SubCategory@e7391d (of class com.sample.SubCategory) at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:256) at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:251) at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103) at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:403) at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106) at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1106) at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1104) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$class.toStream(Iterator.scala:1322) at scala.collection.AbstractIterator.toStream(Iterator.scala:1336) at scala.collection.TraversableOnce$class.toSeq(TraversableOnce.scala:298) at scala.collection.AbstractIterator.toSeq(Iterator.scala:1336) at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:373) at com.sample.SparkSample.main(SparkSample.java:33) {code} createDataFrame method throws above exception. 
However, I observed that the createDataset method works fine with the code below. {code} Encoder<Category> encoder = Encoders.bean(Category.class); Dataset<Category> dframe = spark.createDataset(categoryList, encoder); dframe.show(); {code} was: As per latest spark documentation for Java at http://spark.apache.org/docs/late
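The documented behavior — reflection-based schema inference that descends into nested bean properties — can be illustrated with a tiny pure-Python stand-in. This is not Spark's code; it only shows the recursion that nested-bean support requires, and whose absence in the createDataFrame path surfaces as the MatchError above. The class and function names are invented for the sketch.

```python
# Map "primitive" property types to type names; anything else is treated as a
# nested bean and recursed into, yielding a nested struct schema.
PRIMITIVES = {int: "integer", float: "double", str: "string", bool: "boolean"}

def infer_schema(cls):
    """Toy reflection-style schema inference over class annotations."""
    fields = {}
    for name, typ in getattr(cls, "__annotations__", {}).items():
        if typ in PRIMITIVES:
            fields[name] = PRIMITIVES[typ]
        else:
            fields[name] = infer_schema(typ)   # recurse into the nested "bean"
    return fields

class SubCategory:
    id: str
    name: str

class Category:
    id: str
    subCategory: SubCategory

print(infer_schema(Category))
# {'id': 'string', 'subCategory': {'id': 'string', 'name': 'string'}}
```

Without the recursive branch, the nested `SubCategory` value reaches the converter as an opaque object with no matching struct schema — which is essentially the failure mode the ticket reports.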
[jira] [Comment Edited] (SPARK-17957) Calling outer join and na.fill(0) and then inner join will miss rows
[ https://issues.apache.org/jira/browse/SPARK-17957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15579142#comment-15579142 ] Linbo edited comment on SPARK-17957 at 10/16/16 2:26 AM: - cc [~smilegator] and [~dongjoon] was (Author: linbojin): cc [~smilegator] > Calling outer join and na.fill(0) and then inner join will miss rows > > > Key: SPARK-17957 > URL: https://issues.apache.org/jira/browse/SPARK-17957 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: Spark 2.0.1, Mac, Local >Reporter: Linbo > Labels: joins, na.fill > > I reported a similar bug two months ago, and it was fixed in Spark 2.0.1: > https://issues.apache.org/jira/browse/SPARK-17060 But I have found a new bug: when > I insert a na.fill(0) call between the outer join and the inner join in the same > workflow as in SPARK-17060, I get a wrong result. > {code:title=spark-shell|borderStyle=solid} > scala> val a = Seq((1, 2), (2, 3)).toDF("a", "b") > a: org.apache.spark.sql.DataFrame = [a: int, b: int] > scala> val b = Seq((2, 5), (3, 4)).toDF("a", "c") > b: org.apache.spark.sql.DataFrame = [a: int, c: int] > scala> val ab = a.join(b, Seq("a"), "fullouter").na.fill(0) > ab: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field] > scala> ab.show > +---+---+---+ > | a| b| c| > +---+---+---+ > | 1| 2| 0| > | 3| 0| 4| > | 2| 3| 5| > +---+---+---+ > scala> val c = Seq((3, 1)).toDF("a", "d") > c: org.apache.spark.sql.DataFrame = [a: int, d: int] > scala> c.show > +---+---+ > | a| d| > +---+---+ > | 3| 1| > +---+---+ > scala> ab.join(c, "a").show > +---+---+---+---+ > | a| b| c| d| > +---+---+---+---+ > +---+---+---+---+ > {code} > And again, if I use persist, the result is correct. I think the problem is > the join optimizer, similar to this PR: https://github.com/apache/spark/pull/14661 > {code:title=spark-shell|borderStyle=solid} > scala> val ab = a.join(b, Seq("a"), "outer").na.fill(0).persist > ab: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int > ... 1 more field] > scala> ab.show > +---+---+---+ > | a| b| c| > +---+---+---+ > | 1| 2| 0| > | 3| 0| 4| > | 2| 3| 5| > +---+---+---+ > scala> ab.join(c, "a").show > +---+---+---+---+ > | a| b| c| d| > +---+---+---+---+ > | 3| 0| 4| 1| > +---+---+---+---+ > {code} >
[jira] [Commented] (SPARK-17957) Calling outer join and na.fill(0) and then inner join will miss rows
[ https://issues.apache.org/jira/browse/SPARK-17957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15579142#comment-15579142 ] Linbo commented on SPARK-17957: --- cc [~smilegator] > Calling outer join and na.fill(0) and then inner join will miss rows > > > Key: SPARK-17957 > URL: https://issues.apache.org/jira/browse/SPARK-17957 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: Spark 2.0.1, Mac, Local >Reporter: Linbo > Labels: joins, na.fill > > I reported a similar bug two months ago, and it was fixed in Spark 2.0.1: > https://issues.apache.org/jira/browse/SPARK-17060 But I have found a new bug: when > I insert a na.fill(0) call between the outer join and the inner join in the same > workflow as in SPARK-17060, I get a wrong result. > {code:title=spark-shell|borderStyle=solid} > scala> val a = Seq((1, 2), (2, 3)).toDF("a", "b") > a: org.apache.spark.sql.DataFrame = [a: int, b: int] > scala> val b = Seq((2, 5), (3, 4)).toDF("a", "c") > b: org.apache.spark.sql.DataFrame = [a: int, c: int] > scala> val ab = a.join(b, Seq("a"), "fullouter").na.fill(0) > ab: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field] > scala> ab.show > +---+---+---+ > | a| b| c| > +---+---+---+ > | 1| 2| 0| > | 3| 0| 4| > | 2| 3| 5| > +---+---+---+ > scala> val c = Seq((3, 1)).toDF("a", "d") > c: org.apache.spark.sql.DataFrame = [a: int, d: int] > scala> c.show > +---+---+ > | a| d| > +---+---+ > | 3| 1| > +---+---+ > scala> ab.join(c, "a").show > +---+---+---+---+ > | a| b| c| d| > +---+---+---+---+ > +---+---+---+---+ > {code} > And again, if I use persist, the result is correct. I think the problem is > the join optimizer, similar to this PR: https://github.com/apache/spark/pull/14661 > {code:title=spark-shell|borderStyle=solid} > scala> val ab = a.join(b, Seq("a"), "outer").na.fill(0).persist > ab: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int > ... 1 more field] > scala> ab.show > +---+---+---+ > | a| b| c| > +---+---+---+ > | 1| 2| 0| > | 3| 0| 4| > | 2| 3| 5| > +---+---+---+ > scala> ab.join(c, "a").show > +---+---+---+---+ > | a| b| c| d| > +---+---+---+---+ > | 3| 0| 4| 1| > +---+---+---+---+ > {code} >
[jira] [Updated] (SPARK-17957) Calling outer join and na.fill(0) and then inner join will miss rows
[ https://issues.apache.org/jira/browse/SPARK-17957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Linbo updated SPARK-17957: -- Description: I reported a similar bug two months ago, and it was fixed in Spark 2.0.1: https://issues.apache.org/jira/browse/SPARK-17060 But I have found a new bug: when I insert a na.fill(0) call between the outer join and the inner join in the same workflow as in SPARK-17060, I get a wrong result. {code:title=spark-shell|borderStyle=solid} scala> val a = Seq((1, 2), (2, 3)).toDF("a", "b") a: org.apache.spark.sql.DataFrame = [a: int, b: int] scala> val b = Seq((2, 5), (3, 4)).toDF("a", "c") b: org.apache.spark.sql.DataFrame = [a: int, c: int] scala> val ab = a.join(b, Seq("a"), "fullouter").na.fill(0) ab: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field] scala> ab.show +---+---+---+ | a| b| c| +---+---+---+ | 1| 2| 0| | 3| 0| 4| | 2| 3| 5| +---+---+---+ scala> val c = Seq((3, 1)).toDF("a", "d") c: org.apache.spark.sql.DataFrame = [a: int, d: int] scala> c.show +---+---+ | a| d| +---+---+ | 3| 1| +---+---+ scala> ab.join(c, "a").show +---+---+---+---+ | a| b| c| d| +---+---+---+---+ +---+---+---+---+ {code} And again, if I use persist, the result is correct. I think the problem is the join optimizer, similar to this PR: https://github.com/apache/spark/pull/14661 {code:title=spark-shell|borderStyle=solid} scala> val ab = a.join(b, Seq("a"), "outer").na.fill(0).persist ab: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int ... 1 more field] scala> ab.show +---+---+---+ | a| b| c| +---+---+---+ | 1| 2| 0| | 3| 0| 4| | 2| 3| 5| +---+---+---+ scala> ab.join(c, "a").show +---+---+---+---+ | a| b| c| d| +---+---+---+---+ | 3| 0| 4| 1| +---+---+---+---+ {code} was: I reported a similar bug two months ago and it's fixed in Spark 2.0.1: https://issues.apache.org/jira/browse/SPARK-17060 But I find a new bug: when I insert a na.fill(0) call between outer join and inner join in the same workflow in SPARK-17060 I get wrong result. {code:title=spark-shell|borderStyle=solid} scala> val a = Seq((1, 2), (2, 3)).toDF("a", "b") a: org.apache.spark.sql.DataFrame = [a: int, b: int] scala> val b = Seq((2, 5), (3, 4)).toDF("a", "c") b: org.apache.spark.sql.DataFrame = [a: int, c: int] scala> val ab = a.join(b, Seq("a"), "fullouter").na.fill(0) ab: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field] scala> ab.show +---+---+---+ | a| b| c| +---+---+---+ | 1| 2| 0| | 3| 0| 4| | 2| 3| 5| +---+---+---+ scala> val c = Seq((3, 1)).toDF("a", "d") c: org.apache.spark.sql.DataFrame = [a: int, d: int] scala> c.show +---+---+ | a| d| +---+---+ | 3| 1| +---+---+ scala> ab.join(c, "a").show +---+---+---+---+ | a| b| c| d| +---+---+---+---+ +---+---+---+---+ scala> val ab = a.join(b, Seq("a"), "outer").na.fill(0) ab: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field] scala> ab.join(c, "a").show +---+---+---+---+ | a| b| c| d| +---+---+---+---+ +---+---+---+---+ {code} And again if i use persist, the result is correct. I think the problem is join optimizer similar to this pr: https://github.com/apache/spark/pull/14661 {code:title=spark-shell|borderStyle=solid} scala> val ab = a.join(b, Seq("a"), "outer").na.fill(0).persist ab: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int ... 1 more field] scala> ab.show +---+---+---+ | a| b| c| +---+---+---+ | 1| 2| 0| | 3| 0| 4| | 2| 3| 5| +---+---+---+ scala> ab.join(c, "a").show +---+---+---+---+ | a| b| c| d| +---+---+---+---+ | 3| 0| 4| 1| +---+---+---+---+ {code} > Calling outer join and na.fill(0) and then inner join will miss rows > > > Key: SPARK-17957 > URL: https://issues.apache.org/jira/browse/SPARK-17957 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: Spark 2.0.1, Mac, Local >Reporter: Linbo > Labels: joins, na.fill > > I reported a similar bug two months ago, and it was fixed in Spark 2.0.1: > https://issues.apache.org/jira/browse/SPARK-17060 But I have found a new bug: when > I insert a na.fill(0) call between the outer join and the inner join in the same > workflow as in SPARK-17060, I get a wrong result. > {code:title=spark-shell|borderStyle=solid} > scala> val a = Seq((1, 2), (2, 3)).toDF("a", "b") > a: org.apache.spark.sql.DataFrame = [a: int, b: int] > scala> val b = Seq((2, 5), (3, 4)).toDF("a", "c") > b: org.apache.spark.sql.DataFrame = [a: int, c: int] > scala> val ab = a.join(b, Seq("a"), "fullouter").na.fill(0) > ab: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field] > scala> ab.show > +---+---+---+ > | a| b| c| > +---+---+---+ > | 1| 2| 0| > | 3| 0| 4| > | 2| 3| 5| > +---+---+---+ > scala>
[jira] [Created] (SPARK-17957) Calling outer join and na.fill(0) and then inner join will miss rows
Linbo created SPARK-17957: - Summary: Calling outer join and na.fill(0) and then inner join will miss rows Key: SPARK-17957 URL: https://issues.apache.org/jira/browse/SPARK-17957 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.1 Environment: Spark 2.0.1, Mac, Local Reporter: Linbo I reported a similar bug two months ago, and it was fixed in Spark 2.0.1: https://issues.apache.org/jira/browse/SPARK-17060 But I have found a new bug: when I insert a na.fill(0) call between the outer join and the inner join in the same workflow as in SPARK-17060, I get a wrong result. {code:title=spark-shell|borderStyle=solid} scala> val a = Seq((1, 2), (2, 3)).toDF("a", "b") a: org.apache.spark.sql.DataFrame = [a: int, b: int] scala> val b = Seq((2, 5), (3, 4)).toDF("a", "c") b: org.apache.spark.sql.DataFrame = [a: int, c: int] scala> val ab = a.join(b, Seq("a"), "fullouter").na.fill(0) ab: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field] scala> ab.show +---+---+---+ | a| b| c| +---+---+---+ | 1| 2| 0| | 3| 0| 4| | 2| 3| 5| +---+---+---+ scala> val c = Seq((3, 1)).toDF("a", "d") c: org.apache.spark.sql.DataFrame = [a: int, d: int] scala> c.show +---+---+ | a| d| +---+---+ | 3| 1| +---+---+ scala> ab.join(c, "a").show +---+---+---+---+ | a| b| c| d| +---+---+---+---+ +---+---+---+---+ scala> val ab = a.join(b, Seq("a"), "outer").na.fill(0) ab: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field] scala> ab.join(c, "a").show +---+---+---+---+ | a| b| c| d| +---+---+---+---+ +---+---+---+---+ {code} And again, if I use persist, the result is correct. I think the problem is the join optimizer, similar to this PR: https://github.com/apache/spark/pull/14661 {code:title=spark-shell|borderStyle=solid} scala> val ab = a.join(b, Seq("a"), "outer").na.fill(0).persist ab: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int ... 1 more field] scala> ab.show +---+---+---+ | a| b| c| +---+---+---+ | 1| 2| 0| | 3| 0| 4| | 2| 3| 5| +---+---+---+ scala> ab.join(c, "a").show +---+---+---+---+ | a| b| c| d| +---+---+---+---+ | 3| 0| 4| 1| +---+---+---+---+ {code}
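The semantics that the reported workflow should have — full outer join on "a", fill nulls with 0, then inner join with c — can be written as a tiny pure-Python reference model using the ticket's own data. This is an illustration of the expected result (the row a=3, b=0, c=4, d=1), not Spark code; the helper names are invented for the sketch.

```python
def full_outer_join(left, right, key):
    """Full outer join of two lists of row-dicts on `key`; missing columns are absent."""
    keys = {r[key] for r in left} | {r[key] for r in right}
    lidx = {r[key]: r for r in left}
    ridx = {r[key]: r for r in right}
    return [{**{key: k},
             **{c: v for r in (lidx.get(k), ridx.get(k)) if r
                for c, v in r.items() if c != key}}
            for k in sorted(keys)]

def fill(rows, columns, value=0):
    """na.fill analogue: absent columns become `value`."""
    return [{c: r.get(c, value) for c in columns} for r in rows]

def inner_join(left, right, key):
    ridx = {r[key]: r for r in right}
    return [{**l, **{c: v for c, v in ridx[l[key]].items() if c != key}}
            for l in left if l[key] in ridx]

a = [{"a": 1, "b": 2}, {"a": 2, "b": 3}]
b = [{"a": 2, "c": 5}, {"a": 3, "c": 4}]
c = [{"a": 3, "d": 1}]

ab = fill(full_outer_join(a, b, "a"), ["a", "b", "c"])
print(inner_join(ab, c, "a"))   # [{'a': 3, 'b': 0, 'c': 4, 'd': 1}]
```

The filled b=0 for a=3 must survive into the inner join; the bug report is precisely that Spark 2.0.1 instead returned an empty result unless the intermediate DataFrame was persisted.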
[jira] [Updated] (SPARK-17637) Packed scheduling for Spark tasks across executors
[ https://issues.apache.org/jira/browse/SPARK-17637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan updated SPARK-17637: Affects Version/s: 2.1.0 > Packed scheduling for Spark tasks across executors > -- > > Key: SPARK-17637 > URL: https://issues.apache.org/jira/browse/SPARK-17637 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Zhan Zhang >Assignee: Zhan Zhang >Priority: Minor > Fix For: 2.1.0 > > > Currently the Spark scheduler implements round-robin scheduling of tasks to > executors, which is great as it distributes the load evenly across the > cluster; but it leads to significant resource waste in some cases, > especially when dynamic allocation is enabled.
[jira] [Resolved] (SPARK-17637) Packed scheduling for Spark tasks across executors
[ https://issues.apache.org/jira/browse/SPARK-17637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan resolved SPARK-17637. - Resolution: Fixed Assignee: Zhan Zhang Target Version/s: 2.1.0 > Packed scheduling for Spark tasks across executors > -- > > Key: SPARK-17637 > URL: https://issues.apache.org/jira/browse/SPARK-17637 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Zhan Zhang >Assignee: Zhan Zhang >Priority: Minor > Fix For: 2.1.0 > > > Currently the Spark scheduler implements round-robin scheduling of tasks to > executors, which is great as it distributes the load evenly across the > cluster; but it leads to significant resource waste in some cases, > especially when dynamic allocation is enabled.
[jira] [Updated] (SPARK-17637) Packed scheduling for Spark tasks across executors
[ https://issues.apache.org/jira/browse/SPARK-17637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan updated SPARK-17637: Fix Version/s: 2.1.0 > Packed scheduling for Spark tasks across executors > -- > > Key: SPARK-17637 > URL: https://issues.apache.org/jira/browse/SPARK-17637 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Reporter: Zhan Zhang >Priority: Minor > Fix For: 2.1.0 > > > Currently the Spark scheduler implements round-robin scheduling of tasks to > executors, which is great as it distributes the load evenly across the > cluster; but it leads to significant resource waste in some cases, > especially when dynamic allocation is enabled.
[jira] [Comment Edited] (SPARK-17951) BlockFetch with multiple threads slows down after spark 1.6
[ https://issues.apache.org/jira/browse/SPARK-17951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15578987#comment-15578987 ] ding edited comment on SPARK-17951 at 10/16/16 12:22 AM: - I tried to call rdd.collect which internally called bm.getRemoteBytes in multiple threads. The data shows difference of time spend on block fetch is also around 0.1s between spark 1.5.1 and spark 1.6.2. And it can be consistently repro. In our scenario, we called getRemoteBytes twice. If we use spark 1.6.2 it will bring additional 0.2s overhead and cause around 5 decrease of throughput. we would like to reduce the overhead if possible. was (Author: ding): I tried to call rdd.collect which internally called bm.getRemoteBytes in multiple threads. The data shows difference of time spend on block fetch is also around 0.1s between spark 1.5.1 and spark 1.6.2. And it can be consistently repro. In our scenario, we called getRemoteBytes twice. If we use spark 1.6.2 it will bring additional 0.2s overhead and cause around 5 decrease of throughput. we would like to reduce the overhead as much as possible. > BlockFetch with multiple threads slows down after spark 1.6 > --- > > Key: SPARK-17951 > URL: https://issues.apache.org/jira/browse/SPARK-17951 > Project: Spark > Issue Type: Bug > Components: Block Manager, Spark Core >Affects Versions: 1.6.2 > Environment: cluster with 8 node, each node has 28 cores. 
10Gb network >Reporter: ding > > The following code demonstrates the issue: > {code} > def main(args: Array[String]): Unit = { > val conf = new SparkConf().setAppName(s"BMTest") > val size = 3344570 > val sc = new SparkContext(conf) > val data = sc.parallelize(1 to 100, 8) > var accum = sc.accumulator(0.0, "get remote bytes") > var i = 0 > while(i < 91) { > accum = sc.accumulator(0.0, "get remote bytes") > val test = data.mapPartitionsWithIndex { (pid, iter) => > val N = size > val bm = SparkEnv.get.blockManager > val blockId = TaskResultBlockId(10*i + pid) > val test = new Array[Byte](N) > Random.nextBytes(test) > val buffer = ByteBuffer.allocate(N) > buffer.limit(N) > buffer.put(test) > bm.putBytes(blockId, buffer, StorageLevel.MEMORY_ONLY_SER) > Iterator(1) > }.count() > > data.mapPartitionsWithIndex { (pid, iter) => > val before = System.nanoTime() > > val bm = SparkEnv.get.blockManager > (0 to 7).map(s => { > Future { > val result = bm.getRemoteBytes(TaskResultBlockId(10*i + s)) > } > }).map(Await.result(_, Duration.Inf)) > > accum.add((System.nanoTime() - before) / 1e9) > Iterator(1) > }.count() > println("get remote bytes take: " + accum.value/8) > i += 1 > } > } > {code} > In Spark 1.6.2, the average "get remote bytes" time is 0.19 s, while > in Spark 1.5.1 the average "get remote bytes" time is 0.09 s. > However, if blocks are fetched in a single thread, the gap is much smaller: > Spark 1.6.2 get remote bytes: 0.21 s > Spark 1.5.1 get remote bytes: 0.20 s -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17951) BlockFetch with multiple threads slows down after spark 1.6
[ https://issues.apache.org/jira/browse/SPARK-17951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15578987#comment-15578987 ] ding commented on SPARK-17951: -- I tried to call rdd.collect, which internally calls bm.getRemoteBytes in multiple threads. The data shows the difference in time spent on block fetch is also around 0.1s between Spark 1.5.1 and Spark 1.6.2, and it can be reproduced consistently. In our scenario, we call getRemoteBytes twice. With Spark 1.6.2 this brings an additional 0.2s of overhead and causes a throughput decrease of around 5%. We would like to reduce the overhead as much as possible. > BlockFetch with multiple threads slows down after spark 1.6 > --- > > Key: SPARK-17951 > URL: https://issues.apache.org/jira/browse/SPARK-17951 > Project: Spark > Issue Type: Bug > Components: Block Manager, Spark Core >Affects Versions: 1.6.2 > Environment: cluster with 8 nodes, each with 28 cores. 10Gb network >Reporter: ding > > The following code demonstrates the issue: > {code} > def main(args: Array[String]): Unit = { > val conf = new SparkConf().setAppName(s"BMTest") > val size = 3344570 > val sc = new SparkContext(conf) > val data = sc.parallelize(1 to 100, 8) > var accum = sc.accumulator(0.0, "get remote bytes") > var i = 0 > while(i < 91) { > accum = sc.accumulator(0.0, "get remote bytes") > val test = data.mapPartitionsWithIndex { (pid, iter) => > val N = size > val bm = SparkEnv.get.blockManager > val blockId = TaskResultBlockId(10*i + pid) > val test = new Array[Byte](N) > Random.nextBytes(test) > val buffer = ByteBuffer.allocate(N) > buffer.limit(N) > buffer.put(test) > bm.putBytes(blockId, buffer, StorageLevel.MEMORY_ONLY_SER) > Iterator(1) > }.count() > > data.mapPartitionsWithIndex { (pid, iter) => > val before = System.nanoTime() > > val bm = SparkEnv.get.blockManager > (0 to 7).map(s => { > Future { > val result = bm.getRemoteBytes(TaskResultBlockId(10*i + s)) > } > }).map(Await.result(_, Duration.Inf)) > 
> accum.add((System.nanoTime() - before) / 1e9) > Iterator(1) > }.count() > println("get remote bytes take: " + accum.value/8) > i += 1 > } > } > {code} > In Spark 1.6.2, the average "get remote bytes" time is 0.19 s, while > in Spark 1.5.1 the average "get remote bytes" time is 0.09 s. > However, if blocks are fetched in a single thread, the gap is much smaller: > Spark 1.6.2 get remote bytes: 0.21 s > Spark 1.5.1 get remote bytes: 0.20 s -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-650) Add a "setup hook" API for running initialization code on each executor
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15578914#comment-15578914 ] Lars Francke edited comment on SPARK-650 at 10/15/16 11:22 PM: --- I can only come up with three reasons at the moment. I hope they all make sense. 1) Singletons/Static Initialisers run once per Class Loader where this class is being loaded/used. I haven't actually seen this being a problem (and it might actually be desired behaviour in this case) but making the init step explicit would prevent this from ever becoming one. 2) I'd like to fail fast for some things and not upon first access (which might be behind a conditional somewhere) 3) It is hard enough to reason about where some piece of code is running as it is (Driver or Task/Executor), adding Singletons to the mix makes it even more confusing. Thank you for reopening! was (Author: lars_francke): I can only come up with three reasons at the moment. I hope they all make sense. 1) Singletons/Static Initialisers run once per Class Loader where this class is being loaded/used. I haven't actually seen this being a problem (and it might actually be desired behaviour in this case) but making the init step explicit would prevent this from ever becoming one. 2) I'd like to fail fast for some things and not upon first access (which might be behind a conditional somewhere) 3) It is hard enough to reason about where some piece of code is running as it is (Driver or Task/Executor), adding Singletons to the mix makes it even more confusing. 
> Add a "setup hook" API for running initialization code on each executor > --- > > Key: SPARK-650 > URL: https://issues.apache.org/jira/browse/SPARK-650 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Matei Zaharia >Priority: Minor > > Would be useful to configure things like reporting libraries -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15578914#comment-15578914 ] Lars Francke commented on SPARK-650: I can only come up with three reasons at the moment. I hope they all make sense. 1) Singletons/Static Initialisers run once per Class Loader where this class is being loaded/used. I haven't actually seen this being a problem (and it might actually be desired behaviour in this case) but making the init step explicit would prevent this from ever becoming one. 2) I'd like to fail fast for some things and not upon first access (which might be behind a conditional somewhere) 3) It is hard enough to reason about where some piece of code is running as it is (Driver or Task/Executor), adding Singletons to the mix makes it even more confusing. > Add a "setup hook" API for running initialization code on each executor > --- > > Key: SPARK-650 > URL: https://issues.apache.org/jira/browse/SPARK-650 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Matei Zaharia >Priority: Minor > > Would be useful to configure things like reporting libraries -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)
[ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15578894#comment-15578894 ] Cody Koeninger commented on SPARK-17812: As you just said yourself, assign doesn't mean you necessarily know the exact starting offsets you want. > More granular control of starting offsets (assign) > -- > > Key: SPARK-17812 > URL: https://issues.apache.org/jira/browse/SPARK-17812 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Assignee: Cody Koeninger > > Right now you can only run a Streaming Query starting from either the > earliest or latest offsets available at the moment the query is started. > Sometimes this is a lot of data. It would be nice to be able to do the > following: > - seek to user-specified offsets for manually specified topicpartitions > currently agreed on plan: > Mutually exclusive subscription options (only assign is new to this ticket) > {noformat} > .option("subscribe","topicFoo,topicBar") > .option("subscribePattern","topic.*") > .option("assign","""{"topicfoo": [0, 1],"topicbar": [0, 1]}""") > {noformat} > where assign can only be specified that way, no inline offsets > Single starting position option with three mutually exclusive types of value > {noformat} > .option("startingOffsets", "earliest" | "latest" | """{"topicFoo": {"0": > 1234, "1": -2}, "topicBar":{"0": -1}}""") > {noformat} > startingOffsets with json fails if any topicpartition in the assignments > doesn't have an offset. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
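Read end to end, the plan quoted in the ticket could be exercised roughly as follows. This is a sketch only, assuming the proposed option names ship as written; the broker address and topic names are placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("assign-sketch").getOrCreate()

// Mutually exclusive subscription options: pick exactly one of
// "subscribe", "subscribePattern", or "assign".
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092") // placeholder host
  .option("assign", """{"topicfoo": [0, 1], "topicbar": [0, 1]}""")
  // Offsets come only from startingOffsets; per the plan, the JSON form
  // must cover every assigned topic-partition (-2 = earliest, -1 = latest).
  .option("startingOffsets",
    """{"topicfoo": {"0": 1234, "1": -2}, "topicbar": {"0": -1, "1": -1}}""")
  .load()
```

Under this plan, omitting an offset for any assigned topic-partition in the JSON would fail fast rather than silently defaulting.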
[jira] [Reopened] (SPARK-650) Add a "setup hook" API for running initialization code on each executor
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-650: - As you wish, but I disagree with this type of reasoning about JIRAs. I don't think anyone has addressed why a singleton isn't the answer. I can think of corner cases, but that's why I suspect it isn't something that has needed implementing. > Add a "setup hook" API for running initialization code on each executor > --- > > Key: SPARK-650 > URL: https://issues.apache.org/jira/browse/SPARK-650 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Matei Zaharia >Priority: Minor > > Would be useful to configure things like reporting libraries -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15578875#comment-15578875 ] Lars Francke commented on SPARK-650: I also have to disagree with this being a duplicate or obsolete. [~oarmand] and [~Skamandros] already mentioned reasons regarding the duplication. About it being obsolete: I have seen multiple clients facing this problem, finding this issue and hoping it'd get fixed some day. I would hazard a guess and say that most _users_ of Spark have no JIRA account here and do not register or log in just to vote for this issue. That said: This issue is (with six votes) in the top 150 out of almost 17k total issues in the Spark project. As it happens, this is a non-trivial thing to implement in Spark (as far as I can tell from my limited knowledge of the inner workings), so it's pretty hard for a "drive by" contributor to help here. You had the discussion about community perception on the mailing list (re: Spark Improvement Proposals), and this issue happens to be one of those that at least I see popping up every once in a while in discussions with clients. I would love to see this issue stay open as a feature request, and I have some hope that someone someday will implement it. > Add a "setup hook" API for running initialization code on each executor > --- > > Key: SPARK-650 > URL: https://issues.apache.org/jira/browse/SPARK-650 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Matei Zaharia >Priority: Minor > > Would be useful to configure things like reporting libraries -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-17865) R API for global temp view
[ https://issues.apache.org/jira/browse/SPARK-17865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-17865. --- Resolution: Not A Problem > R API for global temp view > -- > > Key: SPARK-17865 > URL: https://issues.apache.org/jira/browse/SPARK-17865 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > We need to add the R API for managing global temp views, mirroring the > changes in SPARK-17338. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17865) R API for global temp view
[ https://issues.apache.org/jira/browse/SPARK-17865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15578835#comment-15578835 ] Reynold Xin commented on SPARK-17865: - Alright in this case I don't think we need R API for now. > R API for global temp view > -- > > Key: SPARK-17865 > URL: https://issues.apache.org/jira/browse/SPARK-17865 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > We need to add the R API for managing global temp views, mirroring the > changes in SPARK-17338. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15578723#comment-15578723 ] Saikat Kanjilal commented on SPARK-9487: Synced the code; I am first familiarizing myself with how to run unit tests and work in the code, [~srowen] [~holdenk]. Next steps will be to run the unit tests and report the results here; stay tuned. > Use the same num. worker threads in Scala/Python unit tests > --- > > Key: SPARK-9487 > URL: https://issues.apache.org/jira/browse/SPARK-9487 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core, SQL, Tests >Affects Versions: 1.5.0 >Reporter: Xiangrui Meng > Labels: starter > > In Python we use `local[4]` for unit tests, while in Scala/Java we use > `local[2]` and `local` for some unit tests in SQL, MLlib, and other > components. If the operation depends on partition IDs, e.g., a random number > generator, this will lead to different results in Python and Scala/Java. It > would be nice to use the same number in all unit tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
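The mismatch described in the ticket comes down to the `master` string used when tests create their context. One way to align Scala/Java tests on the same parallelism as PySpark's `local[4]` could look like this; the object and constant names here are illustrative, not existing Spark code:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical shared constant so all JVM-side tests agree on the
// number of local worker threads (matching PySpark's local[4]).
object TestParallelism {
  val NumWorkerThreads = 4
}

val conf = new SparkConf()
  .setMaster(s"local[${TestParallelism.NumWorkerThreads}]")
  .setAppName("unit-test")
val sc = new SparkContext(conf)
```

With a single constant, operations that depend on partition IDs (e.g. seeded random number generators) would see the same default partitioning in both language suites.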
[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)
[ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15578699#comment-15578699 ] Ofir Manor commented on SPARK-17812: I think Michael suggests that if you use {{startingOffsets}} without using {{assign}}, that could work (use the topic-partition list from startingOffsets), and it would simplify user code (no need to specify two similar lists, simpler resume, etc.). You could keep an explicit {{assign}} for the rarer cases where someone wants to specify a list of topic-partitions but also earliest / latest / timestamp. > More granular control of starting offsets (assign) > -- > > Key: SPARK-17812 > URL: https://issues.apache.org/jira/browse/SPARK-17812 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Assignee: Cody Koeninger > > Right now you can only run a Streaming Query starting from either the > earliest or latest offsets available at the moment the query is started. > Sometimes this is a lot of data. It would be nice to be able to do the > following: > - seek to user-specified offsets for manually specified topicpartitions > currently agreed on plan: > Mutually exclusive subscription options (only assign is new to this ticket) > {noformat} > .option("subscribe","topicFoo,topicBar") > .option("subscribePattern","topic.*") > .option("assign","""{"topicfoo": [0, 1],"topicbar": [0, 1]}""") > {noformat} > where assign can only be specified that way, no inline offsets > Single starting position option with three mutually exclusive types of value > {noformat} > .option("startingOffsets", "earliest" | "latest" | """{"topicFoo": {"0": > 1234, "1": -2}, "topicBar":{"0": -1}}""") > {noformat} > startingOffsets with json fails if any topicpartition in the assignments > doesn't have an offset. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17865) R API for global temp view
[ https://issues.apache.org/jira/browse/SPARK-17865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15578639#comment-15578639 ] Felix Cheung commented on SPARK-17865: -- I see. So then we could either omit the support for global session/SharedState in R since there is only one, or, alternatively, build support for it. For the latter, I think a simple approach would be to add a flag, sparkR.session.stop(keep.global.session = TRUE), in which case the SparkContext and global session would not be stopped and would be reused when a new session is initialized. With that we could then add methods for global temp views and so on. Do we believe there will be more features around the global session in the future? If so, there might be good reason to support this in R. > R API for global temp view > -- > > Key: SPARK-17865 > URL: https://issues.apache.org/jira/browse/SPARK-17865 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > We need to add the R API for managing global temp views, mirroring the > changes in SPARK-17338. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17878) Support for multiple null values when reading CSV data
[ https://issues.apache.org/jira/browse/SPARK-17878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15578607#comment-15578607 ] Hossein Falaki commented on SPARK-17878: I think moving it to another ticket is a good idea. One concern is that the API would be different between Scala and DDL. > Support for multiple null values when reading CSV data > -- > > Key: SPARK-17878 > URL: https://issues.apache.org/jira/browse/SPARK-17878 > Project: Spark > Issue Type: Story > Components: SQL >Affects Versions: 2.0.1 >Reporter: Hossein Falaki > > There are CSV files out there with multiple values that are supposed to be > interpreted as null. As a result, multiple Spark users have asked for this > feature to be built into the CSV data source. It can be easily implemented in a > backwards-compatible way: > - Currently the CSV data source supports an option named {{nullValue}}. > - We can add logic in {{CSVOptions}} to understand option names that match > {{nullValue[\d]}}. This way the user can specify a query with one or several > null values. > {code} > val df = spark.read.format("CSV").option("nullValue1", > "-").option("nullValue2", "*") > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
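Until something like the proposed {{nullValue[\d]}} options exists, extra sentinels can be mapped to null after the read. A hedged sketch of that workaround using only the current single-{{nullValue}} API (the sentinel strings "-" and "*" and the file path are placeholders):

```scala
import org.apache.spark.sql.functions.{col, lit, when}

// First sentinel ("-") handled natively by the existing nullValue option.
val df = spark.read
  .option("header", "true")
  .option("nullValue", "-")
  .csv("data.csv") // placeholder path

// Remaining sentinels (here just "*") mapped to null column by column.
val cleaned = df.columns.foldLeft(df) { (d, c) =>
  d.withColumn(c, when(col(c) === "*", lit(null)).otherwise(col(c)))
}
```

This keeps the read path backwards compatible, at the cost of an extra projection over every column.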
[jira] [Commented] (SPARK-17954) FetchFailedException executor cannot connect to another worker executor
[ https://issues.apache.org/jira/browse/SPARK-17954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15578542#comment-15578542 ] Sean Owen commented on SPARK-17954: --- Lots of things changed. I think you first need to confirm connectivity to that hostname from the host in question, to rule out more basic environment problems. It may be a domain name vs. IP problem. > FetchFailedException executor cannot connect to another worker executor > --- > > Key: SPARK-17954 > URL: https://issues.apache.org/jira/browse/SPARK-17954 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0, 2.0.1 >Reporter: Vitaly Gerasimov > > I have a standalone-mode Spark cluster which has three nodes: > master.test > worker1.test > worker2.test > I am trying to run the following code in the Spark shell: > {code} > val json = spark.read.json("hdfs://master.test/json/a.js.gz", > "hdfs://master.test/json/b.js.gz") > json.createOrReplaceTempView("messages") > spark.sql("select count(*) from messages").show() > {code} > and I am getting the following exception: > {code} > org.apache.spark.shuffle.FetchFailedException: Failed to connect to > worker1.test/x.x.x.x:51029 > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Failed to connect to > worker1.test/x.x.x.x:51029 > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228) > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179) > at > org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:96) > at > 
org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > ... 3 more > Caused by: java.net.ConnectException: Connection refused: > worker1.test/x.x.x.x:51029 > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > at > sun.nio.ch.SocketChannelImpl.finishConne
[jira] [Commented] (SPARK-17954) FetchFailedException executor cannot connect to another worker executor
[ https://issues.apache.org/jira/browse/SPARK-17954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15578539#comment-15578539 ] Vitaly Gerasimov commented on SPARK-17954: -- I don't think so. Spark 1.6 works fine in this case. Maybe something changed in the Spark 2.0 configuration? > FetchFailedException executor cannot connect to another worker executor > --- > > Key: SPARK-17954 > URL: https://issues.apache.org/jira/browse/SPARK-17954 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0, 2.0.1 >Reporter: Vitaly Gerasimov > > I have a standalone-mode Spark cluster which has three nodes: > master.test > worker1.test > worker2.test > I am trying to run the following code in the Spark shell: > {code} > val json = spark.read.json("hdfs://master.test/json/a.js.gz", > "hdfs://master.test/json/b.js.gz") > json.createOrReplaceTempView("messages") > spark.sql("select count(*) from messages").show() > {code} > and I am getting the following exception: > {code} > org.apache.spark.shuffle.FetchFailedException: Failed to connect to > worker1.test/x.x.x.x:51029 > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Failed to connect to > worker1.test/x.x.x.x:51029 > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228) > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179) > at > org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:96) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140) > at > 
org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > ... 3 more > Caused by: java.net.ConnectException: Connection refused: > worker1.test/x.x.x.x:51029 > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) > at > io.netty.channel.socke
[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15578535#comment-15578535 ] Sean Owen commented on SPARK-650: - If you need init to happen ASAP when the driver starts, isn't any similar mechanism going to be about the same in this regard? This cost is paid just once, and I don't think in general startup is very low latency for any Spark app. > Add a "setup hook" API for running initialization code on each executor > --- > > Key: SPARK-650 > URL: https://issues.apache.org/jira/browse/SPARK-650 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Matei Zaharia >Priority: Minor > > Would be useful to configure things like reporting libraries -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15578487#comment-15578487 ] Olivier Armand commented on SPARK-650: -- Sean, a singleton is not the best option in our case. The Spark Streaming executors are writing to HBase, so we need to initialize the HBase connection. The singleton seems (or seemed when we tested it for our customer a few months after this issue was raised) to be created when the first RDD is processed by the executor, and not when the driver starts. This imposes a very high processing time on the first events. > Add a "setup hook" API for running initialization code on each executor > --- > > Key: SPARK-650 > URL: https://issues.apache.org/jira/browse/SPARK-650 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Matei Zaharia >Priority: Minor > > Would be useful to configure things like reporting libraries -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15578378#comment-15578378 ] Sean Owen commented on SPARK-650: - Sorry, I mean the _status_ doesn't matter. Most issues this old are obsolete or de facto won't-fix, so resolving it or not doesn't matter. I would even say this is 'not a problem', because a simple singleton provides once-per-executor execution of whatever you like. It's more complex to build a custom mechanism that makes you route this via Spark. That's probably why this hasn't proved necessary.
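The singleton approach Sean describes can be sketched in plain Scala; the names below are illustrative, and the init body stands in for whatever per-JVM setup (logging, connection pools, correlation IDs) an application needs:

```scala
// Minimal sketch of the once-per-executor singleton pattern (illustrative
// names). An object's `lazy val` initializer runs at most once per JVM --
// i.e. at most once per executor -- the first time a task references it,
// and Scala's lazy-val semantics make the initialization thread-safe.
object ExecutorSetup {
  lazy val done: Boolean = {
    // Replace with real setup: configure a logging/reporting library,
    // open an HBase connection, set correlation IDs, etc.
    println("one-time executor init")
    true
  }
}

// In Spark, any task closure can trigger it, e.g.:
//   rdd.foreachPartition { _ => ExecutorSetup.done; /* process partition */ }
// Note the limitation Olivier raises below: this runs with the first task,
// not when the executor JVM starts.
```

Referencing `ExecutorSetup.done` from several tasks on the same executor still runs the block only once; that is exactly why it substitutes for a setup hook in most cases, and also why it cannot help when the work must happen before the first task arrives.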
[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15578361#comment-15578361 ] Michael Schmeißer commented on SPARK-650: - Then somebody should please explain to me how this doesn't matter, or rather how certain use cases are supposed to be solved. We need to initialize each JVM and connect it to our logging system, set correlation IDs, initialize contexts and so on. I guess that most users have just implemented work-arounds as we did, but in an enterprise environment, this is really not a preferable long-term solution to me. Plus, I think that it would really not be hard to implement this feature for someone who has knowledge of the Spark executor setup.
[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15578338#comment-15578338 ] Sean Owen commented on SPARK-650: - In practice, these should probably all be WontFix as it hasn't mattered enough to implement in almost 4 years. It really doesn't matter.
[jira] [Assigned] (SPARK-17892) Query in CTAS is Optimized Twice (branch-2.0)
[ https://issues.apache.org/jira/browse/SPARK-17892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17892: Assignee: Xiao Li (was: Apache Spark) > Query in CTAS is Optimized Twice (branch-2.0) > - > > Key: SPARK-17892 > URL: https://issues.apache.org/jira/browse/SPARK-17892 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Yin Huai >Assignee: Xiao Li >Priority: Blocker > > This tracks the work that fixes the problem shown in > https://issues.apache.org/jira/browse/SPARK-17409 to branch 2.0.
[jira] [Commented] (SPARK-17892) Query in CTAS is Optimized Twice (branch-2.0)
[ https://issues.apache.org/jira/browse/SPARK-17892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15578300#comment-15578300 ] Apache Spark commented on SPARK-17892: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/15502
[jira] [Assigned] (SPARK-17892) Query in CTAS is Optimized Twice (branch-2.0)
[ https://issues.apache.org/jira/browse/SPARK-17892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17892: Assignee: Apache Spark (was: Xiao Li)
[jira] [Commented] (SPARK-636) Add mechanism to run system management/configuration tasks on all workers
[ https://issues.apache.org/jira/browse/SPARK-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15578294#comment-15578294 ] Michael Schmeißer commented on SPARK-636: - I agree, that's why I also feel that these issues are not duplicates. > Add mechanism to run system management/configuration tasks on all workers > - > > Key: SPARK-636 > URL: https://issues.apache.org/jira/browse/SPARK-636 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Josh Rosen > > It would be useful to have a mechanism to run a task on all workers in order > to perform system management tasks, such as purging caches or changing system > properties. This is useful for automated experiments and benchmarking; I > don't envision this being used for heavy computation. > Right now, I can mimic this with something like > {code} > sc.parallelize(0 until numMachines, numMachines).foreach { } > {code} > but this does not guarantee that every worker runs a task and requires my > user code to know the number of workers. > One sample use case is setup and teardown for benchmark tests. For example, > I might want to drop cached RDDs, purge shuffle data, and call > {{System.gc()}} between test runs. It makes sense to incorporate some of > this functionality, such as dropping cached RDDs, into Spark itself, but it > might be helpful to have a general mechanism for running ad-hoc tasks like > {{System.gc()}}.
[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15578289#comment-15578289 ] Michael Schmeißer commented on SPARK-650: - I disagree that those issues are duplicates. SPARK-636 looks for a generic way to execute code on the executors, but not for a reliable and easy mechanism to execute code during executor initialization.
[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15578200#comment-15578200 ] Ashish Shrowty commented on SPARK-17709: Oh .. sorry .. I misread. Will try with 2.0.1 later > spark 2.0 join - column resolution error > > > Key: SPARK-17709 > URL: https://issues.apache.org/jira/browse/SPARK-17709 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Ashish Shrowty >Priority: Critical > > If I try to inner-join two dataframes which originated from the same initial > dataframe that was loaded using spark.sql() call, it results in an error - > // reading from Hive .. the data is stored in Parquet format in Amazon S3 > val d1 = spark.sql("select * from ") > val df1 = d1.groupBy("key1","key2") > .agg(avg("totalprice").as("avgtotalprice")) > val df2 = d1.groupBy("key1","key2") > .agg(avg("itemcount").as("avgqty")) > df1.join(df2, Seq("key1","key2")) gives error - > org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can > not be resolved given input columns: [key1, key2, avgtotalprice, avgqty]; > If the same Dataframe is initialized via spark.read.parquet(), the above code > works. This same code above worked with Spark 1.6.2
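A commonly suggested workaround for this kind of self-join resolution failure (a sketch, not the fix tracked by this issue, and not guaranteed across Spark versions) is to alias both aggregated frames and join on explicitly qualified columns rather than `Seq("key1","key2")`. The table name `sales` is a placeholder for the elided name in the report, and a running `SparkSession` named `spark` is assumed:

```scala
// Hedged workaround sketch for the resolution error above: alias each side
// of the self-join and reference columns through the aliases, so the
// analyzer does not have to disambiguate attributes from a shared lineage.
// Assumes a SparkSession `spark`; "sales" is a placeholder table name.
import org.apache.spark.sql.functions.{avg, col}

val d1  = spark.sql("select * from sales")
val df1 = d1.groupBy("key1", "key2")
  .agg(avg("totalprice").as("avgtotalprice")).as("a")
val df2 = d1.groupBy("key1", "key2")
  .agg(avg("itemcount").as("avgqty")).as("b")

val joined = df1
  .join(df2, col("a.key1") === col("b.key1") && col("a.key2") === col("b.key2"))
  .select(col("a.key1"), col("a.key2"), col("a.avgtotalprice"), col("b.avgqty"))
```

Reading the source via `spark.read.parquet()` instead of `spark.sql()`, as the reporter notes, is the other way to sidestep the problem.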
[jira] [Commented] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module
[ https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15578176#comment-15578176 ] zhangxinyu commented on SPARK-17935: I will try it on 0.10 in the future. But as far as I know, most users are still using Kafka 0.8, since it's stable. So IMO the priority of 0.8 is higher than 0.10. > Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module > -- > > Key: SPARK-17935 > URL: https://issues.apache.org/jira/browse/SPARK-17935 > Project: Spark > Issue Type: Improvement > Components: SQL, Streaming >Affects Versions: 2.0.0 >Reporter: zhangxinyu > > Now spark already supports kafkaInputStream. It would be useful to add > `KafkaForeachWriter` to output results to Kafka in the structured streaming > module. > `KafkaForeachWriter.scala` is put in external kafka-0.8.0.
[jira] [Commented] (SPARK-17954) FetchFailedException executor cannot connect to another worker executor
[ https://issues.apache.org/jira/browse/SPARK-17954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15578155#comment-15578155 ] Sean Owen commented on SPARK-17954: --- Is this not just a networking or IP/hostname problem? it says it can't connect. > FetchFailedException executor cannot connect to another worker executor > --- > > Key: SPARK-17954 > URL: https://issues.apache.org/jira/browse/SPARK-17954 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0, 2.0.1 >Reporter: Vitaly Gerasimov > > I have standalone mode spark cluster wich have three nodes: > master.test > worker1.test > worker2.test > I am trying to run the next code in spark shell: > {code} > val json = spark.read.json("hdfs://master.test/json/a.js.gz", > "hdfs://master.test/json/b.js.gz") > json.createOrReplaceTempView("messages") > spark.sql("select count(*) from messages").show() > {code} > and I am getting the following exception: > {code} > org.apache.spark.shuffle.FetchFailedException: Failed to connect to > worker1.test/x.x.x.x:51029 > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Failed to connect to > worker1.test/x.x.x.x:51029 > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228) > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179) > at > org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:96) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140) > at > 
org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > ... 3 more > Caused by: java.net.ConnectException: Connection refused: > worker1.test/x.x.x.x:51029 > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) > at > io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSo
[jira] [Commented] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module
[ https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15578141#comment-15578141 ] Cody Koeninger commented on SPARK-17935: Why is this in kafka-0-8, when we haven't resolved (for the third time) whether we're even continuing to work on 0.8? Those modules have conflicting dependencies as is.
[jira] [Commented] (SPARK-11524) Support SparkR with Mesos cluster
[ https://issues.apache.org/jira/browse/SPARK-11524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15578137#comment-15578137 ] Susan X. Huynh commented on SPARK-11524: I would like to work on this. What are the missing pieces? What configurations need to be tested? [~sunrui][~felixcheung] > Support SparkR with Mesos cluster > - > > Key: SPARK-11524 > URL: https://issues.apache.org/jira/browse/SPARK-11524 > Project: Spark > Issue Type: New Feature > Components: SparkR >Affects Versions: 1.5.1 >Reporter: Sun Rui
[jira] [Assigned] (SPARK-17956) ProjectExec has incorrect outputOrdering property
[ https://issues.apache.org/jira/browse/SPARK-17956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17956: Assignee: Apache Spark > ProjectExec has incorrect outputOrdering property > - > > Key: SPARK-17956 > URL: https://issues.apache.org/jira/browse/SPARK-17956 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Apache Spark > > Currently ProjectExec simply takes child plan's outputOrdering as its > outputOrdering. In some cases, this leads to incorrect outputOrdering. This > applies to TakeOrderedAndProjectExec too.
[jira] [Assigned] (SPARK-17956) ProjectExec has incorrect outputOrdering property
[ https://issues.apache.org/jira/browse/SPARK-17956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17956: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-17956) ProjectExec has incorrect outputOrdering property
[ https://issues.apache.org/jira/browse/SPARK-17956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15578136#comment-15578136 ] Apache Spark commented on SPARK-17956: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/15500
[jira] [Commented] (SPARK-17938) Backpressure rate not adjusting
[ https://issues.apache.org/jira/browse/SPARK-17938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15578133#comment-15578133 ] Cody Koeninger commented on SPARK-17938: There was pretty extensive discussion of this on the list; it should be linked or summarized here. A couple of things: 100 is the default minimum rate for the PID estimator. If you're willing to write code, add more logging to determine why that rate isn't being configured, or hardcode it to a different number. I have successfully adjusted that rate using Spark configuration. The other thing is that if your system takes far longer than 1 second to process 100k records, 100k obviously isn't a reasonable max. Many large batches will be defined while that first batch is running, before backpressure is involved at all. Try a lower max. > Backpressure rate not adjusting > --- > > Key: SPARK-17938 > URL: https://issues.apache.org/jira/browse/SPARK-17938 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 2.0.0, 2.0.1 >Reporter: Samy Dindane > > spark-streaming 2.0.1 and spark-streaming-kafka-0-10 version is 2.0.1. Same > behavior with 2.0.0 though. > spark.streaming.kafka.consumer.poll.ms is set to 3 > spark.streaming.kafka.maxRatePerPartition is set to 10 > spark.streaming.backpressure.enabled is set to true > `batchDuration` of the streaming context is set to 1 second. > I consume a Kafka topic using KafkaUtils.createDirectStream(). > My system can handle 100k records batches, but it'd take more than 1 second > to process them all. I'd thus expect the backpressure to reduce the number of > records that would be fetched in the next batch to keep the processing delay > below 1 second. > Only this does not happen and the rate of the backpressure stays the same: > stuck at `100.0`, no matter how the other variables change (processing time, > error, etc.).
> Here's a log showing how all these variables change but the chosen rate stays > the same: https://gist.github.com/Dinduks/d9fa67fc8a036d3cad8e859c508acdba (I > would have attached a file but I don't see how). > Is this the expected behavior and I am missing something, or is this a bug? > I'll gladly help by providing more information or writing code if necessary. > Thank you.
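Cody's two suggestions amount to configuration changes. A hedged sketch as a spark-defaults.conf fragment (the numeric values are illustrative assumptions, not recommendations):

```properties
# Illustrative tuning for the situation described above: keep backpressure
# on, cap the per-partition max rate well below the 100k the report uses,
# and raise the PID estimator's floor above its default of 100 records/sec.
spark.streaming.backpressure.enabled      true
spark.streaming.kafka.maxRatePerPartition 1000
spark.streaming.backpressure.pid.minRate  500
```

With a lower `maxRatePerPartition`, the batches queued before the first batch completes stay small enough for the PID estimator to get meaningful feedback.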
[jira] [Created] (SPARK-17956) ProjectExec has incorrect outputOrdering property
Liang-Chi Hsieh created SPARK-17956: --- Summary: ProjectExec has incorrect outputOrdering property Key: SPARK-17956 URL: https://issues.apache.org/jira/browse/SPARK-17956 Project: Spark Issue Type: Bug Components: SQL Reporter: Liang-Chi Hsieh Currently ProjectExec simply takes child plan's outputOrdering as its outputOrdering. In some cases, this leads to incorrect outputOrdering. This applies to TakeOrderedAndProjectExec too.
[jira] [Commented] (SPARK-17838) Strict type checking for arguments with better messages across APIs.
[ https://issues.apache.org/jira/browse/SPARK-17838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15578101#comment-15578101 ] Hyukjin Kwon commented on SPARK-17838: -- I am trying to work on this, but I don't mind if anyone takes it over. It seems it will take a bit of time. > Strict type checking for arguments with better messages across APIs. > -- > > Key: SPARK-17838 > URL: https://issues.apache.org/jira/browse/SPARK-17838 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Hyukjin Kwon > > It seems there should be more strict type checking for arguments in SparkR > APIs. This was discussed in several PRs. > https://github.com/apache/spark/pull/15239#discussion_r82445435 > Roughly it seems there are three cases as below: > The first case below was described in > https://github.com/apache/spark/pull/15239#discussion_r82445435 > - Check for {{zero-length variable name}} > Some of the other cases below were handled in > https://github.com/apache/spark/pull/15231#discussion_r80417904 > - Catch the exception from the JVM and format it nicely > - Strictly check types before calling the JVM in SparkR
[jira] [Commented] (SPARK-17878) Support for multiple null values when reading CSV data
[ https://issues.apache.org/jira/browse/SPARK-17878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15578098#comment-15578098 ] Hyukjin Kwon commented on SPARK-17878: -- Yes, this might not be a reason to block this feature. It seems not easy and would need a lot of changes. Maybe, how about adding {{option(value: List[String])}} and then converting it into a JSON array? Actually, I can move this into another JIRA since it might not be a reason to block this one, but I just want to hear your thoughts before I try. > Support for multiple null values when reading CSV data > -- > > Key: SPARK-17878 > URL: https://issues.apache.org/jira/browse/SPARK-17878 > Project: Spark > Issue Type: Story > Components: SQL >Affects Versions: 2.0.1 >Reporter: Hossein Falaki > > There are CSV files out there with multiple values that are supposed to be > interpreted as null. As a result, multiple Spark users have asked for this > feature to be built into the CSV data source. It can be easily implemented in a > backwards compatible way: > - Currently the CSV data source supports an option named {{nullValue}}. > - We can add logic in {{CSVOptions}} to understand option names that match > {{nullValue[\d]}}. This way the user can specify a query with one or multiple > null values. > {code} > val df = spark.read.format("CSV").option("nullValue1", > "-").option("nullValue2", "*") > {code}
[jira] [Commented] (SPARK-17945) Writing to S3 should allow setting object metadata
[ https://issues.apache.org/jira/browse/SPARK-17945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15578064#comment-15578064 ] Jeff Schobelock commented on SPARK-17945: - That's true enough. The use case on my end is that we use S3 object metadata to get things like file delimiters when we process files. > Writing to S3 should allow setting object metadata > -- > > Key: SPARK-17945 > URL: https://issues.apache.org/jira/browse/SPARK-17945 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: Jeff Schobelock >Priority: Minor > > I can't find any possible way to use Spark to write to S3 and set user object > metadata. This seems like such a simple thing that I feel I must be missing > somewhere how to do it, but I have yet to find anything. > I don't know how much work adding this would entail. My idea would be that > there is something like: > rdd.saveAsTextFile(s3://testbucket/file).withMetadata(Map > data).
[jira] [Assigned] (SPARK-17955) Use the same read path in DataFrameReader.jdbc and DataFrameReader.format("jdbc")
[ https://issues.apache.org/jira/browse/SPARK-17955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17955: Assignee: Apache Spark > Use the same read path in DataFrameReader.jdbc and > DataFrameReader.format("jdbc") > -- > > Key: SPARK-17955 > URL: https://issues.apache.org/jira/browse/SPARK-17955 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Trivial > > It seems APIs in {{DataFrameReader}}/{{DataFrameWriter}} share > {{format("...").load()}} or {{format("...").save()}} APIs for > {{json(...)}}/{{csv(...)}} and etc. > We can share this within {{DataFrameReader.jdbc(...)}} too consistently with > other APIs. > {code} > -// connectionProperties should override settings in extraOptions. > -val params = extraOptions.toMap ++ connectionProperties.asScala.toMap > -val options = new JDBCOptions(url, table, params) > -val relation = JDBCRelation(parts, options)(sparkSession) > -sparkSession.baseRelationToDataFrame(relation) > +// connectionProperties should override settings in extraOptions > +this.extraOptions = this.extraOptions ++ (connectionProperties.asScala) > +// explicit url and dbtable should override all > +this.extraOptions += ("url" -> url, "dbtable" -> table) > +format("jdbc").load() > {code}
[jira] [Assigned] (SPARK-17955) Use the same read path in DataFrameReader.jdbc and DataFrameReader.format("jdbc")
[ https://issues.apache.org/jira/browse/SPARK-17955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17955: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-17955) Use the same read path in DataFrameReader.jdbc and DataFrameReader.format("jdbc")
[ https://issues.apache.org/jira/browse/SPARK-17955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15578038#comment-15578038 ] Apache Spark commented on SPARK-17955: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/15499 > Use the same read path in DataFrameReader.jdbc and > DataFrameReader.format("jdbc") > -- > > Key: SPARK-17955 > URL: https://issues.apache.org/jira/browse/SPARK-17955 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon >Priority: Trivial > > It seems APIs in {{DataFrameReader}}/{{DataFrameWriter}} share > {{format("...").load()}} or {{format("...").save()}} APIs for > {{json(...)}}/{{csv(...)}} and etc. > We can share this within {{DataFrameReader.jdbc(...)}} too consistently with > other APIs. > {code} > -// connectionProperties should override settings in extraOptions. > -val params = extraOptions.toMap ++ connectionProperties.asScala.toMap > -val options = new JDBCOptions(url, table, params) > -val relation = JDBCRelation(parts, options)(sparkSession) > -sparkSession.baseRelationToDataFrame(relation) > +// connectionProperties should override settings in extraOptions > +this.extraOptions = this.extraOptions ++ (connectionProperties.asScala) > +// explicit url and dbtable should override all > +this.extraOptions += ("url" -> url, "dbtable" -> table) > +format("jdbc").load() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17955) Use the same read path in DataFrameReader.jdbc and DataFrameReader.format("jdbc")
Hyukjin Kwon created SPARK-17955: Summary: Use the same read path in DataFrameReader.jdbc and DataFrameReader.format("jdbc") Key: SPARK-17955 URL: https://issues.apache.org/jira/browse/SPARK-17955 Project: Spark Issue Type: Improvement Components: SQL Reporter: Hyukjin Kwon Priority: Trivial The APIs in {{DataFrameReader}}/{{DataFrameWriter}} such as {{json(...)}} and {{csv(...)}} share the {{format("...").load()}} or {{format("...").save()}} paths. {{DataFrameReader.jdbc(...)}} can share this path too, for consistency with the other APIs. {code} -// connectionProperties should override settings in extraOptions. -val params = extraOptions.toMap ++ connectionProperties.asScala.toMap -val options = new JDBCOptions(url, table, params) -val relation = JDBCRelation(parts, options)(sparkSession) -sparkSession.baseRelationToDataFrame(relation) +// connectionProperties should override settings in extraOptions +this.extraOptions = this.extraOptions ++ (connectionProperties.asScala) +// explicit url and dbtable should override all +this.extraOptions += ("url" -> url, "dbtable" -> table) +format("jdbc").load() {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
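The option precedence the patch aims for (defaults in extraOptions, overridden by connectionProperties, with the explicit url and dbtable winning over everything) can be sketched with plain dictionaries; the helper name below is illustrative, not Spark's actual internals:

```python
def build_jdbc_options(extra_options, connection_properties, url, table):
    # Later updates win: connectionProperties override extraOptions,
    # and the explicit url/dbtable override everything else.
    options = dict(extra_options)
    options.update(connection_properties)
    options.update({"url": url, "dbtable": table})
    return options

opts = build_jdbc_options(
    {"fetchsize": "100", "url": "jdbc:old"},   # extraOptions (lowest priority)
    {"fetchsize": "500"},                      # connectionProperties
    "jdbc:postgresql://db/test",               # explicit url (highest priority)
    "people",
)
print(opts)  # → {'fetchsize': '500', 'url': 'jdbc:postgresql://db/test', 'dbtable': 'people'}
```

This mirrors why the proposed diff applies `connectionProperties` to `extraOptions` first and sets `url`/`dbtable` last before calling `format("jdbc").load()`.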
[jira] [Commented] (SPARK-14428) [SQL] Allow more flexibility when parsing dates and timestamps in json datasources
[ https://issues.apache.org/jira/browse/SPARK-14428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15577994#comment-15577994 ] Hyukjin Kwon commented on SPARK-14428: -- For 1. I guess this was fixed in https://github.com/apache/spark/pull/14279 so we should define the format for read/write. > [SQL] Allow more flexibility when parsing dates and timestamps in json > datasources > -- > > Key: SPARK-14428 > URL: https://issues.apache.org/jira/browse/SPARK-14428 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.1 >Reporter: Michel Lemay >Priority: Minor > Labels: date, features, json, timestamp > > Reading a json with dates and timestamps is limited to predetermined string > formats or long values. > 1) Should be able to set an option on json datasource to parse dates and > timestamps using custom string format. > 2) Should be able to change the interpretation of long values since epoch. > It could support different precisions like days, seconds, milliseconds, > microseconds and nanoseconds. > Something in the lines of : > {code} > object Precision extends Enumeration { > val days, seconds, milliseconds, microseconds, nanoseconds = Value > } > def convertWithPrecision(time: Long, from: Precision.Value, to: > Precision.Value): Long = ... > ... > val dateFormat = parameters.getOrElse("dateFormat", "").trim > val timestampFormat = parameters.getOrElse("timestampFormat", "").trim > val longDatePrecision = getOrElse("longDatePrecision", "days") > val longTimestampPrecision = getOrElse("longTimestampPrecision", > "milliseconds") > {code} > and > {code} > case (VALUE_STRING, DateType) => > val stringValue = parser.getText > val days = if (configOptions.dateFormat.nonEmpty) { > // User defined format, make sure it complies to the SQL DATE > format (number of days) > val sdf = new SimpleDateFormat(configOptions.dateFormat) // Not > thread safe. 
> DateTimeUtils.convertWithPrecision(sdf.parse(stringValue).getTime, > Precision.milliseconds, Precision.days) > } else if (stringValue.forall(_.isDigit)) { > DateTimeUtils.convertWithPrecision(stringValue.toLong, > configOptions.longDatePrecision, Precision.days) > } else { > // The format of this string will probably be "yyyy-mm-dd". > > DateTimeUtils.convertWithPrecision(DateTimeUtils.stringToTime(parser.getText).getTime, > Precision.milliseconds, Precision.days) > } > days.toInt > case (VALUE_NUMBER_INT, DateType) => > DateTimeUtils.convertWithPrecision(parser.getLongValue, > configOptions.longDatePrecision, Precision.days).toInt > {code} > With similar handling for Timestamps. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
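A minimal, self-contained version of the proposed {{convertWithPrecision}} can be written with per-unit factors expressed in nanoseconds (the Scala snippet in the issue is pseudocode; this Python sketch only illustrates the unit arithmetic, not Spark's DateTimeUtils):

```python
# Nanoseconds per unit, covering the precisions proposed in the issue.
_NANOS_PER_UNIT = {
    "days": 86_400_000_000_000,
    "seconds": 1_000_000_000,
    "milliseconds": 1_000_000,
    "microseconds": 1_000,
    "nanoseconds": 1,
}

def convert_with_precision(time, from_unit, to_unit):
    """Convert a duration since epoch between precisions (truncating)."""
    return time * _NANOS_PER_UNIT[from_unit] // _NANOS_PER_UNIT[to_unit]

# 86_400_000 milliseconds is exactly one day.
print(convert_with_precision(86_400_000, "milliseconds", "days"))  # → 1
```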
[jira] [Updated] (SPARK-17954) FetchFailedException executor cannot connect to another worker executor
[ https://issues.apache.org/jira/browse/SPARK-17954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vitaly Gerasimov updated SPARK-17954: - Issue Type: Bug (was: Question) > FetchFailedException executor cannot connect to another worker executor > --- > > Key: SPARK-17954 > URL: https://issues.apache.org/jira/browse/SPARK-17954 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0, 2.0.1 >Reporter: Vitaly Gerasimov > > I have standalone mode spark cluster wich have three nodes: > master.test > worker1.test > worker2.test > I am trying to run the next code in spark shell: > {code} > val json = spark.read.json("hdfs://master.test/json/a.js.gz", > "hdfs://master.test/json/b.js.gz") > json.createOrReplaceTempView("messages") > spark.sql("select count(*) from messages").show() > {code} > and I am getting the following exception: > {code} > org.apache.spark.shuffle.FetchFailedException: Failed to connect to > worker1.test/x.x.x.x:51029 > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Failed to connect to > worker1.test/x.x.x.x:51029 > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228) > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179) > at > org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:96) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43) > at > 
org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > ... 3 more > Caused by: java.net.ConnectException: Connection refused: > worker1.test/x.x.x.x:51029 > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) > at > io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224) > at > io.netty.channel.nio.AbstractNioChannel$AbstractNioU
[jira] [Created] (SPARK-17954) FetchFailedException executor cannot connect to another worker executor
Vitaly Gerasimov created SPARK-17954: Summary: FetchFailedException executor cannot connect to another worker executor Key: SPARK-17954 URL: https://issues.apache.org/jira/browse/SPARK-17954 Project: Spark Issue Type: Question Affects Versions: 2.0.1, 2.0.0 Reporter: Vitaly Gerasimov I have standalone mode spark cluster wich have three nodes: master.test worker1.test worker2.test I am trying to run the next code in spark shell: {code} val json = spark.read.json("hdfs://master.test/json/a.js.gz", "hdfs://master.test/json/b.js.gz") json.createOrReplaceTempView("messages") spark.sql("select count(*) from messages").show() {code} and I am getting the following exception: {code} org.apache.spark.shuffle.FetchFailedException: Failed to connect to worker1.test/x.x.x.x:51029 at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: Failed to connect to worker1.test/x.x.x.x:51029 at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228) at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179) at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:96) at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140) at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43) at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) ... 
3 more Caused by: java.net.ConnectException: Connection refused: worker1.test/x.x.x.x:51029 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224) at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoo
[jira] [Commented] (SPARK-17930) The SerializerInstance instance used when deserializing a TaskResult is not reused
[ https://issues.apache.org/jira/browse/SPARK-17930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15577852#comment-15577852 ] Sean Owen commented on SPARK-17930: --- Maybe so; I think the question is whether this would cause it to be used by multiple threads at once. Do you have any rough benchmarks that suggest this is a bottleneck? > The SerializerInstance instance used when deserializing a TaskResult is not > reused > --- > > Key: SPARK-17930 > URL: https://issues.apache.org/jira/browse/SPARK-17930 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.1, 2.0.1 >Reporter: Guoqiang Li > > The following code is called when the DirectTaskResult instance is > deserialized > {noformat} > def value(): T = { > if (valueObjectDeserialized) { > valueObject > } else { > // Each deserialization creates a new instance of SerializerInstance, > which is very time-consuming > val resultSer = SparkEnv.get.serializer.newInstance() > valueObject = resultSer.deserialize(valueBytes) > valueObjectDeserialized = true > valueObject > } > } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
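A common way to reuse a non-thread-safe instance without sharing it across threads (the concern Sean raises above) is a thread-local cache. This is a generic sketch of the pattern, not Spark's code; `ExpensiveSerializer` is a hypothetical stand-in for `SparkEnv.get.serializer.newInstance()`:

```python
import threading

class ExpensiveSerializer:
    instances = 0  # counts constructions, to demonstrate reuse

    def __init__(self):
        type(self).instances += 1

    def deserialize(self, data):
        return data  # stand-in for the real deserialization work

_local = threading.local()

def get_serializer():
    # Each thread constructs at most one instance and reuses it,
    # so no instance is ever touched by two threads at once.
    if not hasattr(_local, "ser"):
        _local.ser = ExpensiveSerializer()
    return _local.ser

for _ in range(100):
    get_serializer().deserialize(b"payload")
print(ExpensiveSerializer.instances)  # → 1 on a single thread
```

Whether the construction cost is actually a bottleneck is the open question in the thread; a benchmark would need to show it dominating task-result handling before this kind of caching is worthwhile.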
[jira] [Resolved] (SPARK-17936) "CodeGenerator - failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of" method Error
[ https://issues.apache.org/jira/browse/SPARK-17936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17936. --- Resolution: Fixed Assignee: Takuya Ueshin Fix Version/s: 2.1.0 OK, if you're pretty confident that also happened to fix this issue, that's great. It may even resolve the other JIRA I linked. Evidently resolved by https://github.com/apache/spark/pull/15275 > "CodeGenerator - failed to compile: > org.codehaus.janino.JaninoRuntimeException: Code of" method Error > - > > Key: SPARK-17936 > URL: https://issues.apache.org/jira/browse/SPARK-17936 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: Justin Miller >Assignee: Takuya Ueshin > Fix For: 2.1.0 > > > Greetings. I'm currently in the process of migrating a project I'm working on > from Spark 1.6.2 to 2.0.1. The project uses Spark Streaming to convert Thrift > structs coming from Kafka into Parquet files stored in S3. This conversion > process works fine in 1.6.2 but I think there may be a bug in 2.0.1. I'll > paste the stack trace below. > org.codehaus.janino.JaninoRuntimeException: Code of method > "(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass;[Ljava/lang/Object;)V" > of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" > grows beyond 64 KB > at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941) > at org.codehaus.janino.CodeContext.write(CodeContext.java:854) > at org.codehaus.janino.UnitCompiler.writeShort(UnitCompiler.java:10242) > at org.codehaus.janino.UnitCompiler.writeLdc(UnitCompiler.java:9058) > Also, later on: > 07:35:30.191 ERROR o.a.s.u.SparkUncaughtExceptionHandler - Uncaught exception > in thread Thread[Executor task launch worker-6,5,run-main-group-0] > java.lang.OutOfMemoryError: Java heap space > I've seen similar issues posted, but those were always on the query side. 
I > have a hunch that this is happening at write time as the error occurs after > batchDuration. Here's the write snippet. > stream. > flatMap { > case Success(row) => > thriftParseSuccess += 1 > Some(row) > case Failure(ex) => > thriftParseErrors += 1 > logger.error("Error during deserialization: ", ex) > None > }.foreachRDD { rdd => > val sqlContext = SQLContext.getOrCreate(rdd.context) > transformer(sqlContext.createDataFrame(rdd, converter.schema)) > .coalesce(coalesceSize) > .write > .mode(Append) > .partitionBy(partitioning: _*) > .parquet(parquetPath) > } > Please let me know if you can be of assistance and if there's anything I can > do to help. > Best, > Justin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14887) Generated SpecificUnsafeProjection Exceeds JVM Code Size Limits
[ https://issues.apache.org/jira/browse/SPARK-14887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15577832#comment-15577832 ] Sean Owen commented on SPARK-14887: --- Possibly resolved by https://issues.apache.org/jira/browse/SPARK-17702 given discussion in https://issues.apache.org/jira/browse/SPARK-17936 but I'm not sure. > Generated SpecificUnsafeProjection Exceeds JVM Code Size Limits > --- > > Key: SPARK-14887 > URL: https://issues.apache.org/jira/browse/SPARK-14887 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: fang fang chen >Assignee: Davies Liu > > Similiar issue with SPARK-14138 and SPARK-8443: > With large sql syntax(673K), following error happened: > Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method > "(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V" > of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" > grows beyond 64 KB > at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941) > at org.codehaus.janino.CodeContext.write(CodeContext.java:854) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17944) sbin/start-* scripts use of `hostname -f` fail with Solaris
[ https://issues.apache.org/jira/browse/SPARK-17944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15577826#comment-15577826 ] Sean Owen commented on SPARK-17944: --- Yes, for example `hostname` differs on OS X / Linux but the behavior of `-f` happens to give the same desired output on both. Is `check-hostname` present only on Solaris (or at least, does it do the right thing everywhere we think it exists)? If so I could imagine just checking if that command exists and using it if present. There are 3 usages, and all could have the same treatment. That's not so bad IMHO. A central script to switch out commands seems appealing but the indirection non-trivially complicates the scripts. I'm not sure how far to go to try to support Solaris, not because there's anything wrong with it, but because a) I don't know how many Spark users are on Solaris, and b) I am not sure we know it otherwise works, that there aren't other problems like this > sbin/start-* scripts use of `hostname -f` fail with Solaris > > > Key: SPARK-17944 > URL: https://issues.apache.org/jira/browse/SPARK-17944 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.1 > Environment: Solaris 10, Solaris 11 >Reporter: Erik O'Shaughnessy >Priority: Trivial > > {{$SPARK_HOME/sbin/start-master.sh}} fails: > {noformat} > $ ./start-master.sh > usage: hostname [[-t] system_name] >hostname [-D] > starting org.apache.spark.deploy.master.Master, logging to > /home/eoshaugh/local/spark/logs/spark-eoshaugh-org.apache.spark.deploy.master.Master-1-m7-16-002-ld1.out > failed to launch org.apache.spark.deploy.master.Master: > --properties-file FILE Path to a custom Spark properties file. >Default is conf/spark-defaults.conf. 
> full log in > /home/eoshaugh/local/spark/logs/spark-eoshaugh-org.apache.spark.deploy.master.Master-1-m7-16-002-ld1.out > {noformat} > I found SPARK-17546 which changed the invocation of hostname in > sbin/start-master.sh, sbin/start-slaves.sh and sbin/start-mesos-dispatcher.sh > to include the flag {{-f}}, which is not a valid command line option for the > Solaris hostname implementation. > As a workaround, Solaris users can substitute: > {noformat} > `/usr/sbin/check-hostname | awk '{print $NF}'` > {noformat} > Admittedly not an obvious fix, but it provides equivalent functionality. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
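The treatment Sean suggests (use the platform-specific command when it exists, fall back otherwise) amounts to a small existence check. Sketched in Python for clarity; the command names come from the report above, and the helper itself is hypothetical, not part of the sbin scripts:

```python
import shutil

def fqdn_command():
    # Prefer Solaris's check-hostname when it exists; otherwise assume a
    # hostname implementation that understands -f (Linux, macOS).
    if shutil.which("/usr/sbin/check-hostname"):
        return ["/usr/sbin/check-hostname"]
    return ["hostname", "-f"]

print(fqdn_command())
```

On Solaris the output of `check-hostname` would still need the `awk '{print $NF}'` post-processing from the workaround above to extract the bare FQDN.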
[jira] [Commented] (SPARK-17951) BlockFetch with multiple threads slows down after spark 1.6
[ https://issues.apache.org/jira/browse/SPARK-17951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15577637#comment-15577637 ] Sean Owen commented on SPARK-17951: --- I'm not sure this suggests a problem with Spark. It's a very small difference in absolute terms and may say more about the benchmark than Spark. It's also an internal API. Do you have a bigger difference in calls to a user-facing API? > BlockFetch with multiple threads slows down after spark 1.6 > --- > > Key: SPARK-17951 > URL: https://issues.apache.org/jira/browse/SPARK-17951 > Project: Spark > Issue Type: Bug > Components: Block Manager, Spark Core >Affects Versions: 1.6.2 > Environment: cluster with 8 node, each node has 28 cores. 10Gb network >Reporter: ding > > The following code demonstrates the issue: > {code} > def main(args: Array[String]): Unit = { > val conf = new SparkConf().setAppName(s"BMTest") > val size = 3344570 > val sc = new SparkContext(conf) > val data = sc.parallelize(1 to 100, 8) > var accum = sc.accumulator(0.0, "get remote bytes") > var i = 0 > while(i < 91) { > accum = sc.accumulator(0.0, "get remote bytes") > val test = data.mapPartitionsWithIndex { (pid, iter) => > val N = size > val bm = SparkEnv.get.blockManager > val blockId = TaskResultBlockId(10*i + pid) > val test = new Array[Byte](N) > Random.nextBytes(test) > val buffer = ByteBuffer.allocate(N) > buffer.limit(N) > buffer.put(test) > bm.putBytes(blockId, buffer, StorageLevel.MEMORY_ONLY_SER) > Iterator(1) > }.count() > > data.mapPartitionsWithIndex { (pid, iter) => > val before = System.nanoTime() > > val bm = SparkEnv.get.blockManager > (0 to 7).map(s => { > Future { > val result = bm.getRemoteBytes(TaskResultBlockId(10*i + s)) > } > }).map(Await.result(_, Duration.Inf)) > > accum.add((System.nanoTime() - before) / 1e9) > Iterator(1) > }.count() > println("get remote bytes take: " + accum.value/8) > i += 1 > } > } > {code} > In spark1.6.2, average of "getting remote bytes" 
time is: 0.19 s while > in spark 1.5.1 average of "getting remote bytes" time is: 0.09 s > However if fetch block in single thread, the gap is much smaller. > spark1.6.2 get remote bytes: 0.21 s > spark1.5.1 get remote bytes: 0.20 s -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
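The benchmark's structure (fire several fetches concurrently, await them all, record the elapsed wall time) is the standard pattern below; `fetch_block` is a stand-in for `bm.getRemoteBytes`, which this sketch does not attempt to reproduce:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_block(block_id):
    # Stand-in for bm.getRemoteBytes: simulate a fixed-latency remote fetch.
    time.sleep(0.01)
    return bytes(block_id)  # zero-filled payload of block_id bytes

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(fetch_block, range(8)))
elapsed = time.perf_counter() - start
print(len(results))  # → 8
```

If the eight fetches truly run in parallel, `elapsed` stays close to a single fetch's latency; a regression like the one reported shows up as the elapsed time drifting toward the serial sum.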
[jira] [Updated] (SPARK-17951) BlockFetch with multiple threads slows down after spark 1.6
[ https://issues.apache.org/jira/browse/SPARK-17951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-17951: -- Description: The following code demonstrates the issue: {code} def main(args: Array[String]): Unit = { val conf = new SparkConf().setAppName(s"BMTest") val size = 3344570 val sc = new SparkContext(conf) val data = sc.parallelize(1 to 100, 8) var accum = sc.accumulator(0.0, "get remote bytes") var i = 0 while(i < 91) { accum = sc.accumulator(0.0, "get remote bytes") val test = data.mapPartitionsWithIndex { (pid, iter) => val N = size val bm = SparkEnv.get.blockManager val blockId = TaskResultBlockId(10*i + pid) val test = new Array[Byte](N) Random.nextBytes(test) val buffer = ByteBuffer.allocate(N) buffer.limit(N) buffer.put(test) bm.putBytes(blockId, buffer, StorageLevel.MEMORY_ONLY_SER) Iterator(1) }.count() data.mapPartitionsWithIndex { (pid, iter) => val before = System.nanoTime() val bm = SparkEnv.get.blockManager (0 to 7).map(s => { Future { val result = bm.getRemoteBytes(TaskResultBlockId(10*i + s)) } }).map(Await.result(_, Duration.Inf)) accum.add((System.nanoTime() - before) / 1e9) Iterator(1) }.count() println("get remote bytes take: " + accum.value/8) i += 1 } } {code} In spark1.6.2, average of "getting remote bytes" time is: 0.19 s while in spark 1.5.1 average of "getting remote bytes" time is: 0.09 s However if fetch block in single thread, the gap is much smaller. 
spark1.6.2 get remote bytes: 0.21 s spark1.5.1 get remote bytes: 0.20 s was: The following code demonstrates the issue: def main(args: Array[String]): Unit = { val conf = new SparkConf().setAppName(s"BMTest") val size = 3344570 val sc = new SparkContext(conf) val data = sc.parallelize(1 to 100, 8) var accum = sc.accumulator(0.0, "get remote bytes") var i = 0 while(i < 91) { accum = sc.accumulator(0.0, "get remote bytes") val test = data.mapPartitionsWithIndex { (pid, iter) => val N = size val bm = SparkEnv.get.blockManager val blockId = TaskResultBlockId(10*i + pid) val test = new Array[Byte](N) Random.nextBytes(test) val buffer = ByteBuffer.allocate(N) buffer.limit(N) buffer.put(test) bm.putBytes(blockId, buffer, StorageLevel.MEMORY_ONLY_SER) Iterator(1) }.count() data.mapPartitionsWithIndex { (pid, iter) => val before = System.nanoTime() val bm = SparkEnv.get.blockManager (0 to 7).map(s => { Future { val result = bm.getRemoteBytes(TaskResultBlockId(10*i + s)) } }).map(Await.result(_, Duration.Inf)) accum.add((System.nanoTime() - before) / 1e9) Iterator(1) }.count() println("get remote bytes take: " + accum.value/8) i += 1 } } In spark1.6.2, average of "getting remote bytes" time is: 0.19 s while in spark 1.5.1 average of "getting remote bytes" time is: 0.09 s However if fetch block in single thread, the gap is much smaller. spark1.6.2 get remote bytes: 0.21 s spark1.5.1 get remote bytes: 0.20 s > BlockFetch with multiple threads slows down after spark 1.6 > --- > > Key: SPARK-17951 > URL: https://issues.apache.org/jira/browse/SPARK-17951 > Project: Spark > Issue Type: Bug > Components: Block Manager, Spark Core >Affects Versions: 1.6.2 > Environment: cluster with 8 node, each node has 28 cores. 
10Gb network >Reporter: ding > > The following code demonstrates the issue: > {code} > def main(args: Array[String]): Unit = { > val conf = new SparkConf().setAppName(s"BMTest") > val size = 3344570 > val sc = new SparkContext(conf) > val data = sc.parallelize(1 to 100, 8) > var accum = sc.accumulator(0.0, "get remote bytes") > var i = 0 > while(i < 91) { > accum = sc.accumulator(0.0, "get remote bytes") > val test = data.mapPartitionsWithIndex { (pid, iter) => > val N = size > val bm = SparkEnv.get.blockManager > val blockId = TaskResultBlockId(10*i + pid) > val test = new Array[Byte](N) > Random.nextBytes(test) > val buffer = ByteBuffer.allocate(N) > buffer.limit(N) > buffer.put(test) > bm.putBytes(blockId, buffer, StorageLevel.MEMORY_ONLY_SER) > Iterator(1) > }.count() > > data.mapPartitionsWithIndex { (pid, iter) => > val before = System.nanoTime() > >
[jira] [Updated] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-16845: Target Version/s: 2.1.0 > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" > grows beyond 64 KB > - > > Key: SPARK-16845 > URL: https://issues.apache.org/jira/browse/SPARK-16845 > Project: Spark > Issue Type: Bug > Components: Java API, ML, MLlib >Affects Versions: 2.0.0 >Reporter: hejie > > I have a wide table(400 columns), when I try fitting the traindata on all > columns, the fatal error occurs. > ... 46 more > Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method > "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" > of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" > grows beyond 64 KB > at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941) > at org.codehaus.janino.CodeContext.write(CodeContext.java:854) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17953) Fix typo in SparkSession scaladoc
[ https://issues.apache.org/jira/browse/SPARK-17953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-17953: -- Priority: Trivial (was: Minor) > Fix typo in SparkSession scaladoc > - > > Key: SPARK-17953 > URL: https://issues.apache.org/jira/browse/SPARK-17953 > Project: Spark > Issue Type: Documentation > Components: Documentation >Reporter: Tae Jun Kim >Assignee: Tae Jun Kim >Priority: Trivial > Fix For: 2.0.2, 2.1.0 > > > At SparkSession builder example, > {code} > SparkSession.builder() > .master("local") > .appName("Word Count") > .config("spark.some.config.option", "some-value"). > .getOrCreate() > {code} > should be > {code} > SparkSession.builder() > .master("local") > .appName("Word Count") > .config("spark.some.config.option", "some-value") > .getOrCreate() > {code} > There was one dot at the end of the third line > {code} > .config("spark.some.config.option", "some-value"). > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17950) Match SparseVector behavior with DenseVector
[ https://issues.apache.org/jira/browse/SPARK-17950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15577596#comment-15577596 ] Sean Owen commented on SPARK-17950: --- Doesn't that make a potentially huge array every time this is called? > Match SparseVector behavior with DenseVector > > > Key: SPARK-17950 > URL: https://issues.apache.org/jira/browse/SPARK-17950 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark >Affects Versions: 2.0.1 >Reporter: AbderRahman Sobh >Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h > > Simply added the `__getattr__` to SparseVector that DenseVector has, but > calls self.toArray() instead of storing a vector all the time in self.array > This allows for use of numpy functions on the values of a SparseVector in the > same direct way that users interact with DenseVectors. > i.e. you can simply call SparseVector.mean() to average the values in the > entire vector. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
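The proposal above — delegating unknown attribute lookups to the dense array — can be sketched in plain Python. This is a hypothetical stand-in class, not the actual `pyspark.ml.linalg.SparseVector`; it only illustrates the mechanism and why Sean Owen's concern applies: every delegated call materializes the full dense array.

```python
import numpy as np

class SparseVector:
    """Hypothetical stand-in for PySpark's SparseVector, sketching the
    proposed __getattr__ delegation (not the real Spark class)."""

    def __init__(self, size, indices, values):
        self.size = size
        self.indices = list(indices)
        self.values = list(values)

    def toArray(self):
        # Materializes the full dense array: O(size) memory every call,
        # which is the cost the review comment warns about.
        arr = np.zeros(self.size)
        arr[self.indices] = self.values
        return arr

    def __getattr__(self, name):
        # Only reached for attributes not found on the instance itself;
        # delegate (mean, sum, std, ...) to the dense numpy array,
        # mirroring how DenseVector exposes numpy methods.
        return getattr(self.toArray(), name)

sv = SparseVector(4, [1, 3], [2.0, 6.0])
print(sv.mean())  # -> 2.0 (averages over all 4 slots, zeros included)
```

Note that `sv.mean()` averages the implicit zeros too, matching DenseVector semantics rather than averaging only the stored values.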
[jira] [Resolved] (SPARK-17953) Fix typo in SparkSession scaladoc
[ https://issues.apache.org/jira/browse/SPARK-17953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-17953. - Resolution: Fixed Assignee: Tae Jun Kim Fix Version/s: 2.1.0 2.0.2 > Fix typo in SparkSession scaladoc > - > > Key: SPARK-17953 > URL: https://issues.apache.org/jira/browse/SPARK-17953 > Project: Spark > Issue Type: Documentation > Components: Documentation >Reporter: Tae Jun Kim >Assignee: Tae Jun Kim >Priority: Minor > Fix For: 2.0.2, 2.1.0 > > > At SparkSession builder example, > {code} > SparkSession.builder() > .master("local") > .appName("Word Count") > .config("spark.some.config.option", "some-value"). > .getOrCreate() > {code} > should be > {code} > SparkSession.builder() > .master("local") > .appName("Word Count") > .config("spark.some.config.option", "some-value") > .getOrCreate() > {code} > There was one dot at the end of the third line > {code} > .config("spark.some.config.option", "some-value"). > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17945) Writing to S3 should allow setting object metadata
[ https://issues.apache.org/jira/browse/SPARK-17945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15577593#comment-15577593 ] Sean Owen commented on SPARK-17945: --- You can just set this with the S3 APIs directly? This borders on something it's not worth Spark passing through just for S3 unless there's a common use case for it and possibly other FSes > Writing to S3 should allow setting object metadata > -- > > Key: SPARK-17945 > URL: https://issues.apache.org/jira/browse/SPARK-17945 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: Jeff Schobelock >Priority: Minor > > I can't find any possible way to use Spark to write to S3 and set user object > metadata. This seems like such a simple thing that I feel I must be missing > somewhere how to do it, but I have yet to find anything. > I don't know how much work adding this would entail. My idea would be that > there is something like: > rdd.saveAsTextFile(s3://testbucket/file).withMetadata(Map > data). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
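The workaround the comment points at — attaching user metadata via the S3 API itself rather than through Spark — can be sketched as below. S3 stores user metadata as `x-amz-meta-*` headers, which boto3 exposes through the `Metadata` parameter of `put_object`. The helper only assembles the keyword arguments; the bucket, key, and metadata values are made up for illustration, and actually sending them requires boto3 and AWS credentials.

```python
def s3_put_object_args(bucket, key, body, user_metadata):
    """Build the kwargs for an S3 put_object call that attaches user
    metadata (hypothetical names; sketch of the direct-S3 workaround)."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        # boto3 sends these as x-amz-meta-* headers on the object.
        "Metadata": dict(user_metadata),
    }

args = s3_put_object_args("testbucket", "file/part-00000", b"...",
                          {"owner": "jeff", "job": "nightly-etl"})
# With boto3 installed and credentials configured, one would then call:
#   boto3.client("s3").put_object(**args)
print(sorted(args["Metadata"]))
```

This is per-object, after the fact; it does not give Spark's `saveAsTextFile` a `withMetadata` hook, which is exactly the gap the issue describes.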
[jira] [Assigned] (SPARK-17953) Fix typo in SparkSession scaladoc
[ https://issues.apache.org/jira/browse/SPARK-17953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17953: Assignee: Apache Spark > Fix typo in SparkSession scaladoc > - > > Key: SPARK-17953 > URL: https://issues.apache.org/jira/browse/SPARK-17953 > Project: Spark > Issue Type: Documentation > Components: Documentation >Reporter: Tae Jun Kim >Assignee: Apache Spark >Priority: Minor > > At SparkSession builder example, > {code} > SparkSession.builder() > .master("local") > .appName("Word Count") > .config("spark.some.config.option", "some-value"). > .getOrCreate() > {code} > should be > {code} > SparkSession.builder() > .master("local") > .appName("Word Count") > .config("spark.some.config.option", "some-value") > .getOrCreate() > {code} > There was one dot at the end of the third line > {code} > .config("spark.some.config.option", "some-value"). > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17953) Fix typo in SparkSession scaladoc
[ https://issues.apache.org/jira/browse/SPARK-17953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15577582#comment-15577582 ] Apache Spark commented on SPARK-17953: -- User 'tae-jun' has created a pull request for this issue: https://github.com/apache/spark/pull/15498 > Fix typo in SparkSession scaladoc > - > > Key: SPARK-17953 > URL: https://issues.apache.org/jira/browse/SPARK-17953 > Project: Spark > Issue Type: Documentation > Components: Documentation >Reporter: Tae Jun Kim >Priority: Minor > > At SparkSession builder example, > {code} > SparkSession.builder() > .master("local") > .appName("Word Count") > .config("spark.some.config.option", "some-value"). > .getOrCreate() > {code} > should be > {code} > SparkSession.builder() > .master("local") > .appName("Word Count") > .config("spark.some.config.option", "some-value") > .getOrCreate() > {code} > There was one dot at the end of the third line > {code} > .config("spark.some.config.option", "some-value"). > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17953) Fix typo in SparkSession scaladoc
[ https://issues.apache.org/jira/browse/SPARK-17953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17953: Assignee: (was: Apache Spark) > Fix typo in SparkSession scaladoc > - > > Key: SPARK-17953 > URL: https://issues.apache.org/jira/browse/SPARK-17953 > Project: Spark > Issue Type: Documentation > Components: Documentation >Reporter: Tae Jun Kim >Priority: Minor > > At SparkSession builder example, > {code} > SparkSession.builder() > .master("local") > .appName("Word Count") > .config("spark.some.config.option", "some-value"). > .getOrCreate() > {code} > should be > {code} > SparkSession.builder() > .master("local") > .appName("Word Count") > .config("spark.some.config.option", "some-value") > .getOrCreate() > {code} > There was one dot at the end of the third line > {code} > .config("spark.some.config.option", "some-value"). > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17952) Java SparkSession createDataFrame method throws exception with nested Javabean
[ https://issues.apache.org/jira/browse/SPARK-17952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amit Baghel updated SPARK-17952: Summary: Java SparkSession createDataFrame method throws exception with nested Javabean (was: Java SparkSession createDataFrame doesn't work with nested Javabean) > Java SparkSession createDataFrame method throws exception with nested Javabean > -- > > Key: SPARK-17952 > URL: https://issues.apache.org/jira/browse/SPARK-17952 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0, 2.0.1 >Reporter: Amit Baghel > > As per latest spark documentation for Java at > http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection, > createDataFrame method supports nested JavaBeans. However this is not > working. Please see below code. > SubCategory class > {code} > public class SubCategory implements Serializable{ > private String id; > private String name; > > public String getId() { > return id; > } > public void setId(String id) { > this.id = id; > } > public String getName() { > return name; > } > public void setName(String name) { > this.name = name; > } > } > {code} > Category class > {code} > public class Category implements Serializable{ > private String id; > private SubCategory subCategory; > > public String getId() { > return id; > } > public void setId(String id) { > this.id = id; > } > public SubCategory getSubCategory() { > return subCategory; > } > public void setSubCategory(SubCategory subCategory) { > this.subCategory = subCategory; > } > } > {code} > SparkSample class > {code} > public class SparkSample { > public static void main(String[] args) throws IOException { > > SparkSession spark = SparkSession > .builder() > .appName("SparkSample") > .master("local") > .getOrCreate(); > //SubCategory > SubCategory sub = new SubCategory(); > sub.setId("sc-111"); > sub.setName("Sub-1"); > //Category > Category category = new Category(); > category.setId("s-111"); > 
category.setSubCategory(sub); > //categoryList > List categoryList = new ArrayList(); > categoryList.add(category); >//DF > Dataset dframe = spark.createDataFrame(categoryList, > Category.class); > dframe.show(); > } > } > {code} > Above code throws below error. > {code} > Exception in thread "main" scala.MatchError: com.sample.SubCategory@e7391d > (of class com.sample.SubCategory) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:256) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:251) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:403) > at > org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106) > at > org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1106) > at > org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1104) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$class.toStream(Iterator.scala:1322) > at scala.collection.AbstractIterator.toStream(Iterator.scala:1336) > at > 
scala.collection.TraversableOnce$class.toSeq(TraversableOn
[jira] [Updated] (SPARK-17952) Java SparkSession createDataFrame method throws exception for nested JavaBeans
[ https://issues.apache.org/jira/browse/SPARK-17952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amit Baghel updated SPARK-17952: Summary: Java SparkSession createDataFrame method throws exception for nested JavaBeans (was: Java SparkSession createDataFrame method throws exception with nested Javabean) > Java SparkSession createDataFrame method throws exception for nested JavaBeans > -- > > Key: SPARK-17952 > URL: https://issues.apache.org/jira/browse/SPARK-17952 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0, 2.0.1 >Reporter: Amit Baghel > > As per latest spark documentation for Java at > http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection, > createDataFrame method supports nested JavaBeans. However this is not > working. Please see below code. > SubCategory class > {code} > public class SubCategory implements Serializable{ > private String id; > private String name; > > public String getId() { > return id; > } > public void setId(String id) { > this.id = id; > } > public String getName() { > return name; > } > public void setName(String name) { > this.name = name; > } > } > {code} > Category class > {code} > public class Category implements Serializable{ > private String id; > private SubCategory subCategory; > > public String getId() { > return id; > } > public void setId(String id) { > this.id = id; > } > public SubCategory getSubCategory() { > return subCategory; > } > public void setSubCategory(SubCategory subCategory) { > this.subCategory = subCategory; > } > } > {code} > SparkSample class > {code} > public class SparkSample { > public static void main(String[] args) throws IOException { > > SparkSession spark = SparkSession > .builder() > .appName("SparkSample") > .master("local") > .getOrCreate(); > //SubCategory > SubCategory sub = new SubCategory(); > sub.setId("sc-111"); > sub.setName("Sub-1"); > //Category > Category category = new Category(); > 
category.setId("s-111"); > category.setSubCategory(sub); > //categoryList > List categoryList = new ArrayList(); > categoryList.add(category); >//DF > Dataset dframe = spark.createDataFrame(categoryList, > Category.class); > dframe.show(); > } > } > {code} > Above code throws below error. > {code} > Exception in thread "main" scala.MatchError: com.sample.SubCategory@e7391d > (of class com.sample.SubCategory) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:256) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:251) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103) > at > org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:403) > at > org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106) > at > org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1106) > at > org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1104) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$class.toStream(Iterator.scala:1322) > at 
scala.collection.AbstractIterator.toStream(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toSeq(Tr
[jira] [Updated] (SPARK-17952) Java SparkSession createDataFrame doesn't work with nested Javabean
[ https://issues.apache.org/jira/browse/SPARK-17952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amit Baghel updated SPARK-17952: Description: As per latest spark documentation for Java at http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection, createDataFrame method supports nested JavaBeans. However this is not working. Please see below code. SubCategory class {code} public class SubCategory implements Serializable{ private String id; private String name; public String getId() { return id; } public void setId(String id) { this.id = id; } public String getName() { return name; } public void setName(String name) { this.name = name; } } {code} Category class {code} public class Category implements Serializable{ private String id; private SubCategory subCategory; public String getId() { return id; } public void setId(String id) { this.id = id; } public SubCategory getSubCategory() { return subCategory; } public void setSubCategory(SubCategory subCategory) { this.subCategory = subCategory; } } {code} SparkSample class {code} public class SparkSample { public static void main(String[] args) throws IOException { SparkSession spark = SparkSession .builder() .appName("SparkSample") .master("local") .getOrCreate(); //SubCategory SubCategory sub = new SubCategory(); sub.setId("sc-111"); sub.setName("Sub-1"); //Category Category category = new Category(); category.setId("s-111"); category.setSubCategory(sub); //categoryList List categoryList = new ArrayList(); categoryList.add(category); //DF Dataset dframe = spark.createDataFrame(categoryList, Category.class); dframe.show(); } } {code} Above code throws below error. 
{code} Exception in thread "main" scala.MatchError: com.sample.SubCategory@e7391d (of class com.sample.SubCategory) at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:256) at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:251) at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103) at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:403) at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106) at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1106) at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1104) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$class.toStream(Iterator.scala:1322) at scala.collection.AbstractIterator.toStream(Iterator.scala:1336) at scala.collection.TraversableOnce$class.toSeq(TraversableOnce.scala:298) at scala.collection.AbstractIterator.toSeq(Iterator.scala:1336) at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:373) at com.sample.SparkSample.main(SparkSample.java:33) {code} createDataFrame method throws above exception. 
But I observed that createDataset method works fine with below code. {code} Encoder encoder = Encoders.bean(Category.class); Dataset dframe = spark.createDataset(categoryList, encoder); dframe.show(); {code} was: As per latest spark documentation at http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-
[jira] [Updated] (SPARK-17952) Java SparkSession createDataFrame doesn't work with nested Javabean
[ https://issues.apache.org/jira/browse/SPARK-17952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amit Baghel updated SPARK-17952: Description: As per latest spark documentation at http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection, createDataFrame method supports nested JavaBeans. However this is not working. Please see below code. SubCategory class {code} public class SubCategory implements Serializable{ private String id; private String name; public String getId() { return id; } public void setId(String id) { this.id = id; } public String getName() { return name; } public void setName(String name) { this.name = name; } } {code} Category class {code} public class Category implements Serializable{ private String id; private SubCategory subCategory; public String getId() { return id; } public void setId(String id) { this.id = id; } public SubCategory getSubCategory() { return subCategory; } public void setSubCategory(SubCategory subCategory) { this.subCategory = subCategory; } } {code} SparkSample class {code} public class SparkSample { public static void main(String[] args) throws IOException { SparkSession spark = SparkSession .builder() .appName("SparkSample") .master("local") .getOrCreate(); //SubCategory SubCategory sub = new SubCategory(); sub.setId("sc-111"); sub.setName("Sub-1"); //Category Category category = new Category(); category.setId("s-111"); category.setSubCategory(sub); //categoryList List categoryList = new ArrayList(); categoryList.add(category); //DF Dataset dframe = spark.createDataFrame(categoryList, Category.class); dframe.show(); } } {code} Above code throws below error. 
{code} Exception in thread "main" scala.MatchError: com.sample.SubCategory@e7391d (of class com.sample.SubCategory) at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:256) at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:251) at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103) at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:403) at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106) at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1106) at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1104) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$class.toStream(Iterator.scala:1322) at scala.collection.AbstractIterator.toStream(Iterator.scala:1336) at scala.collection.TraversableOnce$class.toSeq(TraversableOnce.scala:298) at scala.collection.AbstractIterator.toSeq(Iterator.scala:1336) at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:373) at com.sample.SparkSample.main(SparkSample.java:33) {code} createDataFrame method throws above exception. 
But I observed that createDataset method works fine with below code. {code} Encoder encoder = Encoders.bean(Category.class); Dataset dframe = spark.createDataset(categoryList, encoder); dframe.show(); {code} was: As per latest spark documentation at http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schem
[jira] [Updated] (SPARK-17952) Java SparkSession createDataFrame doesn't work with nested Javabean
[ https://issues.apache.org/jira/browse/SPARK-17952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amit Baghel updated SPARK-17952: Description: As per latest spark documentation at http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection, createDataFrame method supports nested JavaBeans. However this is not working. Please see below code. SubCategory class {code} public class SubCategory implements Serializable{ private String id; private String name; public String getId() { return id; } public void setId(String id) { this.id = id; } public String getName() { return name; } public void setName(String name) { this.name = name; } } {code} Category class {code} public class Category implements Serializable{ private String id; private SubCategory subCategory; public String getId() { return id; } public void setId(String id) { this.id = id; } public SubCategory getSubCategory() { return subCategory; } public void setSubCategory(SubCategory subCategory) { this.subCategory = subCategory; } } {code} SparkSample {code} public class SparkSample { public static void main(String[] args) throws IOException { SparkSession spark = SparkSession .builder() .appName("SparkSample") .master("local") .getOrCreate(); //SubCategory SubCategory sub = new SubCategory(); sub.setId("sc-111"); sub.setName("Sub-1"); //Category Category category = new Category(); category.setId("s-111"); category.setSubCategory(sub); //categoryList List categoryList = new ArrayList(); categoryList.add(category); //DF Dataset dframe = spark.createDataFrame(categoryList, Category.class); dframe.show(); } } {code} Above code throws below error. 
{code} Exception in thread "main" scala.MatchError: com.sample.SubCategory@e7391d (of class com.sample.SubCategory) at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:256) at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:251) at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103) at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:403) at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106) at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1106) at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1104) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$class.toStream(Iterator.scala:1322) at scala.collection.AbstractIterator.toStream(Iterator.scala:1336) at scala.collection.TraversableOnce$class.toSeq(TraversableOnce.scala:298) at scala.collection.AbstractIterator.toSeq(Iterator.scala:1336) at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:373) at com.sample.SparkSample.main(SparkSample.java:33) {code} createDataFrame method throws above exception. 
But I observed that createDataset method works fine with below code. {code} Encoder encoder = Encoders.bean(Category.class); Dataset dframe = spark.createDataset(categoryList, encoder); dframe.show(); {code} was: As per latest spark documentation at http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-usi
[jira] [Updated] (SPARK-17952) Java SparkSession createDataFrame doesn't work with nested Javabean
[ https://issues.apache.org/jira/browse/SPARK-17952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amit Baghel updated SPARK-17952: Description: As per latest spark documentation at http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection, createDataFrame method supports nested JavaBeans. However this is not working. Please see below code. SubCategory class {code} public class SubCategory implements Serializable{ private String id; private String name; public String getId() { return id; } public void setId(String id) { this.id = id; } public String getName() { return name; } public void setName(String name) { this.name = name; } } {code} Category class {code} public class Category implements Serializable{ private String id; private SubCategory subCategory; public String getId() { return id; } public void setId(String id) { this.id = id; } public SubCategory getSubCategory() { return subCategory; } public void setSubCategory(SubCategory subCategory) { this.subCategory = subCategory; } } {code} SparkSample {code} public class SparkSample { public static void main(String[] args) throws IOException { SparkSession spark = SparkSession .builder() .appName("SparkSample") .master("local") .getOrCreate(); //SubCategory SubCategory sub = new SubCategory(); sub.setId("sc-111"); sub.setName("Sub-1"); //Category Category category = new Category(); category.setId("s-111"); category.setSubCategory(sub); //categoryList List categoryList = new ArrayList(); categoryList.add(category); //DF Dataset dframe = spark.createDataFrame(categoryList, Category.class); dframe.show(); } } {code} Above code throws below error. 
{code} Exception in thread "main" scala.MatchError: com.sample.SubCategory@e7391d (of class com.sample.SubCategory) at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:256) at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:251) at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103) at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:403) at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106) at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1106) at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1104) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$class.toStream(Iterator.scala:1322) at scala.collection.AbstractIterator.toStream(Iterator.scala:1336) at scala.collection.TraversableOnce$class.toSeq(TraversableOnce.scala:298) at scala.collection.AbstractIterator.toSeq(Iterator.scala:1336) at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:373) at com.sample.SparkSample.main(SparkSample.java:33) {code} {code}createDataFrame{code} method throws above exception. 
But I observed that {code}createDataset{code} method works fine with below code. {code} Encoder encoder = Encoders.bean(Category.class); Dataset dframe = spark.createDataset(categoryList, encoder); dframe.show(); {code} was: As per latest spark documentation at http://spark.apache.org/docs/latest/sql-programming-guide.html#
[jira] [Created] (SPARK-17953) Fix typo in SparkSession scaladoc
Tae Jun Kim created SPARK-17953: --- Summary: Fix typo in SparkSession scaladoc Key: SPARK-17953 URL: https://issues.apache.org/jira/browse/SPARK-17953 Project: Spark Issue Type: Documentation Components: Documentation Reporter: Tae Jun Kim Priority: Minor

In the SparkSession builder example,
{code}
SparkSession.builder()
  .master("local")
  .appName("Word Count")
  .config("spark.some.config.option", "some-value").
  .getOrCreate()
{code}
should be
{code}
SparkSession.builder()
  .master("local")
  .appName("Word Count")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()
{code}
There is a stray dot at the end of the {code}.config(...){code} line:
{code}
.config("spark.some.config.option", "some-value").
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17952) Java SparkSession createDataFrame doesn't work with nested Javabean
[ https://issues.apache.org/jira/browse/SPARK-17952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amit Baghel updated SPARK-17952: Description: As per the latest Spark documentation at http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection, the createDataFrame method supports nested JavaBeans. However, this does not work. Please see the code below.
{code}
public class SubCategory implements Serializable {
    private String id;
    private String name;

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
}
{code}
{code}
public class Category implements Serializable {
    private String id;
    private SubCategory subCategory;

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public SubCategory getSubCategory() { return subCategory; }
    public void setSubCategory(SubCategory subCategory) { this.subCategory = subCategory; }
}
{code}
{code}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSample {
    public static void main(String[] args) throws IOException {
        SparkSession spark = SparkSession
            .builder()
            .appName("SparkSample")
            .master("local")
            .getOrCreate();
        // SubCategory
        SubCategory sub = new SubCategory();
        sub.setId("sc-111");
        sub.setName("Sub-1");
        // Category
        Category category = new Category();
        category.setId("s-111");
        category.setSubCategory(sub);
        // categoryList
        List<Category> categoryList = new ArrayList<>();
        categoryList.add(category);
        // DF
        Dataset<Row> dframe = spark.createDataFrame(categoryList, Category.class);
        dframe.show();
    }
}
{code}
The code above throws the error below.
{code}
Exception in thread "main" scala.MatchError: com.sample.SubCategory@e7391d (of class com.sample.SubCategory)
	at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:256)
	at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:251)
	at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
	at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:403)
	at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106)
	at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
	at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1106)
	at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1104)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$class.toStream(Iterator.scala:1322)
	at scala.collection.AbstractIterator.toStream(Iterator.scala:1336)
	at scala.collection.TraversableOnce$class.toSeq(TraversableOnce.scala:298)
	at scala.collection.AbstractIterator.toSeq(Iterator.scala:1336)
	at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:373)
	at com.sample.SparkSample.main(SparkSample.java:33)
{code}
The createDataFrame method throws the above error.
However, I observed that the createDataset method works fine with the code below.
{code}
Encoder<Category> encoder = Encoders.bean(Category.class);
Dataset<Category> dframe = spark.createDataset(categoryList, encoder);
dframe.show();
{code}
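The report hinges on Spark's bean reflection seeing {code}subCategory{code} as a single struct-typed property of {code}Category{code}. That shape can be checked with plain JDK bean introspection, without Spark on the classpath. The sketch below is illustrative only: the {code}BeanCheck{code} class and its {code}beanProperties{code} helper are hypothetical names introduced here, and the nested classes are stand-ins for the reporter's {code}Category{code}/{code}SubCategory{code} beans.

```java
import java.beans.BeanInfo;
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the Category/SubCategory beans from the report.
public class BeanCheck {
    public static class SubCategory implements java.io.Serializable {
        private String id;
        private String name;
        public String getId() { return id; }
        public void setId(String id) { this.id = id; }
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
    }

    public static class Category implements java.io.Serializable {
        private String id;
        private SubCategory subCategory;
        public String getId() { return id; }
        public void setId(String id) { this.id = id; }
        public SubCategory getSubCategory() { return subCategory; }
        public void setSubCategory(SubCategory subCategory) { this.subCategory = subCategory; }
    }

    // List the "name:Type" pairs that JavaBeans introspection reports for a class,
    // excluding properties inherited from Object (e.g. getClass()).
    public static List<String> beanProperties(Class<?> cls) throws IntrospectionException {
        List<String> names = new ArrayList<>();
        BeanInfo info = Introspector.getBeanInfo(cls, Object.class);
        for (PropertyDescriptor pd : info.getPropertyDescriptors()) {
            names.add(pd.getName() + ":" + pd.getPropertyType().getSimpleName());
        }
        return names;
    }

    public static void main(String[] args) throws IntrospectionException {
        // Category exposes a nested bean-typed property, which is exactly the
        // case the documentation says createDataFrame should handle.
        System.out.println(beanProperties(BeanCheck.Category.class));
    }
}
```

Both getters/setters follow the JavaBeans naming convention, so introspection reports an {code}id{code} property of type String and a {code}subCategory{code} property of type SubCategory on Category; the MatchError therefore points at the converter for nested bean-typed properties rather than at the bean definitions themselves.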
[jira] [Created] (SPARK-17952) Java SparkSession createDataFrame doesn't work with nested Javabean
Amit Baghel created SPARK-17952: --- Summary: Java SparkSession createDataFrame doesn't work with nested Javabean Key: SPARK-17952 URL: https://issues.apache.org/jira/browse/SPARK-17952 Project: Spark Issue Type: Bug Affects Versions: 2.0.1, 2.0.0 Reporter: Amit Baghel

As per the latest Spark documentation at http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection, the createDataFrame method supports nested JavaBeans. However, this does not work. Please see the code below.
{code}
public class SubCategory implements Serializable {
    private String id;
    private String name;

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
}
{code}
{code}
public class Category implements Serializable {
    private String id;
    private SubCategory subCategory;

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public SubCategory getSubCategory() { return subCategory; }
    public void setSubCategory(SubCategory subCategory) { this.subCategory = subCategory; }
}
{code}
{code}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSample {
    public static void main(String[] args) throws IOException {
        SparkSession spark = SparkSession
            .builder()
            .appName("SparkSample")
            .master("local")
            .getOrCreate();
        // SubCategory
        SubCategory sub = new SubCategory();
        sub.setId("sc-111");
        sub.setName("Sub-1");
        // Category
        Category category = new Category();
        category.setId("s-111");
        category.setSubCategory(sub);
        // categoryList
        List<Category> categoryList = new ArrayList<>();
        categoryList.add(category);
        // DF
        Dataset<Row> dframe = spark.createDataFrame(categoryList, Category.class);
        dframe.show();
    }
}
{code}
The code above throws the error below.
{code}
Exception in thread "main" scala.MatchError: com.sample.SubCategory@e7391d (of class com.sample.SubCategory)
	at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:256)
	at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:251)
	at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
	at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:403)
	at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106)
	at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
	at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1106)
	at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1104)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$class.toStream(Iterator.scala:1322)
	at scala.collection.AbstractIterator.toStream(Iterator.scala:1336)
	at scala.collection.TraversableOnce$class.toSeq(TraversableOnce.scala:298)
	at scala.collection.AbstractIterator.toSeq(Iterator.scala:1336)
	at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:373)
	at com.sample.SparkSample.main(SparkSample.java:33)
{code}
The createDataFrame method throws the above error.
However, I observed that the createDataset method works fine with the code below.
{code}
Encoder<Category> encoder = Encoders.bean(Category.class);
Dataset<Category> dframe = spark.createDataset(categoryList, encoder);
dframe.show();
{code}