[jira] [Created] (SPARK-16704) Union does not work for column with array byte
Ng Jiunn Jye created SPARK-16704: Summary: Union does not work for column with array byte Key: SPARK-16704 URL: https://issues.apache.org/jira/browse/SPARK-16704 Project: Spark Issue Type: Bug Reporter: Ng Jiunn Jye When unioning two queries whose columns have an array-of-bytes (binary) datatype, the Spark query fails with an exception. Example: select binaryColumn from tableA union select binaryColumn from tableB Note that the Spark property "spark.sql.parquet.binaryAsString" is set to true. org.apache.spark.sql.AnalysisException: unresolved operator 'Union; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) ~[iop-spark-client.spark-catalyst_2.11-1.6.0.jar:1.6.0] at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) ~[iop-spark-client.spark-catalyst_2.11-1.6.0.jar:1.6.0] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:203) ~[iop-spark-client.spark-catalyst_2.11-1.6.0.jar:1.6.0] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50) ~[iop-spark-client.spark-catalyst_2.11-1.6.0.jar:1.6.0] at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:105) ~[iop-spark-client.spark-catalyst_2.11-1.6.0.jar:1.6.0] at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:104) ~[iop-spark-client.spark-catalyst_2.11-1.6.0.jar:1.6.0] at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:104) ~[iop-spark-client.spark-catalyst_2.11-1.6.0.jar:1.6.0] at scala.collection.immutable.List.foreach(List.scala:381) ~[org.scala-lang.scala-library-2.11.8.jar:na] at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:104) ~[iop-spark-client.spark-catalyst_2.11-1.6.0.jar:1.6.0] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50) ~[iop-spark-client.spark-catalyst_2.11-1.6.0.jar:1.6.0] at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) ~[iop-spark-client.spark-catalyst_2.11-1.6.0.jar:1.6.0] at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34) ~[iop-spark-client.spark-sql_2.11-1.6.0.jar:1.6.0] at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133) ~[iop-spark-client.spark-sql_2.11-1.6.0.jar:1.6.0] at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52) ~[iop-spark-client.spark-sql_2.11-1.6.0.jar:1.6.0] at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817) ~[iop-spark-client.spark-sql_2.11-1.6.0.jar:1.6.0] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
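For readers who want to reproduce the report above, here is a minimal sketch. It is an approximation only: it assumes Spark 1.6.x with {{sqlContext}} in scope and "spark.sql.parquet.binaryAsString" set to true as in the report, and it writes two small Parquet tables under /tmp as hypothetical stand-ins for the reporter's {{tableA}} and {{tableB}}.

{code}
// Approximate reproduction sketch for SPARK-16704 (assumes Spark 1.6.x, a SQLContext named
// sqlContext, and spark.sql.parquet.binaryAsString=true as in the report). The table and
// column names follow the reporter's example; the generated data is hypothetical.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{BinaryType, StructField, StructType}

val schema = StructType(Seq(StructField("binaryColumn", BinaryType)))
val rows = sqlContext.sparkContext.parallelize(Seq(
  Row("a".getBytes("UTF-8")),
  Row("b".getBytes("UTF-8"))))

// Write two Parquet tables that each expose a single binary column.
sqlContext.createDataFrame(rows, schema).write.parquet("/tmp/spark16704/tableA")
sqlContext.createDataFrame(rows, schema).write.parquet("/tmp/spark16704/tableB")
sqlContext.read.parquet("/tmp/spark16704/tableA").registerTempTable("tableA")
sqlContext.read.parquet("/tmp/spark16704/tableB").registerTempTable("tableB")

// The reporter's equivalent UNION query fails with:
//   org.apache.spark.sql.AnalysisException: unresolved operator 'Union
sqlContext.sql("select binaryColumn from tableA union select binaryColumn from tableB").show()
{code}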
[jira] [Commented] (SPARK-16702) Driver hangs after executors are lost
[ https://issues.apache.org/jira/browse/SPARK-16702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391409#comment-15391409 ] Sean Owen commented on SPARK-16702: --- It all sounds plausible. Have a look at https://issues.apache.org/jira/browse/SPARK-12419 and https://issues.apache.org/jira/browse/SPARK-16533 which may not be exactly the same thing but worth comparing. Anything that simplifies or untangles blocking callbacks would probably be a win, though it's always delicate surgery. Go ahead. > Driver hangs after executors are lost > - > > Key: SPARK-16702 > URL: https://issues.apache.org/jira/browse/SPARK-16702 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Angus Gerry > Attachments: SparkThreadsBlocked.txt > > > It's my first time, please be kind. > I'm still trying to debug this error locally - at this stage I'm pretty > convinced that it's a weird deadlock/livelock problem due to the use of > {{scheduleAtFixedRate}} within {{ExecutorAllocationManager}}. This problem is > possibly tangentially related to the issues discussed in SPARK-1560 around > the use of blocking calls within locks. > h4. Observed Behavior > When running a spark job, and executors are lost, the job occassionally goes > into a state where it makes no progress with tasks. Most commonly it seems > that the issue occurs when executors are preempted by yarn, but I'm not > confident enough to state that it's restricted to just this scenario. > Upon inspecting a thread dump from the driver, the following stack traces > seem noteworthy (a full thread dump is attached): > {noformat:title=Thread 178: spark-dynamic-executor-allocation (TIMED_WAITING)} > sun.misc.Unsafe.park(Native Method) > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226) > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1033) > java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326) > scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) > scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > scala.concurrent.Await$.result(package.scala:190) > org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101) > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77) > org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59) > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:447) > org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1423) > org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359) > org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310) > org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264) > org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223) > 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > java.lang.Thread.run(Thread.java:745) > {noformat} > {noformat:title=Thread 22: dispatcher-event-loop-10 (BLOCKED)} > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.disableExecutor(CoarseGrainedSchedulerBackend.scala:289) > org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:121) > org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:120) > scala.Option.foreach(Option.scala:257) > org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint.onDisconnected(YarnSchedulerBackend.scala:120) > org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:142) >
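To make the suspected failure mode easier to follow, below is a small self-contained Scala sketch of the general pattern discussed above: a task scheduled with {{scheduleAtFixedRate}} blocks waiting for a reply while holding a lock, and the thread that would deliver the reply needs that same lock. It illustrates the deadlock shape only; it is not Spark's ExecutorAllocationManager or scheduler backend code.

{code}
// Illustration only of the "blocking call while holding a lock" shape discussed above;
// this is not Spark's actual ExecutorAllocationManager / CoarseGrainedSchedulerBackend code.
import java.util.concurrent.{Executors, TimeUnit}
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, Promise}

object BlockingInsideLockDemo {
  private val lock = new Object
  private val reply = Promise[Int]()

  def main(args: Array[String]): Unit = {
    val scheduler = Executors.newScheduledThreadPool(1)

    // Periodic task (like the dynamic-allocation thread): takes the lock, then blocks
    // waiting for a reply, roughly analogous to a synchronous RPC with no usable timeout.
    scheduler.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = lock.synchronized {
        Await.result(reply.future, Duration.Inf) // blocks while still holding `lock`
      }
    }, 0, 100, TimeUnit.MILLISECONDS)

    Thread.sleep(200) // let the periodic task grab the lock first

    // Event-loop-style thread: must take the same lock before it can complete the reply.
    val responder = new Thread(new Runnable {
      override def run(): Unit = lock.synchronized { reply.success(42) }
    })
    responder.start()
    responder.join(2000)
    println(s"responder still blocked after 2s: ${responder.isAlive}") // prints true
    scheduler.shutdownNow()
    sys.exit(0)
  }
}
{code}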
[jira] [Assigned] (SPARK-16703) Extra space in WindowSpecDefinition SQL representation
[ https://issues.apache.org/jira/browse/SPARK-16703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16703: Assignee: Apache Spark (was: Cheng Lian) > Extra space in WindowSpecDefinition SQL representation > -- > > Key: SPARK-16703 > URL: https://issues.apache.org/jira/browse/SPARK-16703 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Apache Spark >Priority: Minor > > For a {{WindowSpecDefinition}} whose {{partitionSpec}} is empty, there's an > extra space in its SQL representation: > {code:sql} > ( ORDER BY `a` ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
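For context, a short sketch of how a window specification with an empty {{partitionSpec}} arises in user code: an ORDER BY window with no PARTITION BY. It assumes a Spark 2.0 SparkSession and a DataFrame {{df}} with a column {{a}}; it only illustrates the kind of expression whose generated SQL text is quoted above.

{code}
// Sketch (assumes a DataFrame `df` with a column `a`): a window spec that orders but does
// not partition, i.e. the empty-partitionSpec case whose SQL representation is quoted above.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Long.MinValue / 0L correspond to UNBOUNDED PRECEDING / CURRENT ROW in the frame clause.
val w = Window.orderBy("a").rowsBetween(Long.MinValue, 0L)
val withRunningCount = df.select(col("a"), count(lit(1)).over(w).as("running_count"))
withRunningCount.explain() // the window spec here carries no PARTITION BY clause
{code}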
[jira] [Commented] (SPARK-16703) Extra space in WindowSpecDefinition SQL representation
[ https://issues.apache.org/jira/browse/SPARK-16703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391370#comment-15391370 ] Apache Spark commented on SPARK-16703: -- User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/14334 > Extra space in WindowSpecDefinition SQL representation > -- > > Key: SPARK-16703 > URL: https://issues.apache.org/jira/browse/SPARK-16703 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Minor > > For a {{WindowSpecDefinition}} whose {{partitionSpec}} is empty, there's an > extra space in its SQL representation: > {code:sql} > ( ORDER BY `a` ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16703) Extra space in WindowSpecDefinition SQL representation
[ https://issues.apache.org/jira/browse/SPARK-16703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16703: Assignee: Cheng Lian (was: Apache Spark) > Extra space in WindowSpecDefinition SQL representation > -- > > Key: SPARK-16703 > URL: https://issues.apache.org/jira/browse/SPARK-16703 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Minor > > For a {{WindowSpecDefinition}} whose {{partitionSpec}} is empty, there's an > extra space in its SQL representation: > {code:sql} > ( ORDER BY `a` ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16703) Extra space in WindowSpecDefinition SQL representation
[ https://issues.apache.org/jira/browse/SPARK-16703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-16703: --- Description: For a {{WindowSpecDefinition}} whose {{partitionSpec}} is empty, there's an extra space in its SQL representation: {code:sql} {code} was: For a {{WindowSpecDefinition}} whose {{partitionSpec}} is empty, there's an extra space in its SQL representation: {code:sql} {code} > Extra space in WindowSpecDefinition SQL representation > -- > > Key: SPARK-16703 > URL: https://issues.apache.org/jira/browse/SPARK-16703 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Minor > > For a {{WindowSpecDefinition}} whose {{partitionSpec}} is empty, there's an > extra space in its SQL representation: > {code:sql} > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16703) Extra space in WindowSpecDefinition SQL representation
[ https://issues.apache.org/jira/browse/SPARK-16703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-16703: --- Description: For a {{WindowSpecDefinition}} whose {{partitionSpec}} is empty, there's an extra space in its SQL representation: {code:sql} ( ORDER BY `a` ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) {code} was: For a {{WindowSpecDefinition}} whose {{partitionSpec}} is empty, there's an extra space in its SQL representation: {code:sql} {code} > Extra space in WindowSpecDefinition SQL representation > -- > > Key: SPARK-16703 > URL: https://issues.apache.org/jira/browse/SPARK-16703 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Minor > > For a {{WindowSpecDefinition}} whose {{partitionSpec}} is empty, there's an > extra space in its SQL representation: > {code:sql} > ( ORDER BY `a` ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16703) Extra space in WindowSpecDefinition SQL representation
Cheng Lian created SPARK-16703: -- Summary: Extra space in WindowSpecDefinition SQL representation Key: SPARK-16703 URL: https://issues.apache.org/jira/browse/SPARK-16703 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Minor For a {{WindowSpecDefinition}} whose {{partitionSpec}} is empty, there's an extra space in its SQL representation: {code:sql} {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16685) audit release docs are ambiguous
[ https://issues.apache.org/jira/browse/SPARK-16685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16685: Assignee: Apache Spark > audit release docs are ambiguous > > > Key: SPARK-16685 > URL: https://issues.apache.org/jira/browse/SPARK-16685 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 1.6.2 >Reporter: jay vyas >Assignee: Apache Spark >Priority: Minor > > The dev/audit-release tooling is ambiguous. > - should it run against a real cluster? if so when? > - what should be in the release repo? Just jars? tarballs? ( i assume jars > because its a .ivy, but not sure). > - > https://github.com/apache/spark/tree/master/dev/audit-release -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16685) audit release docs are ambiguous
[ https://issues.apache.org/jira/browse/SPARK-16685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391361#comment-15391361 ] Apache Spark commented on SPARK-16685: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/14342 > audit release docs are ambiguous > > > Key: SPARK-16685 > URL: https://issues.apache.org/jira/browse/SPARK-16685 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 1.6.2 >Reporter: jay vyas >Priority: Minor > > The dev/audit-release tooling is ambiguous. > - should it run against a real cluster? if so when? > - what should be in the release repo? Just jars? tarballs? ( i assume jars > because its a .ivy, but not sure). > - > https://github.com/apache/spark/tree/master/dev/audit-release -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16685) audit release docs are ambiguous
[ https://issues.apache.org/jira/browse/SPARK-16685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16685: Assignee: (was: Apache Spark) > audit release docs are ambiguous > > > Key: SPARK-16685 > URL: https://issues.apache.org/jira/browse/SPARK-16685 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 1.6.2 >Reporter: jay vyas >Priority: Minor > > The dev/audit-release tooling is ambiguous. > - should it run against a real cluster? if so when? > - what should be in the release repo? Just jars? tarballs? ( i assume jars > because its a .ivy, but not sure). > - > https://github.com/apache/spark/tree/master/dev/audit-release -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16699) Fix performance bug in hash aggregate on long string keys
[ https://issues.apache.org/jira/browse/SPARK-16699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-16699: Description: In the following code in `VectorizedHashMapGenerator.scala`: {code} def hashBytes(b: String): String = { val hash = ctx.freshName("hash") s""" |int $result = 0; |for (int i = 0; i < $b.length; i++) { | ${genComputeHash(ctx, s"$b[i]", ByteType, hash)} | $result = ($result ^ (0x9e3779b9)) + $hash + ($result << 6) + ($result >>> 2); |} """.stripMargin } {code} when b=input.getBytes(), the current 2.0 code results in getBytes() being called n times, n being length of input. getBytes() involves memory copy is thus expensive and causes a performance degradation. Fix is to evaluate getBytes() before the for loop. was: In the following code in `VectorizedHashMapGenerator.scala`: ``` def hashBytes(b: String): String = { val hash = ctx.freshName("hash") s""" |int $result = 0; |for (int i = 0; i < $b.length; i++) { | ${genComputeHash(ctx, s"$b[i]", ByteType, hash)} | $result = ($result ^ (0x9e3779b9)) + $hash + ($result << 6) + ($result >>> 2); |} """.stripMargin } ``` when b=input.getBytes(), the current 2.0 code results in getBytes() being called n times, n being length of input. getBytes() involves memory copy is thus expensive and causes a performance degradation. Fix is to evaluate getBytes() before the for loop. > Fix performance bug in hash aggregate on long string keys > - > > Key: SPARK-16699 > URL: https://issues.apache.org/jira/browse/SPARK-16699 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Qifan Pu >Assignee: Qifan Pu > Fix For: 2.0.1, 2.1.0 > > > In the following code in `VectorizedHashMapGenerator.scala`: > {code} > def hashBytes(b: String): String = { > val hash = ctx.freshName("hash") > s""" > |int $result = 0; > |for (int i = 0; i < $b.length; i++) { > | ${genComputeHash(ctx, s"$b[i]", ByteType, hash)} > | $result = ($result ^ (0x9e3779b9)) + $hash + ($result << 6) + > ($result >>> 2); > |} >""".stripMargin > } > {code} > when b=input.getBytes(), the current 2.0 code results in getBytes() being > called n times, n being length of input. getBytes() involves memory copy is > thus expensive and causes a performance degradation. > Fix is to evaluate getBytes() before the for loop. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
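A hedged sketch of the fix described above: bind the byte array to a local variable in the generated code so that the expression behind {{b}} (for example {{input.getBytes()}}) is evaluated once rather than on every loop iteration. The variable handling here is illustrative and may differ from the merged patch.

{code}
// Illustrative only (may differ from the merged patch): hoist the byte-array expression out
// of the generated loop so it is evaluated once instead of once per iteration.
def hashBytes(b: String): String = {
  val hash = ctx.freshName("hash")
  val bytes = ctx.freshName("bytes")
  s"""
     |int $result = 0;
     |byte[] $bytes = $b;
     |for (int i = 0; i < $bytes.length; i++) {
     |  ${genComputeHash(ctx, s"$bytes[i]", ByteType, hash)}
     |  $result = ($result ^ (0x9e3779b9)) + $hash + ($result << 6) + ($result >>> 2);
     |}
   """.stripMargin
}
{code}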
[jira] [Resolved] (SPARK-16699) Fix performance bug in hash aggregate on long string keys
[ https://issues.apache.org/jira/browse/SPARK-16699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-16699. - Resolution: Fixed Assignee: Qifan Pu Fix Version/s: (was: 2.0.0) 2.1.0 2.0.1 > Fix performance bug in hash aggregate on long string keys > - > > Key: SPARK-16699 > URL: https://issues.apache.org/jira/browse/SPARK-16699 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Qifan Pu >Assignee: Qifan Pu > Fix For: 2.0.1, 2.1.0 > > > In the following code in `VectorizedHashMapGenerator.scala`: > ``` > def hashBytes(b: String): String = { > val hash = ctx.freshName("hash") > s""" > |int $result = 0; > |for (int i = 0; i < $b.length; i++) { > | ${genComputeHash(ctx, s"$b[i]", ByteType, hash)} > | $result = ($result ^ (0x9e3779b9)) + $hash + ($result << 6) + > ($result >>> 2); > |} >""".stripMargin > } > ``` > when b=input.getBytes(), the current 2.0 code results in getBytes() being > called n times, n being length of input. getBytes() involves memory copy is > thus expensive and causes a performance degradation. > Fix is to evaluate getBytes() before the for loop. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16702) Driver hangs after executors are lost
[ https://issues.apache.org/jira/browse/SPARK-16702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Angus Gerry updated SPARK-16702: Attachment: SparkThreadsBlocked.txt > Driver hangs after executors are lost > - > > Key: SPARK-16702 > URL: https://issues.apache.org/jira/browse/SPARK-16702 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Angus Gerry > Attachments: SparkThreadsBlocked.txt > > > It's my first time, please be kind. > I'm still trying to debug this error locally - at this stage I'm pretty > convinced that it's a weird deadlock/livelock problem due to the use of > {{scheduleAtFixedRate}} within {{ExecutorAllocationManager}}. This problem is > possibly tangentially related to the issues discussed in SPARK-1560 around > the use of blocking calls within locks. > h4. Observed Behavior > When running a spark job, and executors are lost, the job occassionally goes > into a state where it makes no progress with tasks. Most commonly it seems > that the issue occurs when executors are preempted by yarn, but I'm not > confident enough to state that it's restricted to just this scenario. > Upon inspecting a thread dump from the driver, the following stack traces > seem noteworthy (a full thread dump is attached): > {noformat:title=Thread 178: spark-dynamic-executor-allocation (TIMED_WAITING)} > sun.misc.Unsafe.park(Native Method) > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226) > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1033) > java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326) > scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) > scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > scala.concurrent.Await$.result(package.scala:190) > org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101) > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77) > org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59) > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:447) > org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1423) > org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359) > org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310) > org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264) > org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223) > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > java.lang.Thread.run(Thread.java:745) > {noformat} > {noformat:title=Thread 22: dispatcher-event-loop-10 (BLOCKED)} > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.disableExecutor(CoarseGrainedSchedulerBackend.scala:289) > org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:121) > org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:120) > scala.Option.foreach(Option.scala:257) > org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint.onDisconnected(YarnSchedulerBackend.scala:120) > org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:142) > org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204) > org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) > org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215) > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
[jira] [Created] (SPARK-16702) Driver hangs after executors are lost
Angus Gerry created SPARK-16702: --- Summary: Driver hangs after executors are lost Key: SPARK-16702 URL: https://issues.apache.org/jira/browse/SPARK-16702 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.0, 1.6.2, 1.6.1 Reporter: Angus Gerry It's my first time, please be kind. I'm still trying to debug this error locally - at this stage I'm pretty convinced that it's a weird deadlock/livelock problem due to the use of {{scheduleAtFixedRate}} within {{ExecutorAllocationManager}}. This problem is possibly tangentially related to the issues discussed in SPARK-1560 around the use of blocking calls within locks. h4. Observed Behavior When running a spark job, and executors are lost, the job occassionally goes into a state where it makes no progress with tasks. Most commonly it seems that the issue occurs when executors are preempted by yarn, but I'm not confident enough to state that it's restricted to just this scenario. Upon inspecting a thread dump from the driver, the following stack traces seem noteworthy (a full thread dump is attached): {noformat:title=Thread 178: spark-dynamic-executor-allocation (TIMED_WAITING)} sun.misc.Unsafe.park(Native Method) java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226) java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1033) java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326) scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) scala.concurrent.Await$.result(package.scala:190) org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101) org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77) org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59) org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:447) org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1423) org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359) org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310) org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264) org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223) java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) {noformat} {noformat:title=Thread 22: dispatcher-event-loop-10 (BLOCKED)} 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.disableExecutor(CoarseGrainedSchedulerBackend.scala:289) org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:121) org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:120) scala.Option.foreach(Option.scala:257) org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint.onDisconnected(YarnSchedulerBackend.scala:120) org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:142) org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204) org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) {noformat} {noformat:title=Thread 640: kill-executor-thread (BLOCKED)} org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.killExecutors(CoarseGrainedSchedulerBackend.scala:488) org.apache.spark.SparkContext.killAndReplaceExecutor(SparkContext.scala:1499) org.apache.spark.HeartbeatRec
[jira] [Commented] (SPARK-16534) Kafka 0.10 Python support
[ https://issues.apache.org/jira/browse/SPARK-16534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391274#comment-15391274 ] Apache Spark commented on SPARK-16534: -- User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/14340 > Kafka 0.10 Python support > - > > Key: SPARK-16534 > URL: https://issues.apache.org/jira/browse/SPARK-16534 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Tathagata Das > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16534) Kafka 0.10 Python support
[ https://issues.apache.org/jira/browse/SPARK-16534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16534: Assignee: (was: Apache Spark) > Kafka 0.10 Python support > - > > Key: SPARK-16534 > URL: https://issues.apache.org/jira/browse/SPARK-16534 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Tathagata Das > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16534) Kafka 0.10 Python support
[ https://issues.apache.org/jira/browse/SPARK-16534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16534: Assignee: Apache Spark > Kafka 0.10 Python support > - > > Key: SPARK-16534 > URL: https://issues.apache.org/jira/browse/SPARK-16534 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Tathagata Das >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5581) When writing sorted map output file, avoid open / close between each partition
[ https://issues.apache.org/jira/browse/SPARK-5581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5581: -- Assignee: Josh Rosen > When writing sorted map output file, avoid open / close between each partition > -- > > Key: SPARK-5581 > URL: https://issues.apache.org/jira/browse/SPARK-5581 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 1.3.0 >Reporter: Sandy Ryza >Assignee: Josh Rosen > Fix For: 2.1.0 > > > {code} > // Bypassing merge-sort; get an iterator by partition and just write > everything directly. > for ((id, elements) <- this.partitionedIterator) { > if (elements.hasNext) { > val writer = blockManager.getDiskWriter( > blockId, outputFile, ser, fileBufferSize, > context.taskMetrics.shuffleWriteMetrics.get) > for (elem <- elements) { > writer.write(elem) > } > writer.commitAndClose() > val segment = writer.fileSegment() > lengths(id) = segment.length > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5581) When writing sorted map output file, avoid open / close between each partition
[ https://issues.apache.org/jira/browse/SPARK-5581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5581: -- Assignee: Brian Cho (was: Josh Rosen) > When writing sorted map output file, avoid open / close between each partition > -- > > Key: SPARK-5581 > URL: https://issues.apache.org/jira/browse/SPARK-5581 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 1.3.0 >Reporter: Sandy Ryza >Assignee: Brian Cho > Fix For: 2.1.0 > > > {code} > // Bypassing merge-sort; get an iterator by partition and just write > everything directly. > for ((id, elements) <- this.partitionedIterator) { > if (elements.hasNext) { > val writer = blockManager.getDiskWriter( > blockId, outputFile, ser, fileBufferSize, > context.taskMetrics.shuffleWriteMetrics.get) > for (elem <- elements) { > writer.write(elem) > } > writer.commitAndClose() > val segment = writer.fileSegment() > lengths(id) = segment.length > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16603) Spark2.0 fail in executing the sql statement which field name begins with number,like "d.30_day_loss_user" while spark1.6 supports
[ https://issues.apache.org/jira/browse/SPARK-16603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391268#comment-15391268 ] keliang edited comment on SPARK-16603 at 7/25/16 2:46 AM: -- Hi, I test this feature with spark-2.0.1-snapshot: first. create testing table tsp with column name [10_user_age: Int, 20_user_addr: String], and success == CREATE TABLE `tsp`(`10_user_age` int, `20_user_addr` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ( 'serialization.format' = '1' ) STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' TBLPROPERTIES ( 'transient_lastDdlTime' = '1469418111' ) second, query tsp with "select * from tsp where tsp.20_user_addr <10". get error: Error in query: mismatched input '.20' expecting {, '.', '[', 'GROUP', 'ORDER', 'HAVING', 'LIMIT', 'OR', 'AND', 'IN', NOT, 'BETWEEN', 'LIKE', RLIKE, 'IS', 'WINDOW', 'UNION', 'EXCEPT', 'INTERSECT', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '^', 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 1, pos 27) == SQL == select * from tsp where tsp.20_user_addr <10 ---^^^ Question is Why using Digit prefix in the column name to create table passed the validation rules, but failed when query table ? self-contradiction ? was (Author: biglobster): Hi, I test this feature with spark-2.0.1-snapshot: first. create testing table tsp with column name [10_user_age: Int, 20_user_addr: String], and success == CREATE TABLE `tsp`(`10_user_age` int, `20_user_addr` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ( 'serialization.format' = '1' ) STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' TBLPROPERTIES ( 'transient_lastDdlTime' = '1469418111' ) second, query tsp with "select * from tsp where tsp.20_user_addr <10". get error: Error in query: mismatched input '.20' expecting {, '.', '[', 'GROUP', 'ORDER', 'HAVING', 'LIMIT', 'OR', 'AND', 'IN', NOT, 'BETWEEN', 'LIKE', RLIKE, 'IS', 'WINDOW', 'UNION', 'EXCEPT', 'INTERSECT', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '^', 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 1, pos 27) == SQL == select * from tsp where tsp.20_user_addr <10 ---^^^ Question is Why using Digit prefix in the column name to create table passed the validation rules, but failed when query table ? > Spark2.0 fail in executing the sql statement which field name begins with > number,like "d.30_day_loss_user" while spark1.6 supports > -- > > Key: SPARK-16603 > URL: https://issues.apache.org/jira/browse/SPARK-16603 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: marymwu >Priority: Minor > > Spark2.0 fail in executing the sql statement which field name begins with > number,like "d.30_day_loss_user" while spark1.6 supports > Error: org.apache.spark.sql.catalyst.parser.ParseException: mismatched input > '.30' expecting > {')', ','} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16603) Spark2.0 fail in executing the sql statement which field name begins with number,like "d.30_day_loss_user" while spark1.6 supports
[ https://issues.apache.org/jira/browse/SPARK-16603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391268#comment-15391268 ] keliang commented on SPARK-16603: - Hi, I test this feature with spark-2.0.1-snapshot: first. create testing table tsp with column name [10_user_age: Int, 20_user_addr: String], and success == CREATE TABLE `tsp`(`10_user_age` int, `20_user_addr` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ( 'serialization.format' = '1' ) STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' TBLPROPERTIES ( 'transient_lastDdlTime' = '1469418111' ) second, query tsp with "select * from tsp where tsp.20_user_addr <10". get error: Error in query: mismatched input '.20' expecting {, '.', '[', 'GROUP', 'ORDER', 'HAVING', 'LIMIT', 'OR', 'AND', 'IN', NOT, 'BETWEEN', 'LIKE', RLIKE, 'IS', 'WINDOW', 'UNION', 'EXCEPT', 'INTERSECT', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '^', 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 1, pos 27) == SQL == select * from tsp where tsp.20_user_addr <10 ---^^^ Question is Why using Digit prefix in the column name to create table passed the validation rules, but failed when query table ? > Spark2.0 fail in executing the sql statement which field name begins with > number,like "d.30_day_loss_user" while spark1.6 supports > -- > > Key: SPARK-16603 > URL: https://issues.apache.org/jira/browse/SPARK-16603 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: marymwu >Priority: Minor > > Spark2.0 fail in executing the sql statement which field name begins with > number,like "d.30_day_loss_user" while spark1.6 supports > Error: org.apache.spark.sql.catalyst.parser.ParseException: mismatched input > '.30' expecting > {')', ','} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
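One hedged note on the comment above: Spark SQL generally requires backtick quoting for identifiers that do not start with a letter or underscore, which is why the unquoted reference fails to parse while the CREATE TABLE statement (whose column names were already backtick-quoted) succeeds. Whether quoting fully restores the 1.6 behaviour for this report is not confirmed here. A sketch, assuming a SparkSession {{spark}} and the {{tsp}} table from the comment:

{code}
// Hedged workaround sketch (assumes a SparkSession `spark` and the `tsp` table from the
// comment above). Backtick-quoting the digit-prefixed column lets the parser accept the
// reference; whether this fully matches the 1.6 behaviour is not confirmed in this issue.
val result = spark.sql("SELECT * FROM tsp WHERE tsp.`20_user_addr` < 10")
result.show()
{code}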
[jira] [Resolved] (SPARK-5581) When writing sorted map output file, avoid open / close between each partition
[ https://issues.apache.org/jira/browse/SPARK-5581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-5581. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 13382 [https://github.com/apache/spark/pull/13382] > When writing sorted map output file, avoid open / close between each partition > -- > > Key: SPARK-5581 > URL: https://issues.apache.org/jira/browse/SPARK-5581 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 1.3.0 >Reporter: Sandy Ryza > Fix For: 2.1.0 > > > {code} > // Bypassing merge-sort; get an iterator by partition and just write > everything directly. > for ((id, elements) <- this.partitionedIterator) { > if (elements.hasNext) { > val writer = blockManager.getDiskWriter( > blockId, outputFile, ser, fileBufferSize, > context.taskMetrics.shuffleWriteMetrics.get) > for (elem <- elements) { > writer.write(elem) > } > writer.commitAndClose() > val segment = writer.fileSegment() > lengths(id) = segment.length > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
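For context on what "avoid open / close between each partition" means in practice, here is an illustrative sketch of the single-open idea using plain Java streams: one output file handle serves every partition, and per-partition segment lengths are recorded from the bytes written. The real change goes through the {{blockManager.getDiskWriter}} writer shown in the quoted snippet; this sketch is not Spark's internal API.

{code}
// Illustrative only (not Spark's internal API): stream all partitions through one writer and
// record each partition's segment length, instead of opening/closing a writer per partition.
import java.io.{BufferedOutputStream, File, FileOutputStream}

def writeAllPartitions(outputFile: File,
                       partitions: Iterator[(Int, Iterator[Array[Byte]])],
                       lengths: Array[Long]): Unit = {
  val out = new BufferedOutputStream(new FileOutputStream(outputFile))
  try {
    for ((id, elements) <- partitions) {
      var written = 0L
      for (elem <- elements) {
        out.write(elem)
        written += elem.length
      }
      lengths(id) = written // segment length for this partition; no close/reopen in between
    }
  } finally {
    out.close()
  }
}
{code}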
[jira] [Assigned] (SPARK-16698) json parsing regression - "." in keys
[ https://issues.apache.org/jira/browse/SPARK-16698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16698: Assignee: Apache Spark > json parsing regression - "." in keys > - > > Key: SPARK-16698 > URL: https://issues.apache.org/jira/browse/SPARK-16698 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: TobiasP >Assignee: Apache Spark > > The commit 83775bc78e183791f75a99cdfbcd68a67ca0d472 "\[SPARK-14158]\[SQL] > implement buildReader for json data source" breaks parsing of json files with > "." in keys. > E.g. the test input for spark-solr > https://github.com/lucidworks/spark-solr/blob/master/src/test/resources/test-data/events.json > {noformat} > scala> > sqlContext.read.json("src/test/resources/test-data/events.json").collectAsList > org.apache.spark.sql.AnalysisException: Unable to resolve params.title_s > given [_version_, count_l, doc_id_s, flag_s, id, params.title_s, > params.url_s, session_id_s, timestamp_tdt, type_s, tz_timestamp_txt, > user_id_s]; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:131) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:131) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:130) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:126) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) > at org.apache.spark.sql.types.StructType.map(StructType.scala:94) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:126) > at > org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:80) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54) > at > org.apache.spark.sql.execution.SparkStrategies$SpecialLimits$.apply(SparkStrategies.scala:53) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:52) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:50) > at > 
org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:57) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:57) > at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2321) > at org.apache.spark.sql.Dataset.collectAsList(Dataset.scala:2040) > ... 49 elided > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16698) json parsing regression - "." in keys
[ https://issues.apache.org/jira/browse/SPARK-16698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16698: Assignee: (was: Apache Spark) > json parsing regression - "." in keys > - > > Key: SPARK-16698 > URL: https://issues.apache.org/jira/browse/SPARK-16698 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: TobiasP > > The commit 83775bc78e183791f75a99cdfbcd68a67ca0d472 "\[SPARK-14158]\[SQL] > implement buildReader for json data source" breaks parsing of json files with > "." in keys. > E.g. the test input for spark-solr > https://github.com/lucidworks/spark-solr/blob/master/src/test/resources/test-data/events.json > {noformat} > scala> > sqlContext.read.json("src/test/resources/test-data/events.json").collectAsList > org.apache.spark.sql.AnalysisException: Unable to resolve params.title_s > given [_version_, count_l, doc_id_s, flag_s, id, params.title_s, > params.url_s, session_id_s, timestamp_tdt, type_s, tz_timestamp_txt, > user_id_s]; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:131) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:131) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:130) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:126) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) > at org.apache.spark.sql.types.StructType.map(StructType.scala:94) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:126) > at > org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:80) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54) > at > org.apache.spark.sql.execution.SparkStrategies$SpecialLimits$.apply(SparkStrategies.scala:53) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:52) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:50) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:57) > at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:57) > at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2321) > at org.apache.spark.sql.Dataset.collectAsList(Dataset.scala:2040) > ... 49 elided > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16698) json parsing regression - "." in keys
[ https://issues.apache.org/jira/browse/SPARK-16698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391264#comment-15391264 ] Apache Spark commented on SPARK-16698: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/14339 > json parsing regression - "." in keys > - > > Key: SPARK-16698 > URL: https://issues.apache.org/jira/browse/SPARK-16698 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: TobiasP > > The commit 83775bc78e183791f75a99cdfbcd68a67ca0d472 "\[SPARK-14158]\[SQL] > implement buildReader for json data source" breaks parsing of json files with > "." in keys. > E.g. the test input for spark-solr > https://github.com/lucidworks/spark-solr/blob/master/src/test/resources/test-data/events.json > {noformat} > scala> > sqlContext.read.json("src/test/resources/test-data/events.json").collectAsList > org.apache.spark.sql.AnalysisException: Unable to resolve params.title_s > given [_version_, count_l, doc_id_s, flag_s, id, params.title_s, > params.url_s, session_id_s, timestamp_tdt, type_s, tz_timestamp_txt, > user_id_s]; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:131) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:131) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:130) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:126) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) > at org.apache.spark.sql.types.StructType.map(StructType.scala:94) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:126) > at > org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:80) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54) > at > org.apache.spark.sql.execution.SparkStrategies$SpecialLimits$.apply(SparkStrategies.scala:53) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:52) > at > 
org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:50) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:57) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:57) > at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2321) > at org.apache.spark.sql.Dataset.collectAsList(Dataset.scala:2040) > ... 49 elided > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
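A hedged, self-contained sketch for reproducing the shape of this regression without the spark-solr test data. It assumes Spark 2.0.0 with a SparkSession named {{spark}} and write access to /tmp; the JSON content is made up, the key point being a top-level field whose name contains a dot. As a general note, such columns are normally referenced with backtick quoting, e.g. {{df.select("`params.title_s`")}}; whether that sidesteps this particular failure is not stated in the report.

{code}
// Minimal reproduction sketch (assumes Spark 2.0.0, a SparkSession `spark`, and write access
// to /tmp). The JSON line is a made-up stand-in for the spark-solr events.json test data.
import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._

val path = "/tmp/spark16698/dotted-keys.json"
Files.createDirectories(Paths.get("/tmp/spark16698"))
Files.write(Paths.get(path), Seq("""{"id": "1", "params.title_s": "home"}""").asJava)

val df = spark.read.json(path)
df.printSchema()
// Reported to fail on 2.0.0 with:
//   org.apache.spark.sql.AnalysisException: Unable to resolve params.title_s given [...]
df.collectAsList()
{code}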
[jira] [Assigned] (SPARK-16701) Make parameters configurable in BlockManager
[ https://issues.apache.org/jira/browse/SPARK-16701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16701: Assignee: (was: Apache Spark) > Make parameters configurable in BlockManager > > > Key: SPARK-16701 > URL: https://issues.apache.org/jira/browse/SPARK-16701 > Project: Spark > Issue Type: Improvement >Reporter: YangyangLiu >Priority: Minor > > Make parameters configurable in BlockManager class, such as max_attempts and > sleep_time -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16701) Make parameters configurable in BlockManager
[ https://issues.apache.org/jira/browse/SPARK-16701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391247#comment-15391247 ] Apache Spark commented on SPARK-16701: -- User 'lovexi' has created a pull request for this issue: https://github.com/apache/spark/pull/14338 > Make parameters configurable in BlockManager > > > Key: SPARK-16701 > URL: https://issues.apache.org/jira/browse/SPARK-16701 > Project: Spark > Issue Type: Improvement >Reporter: YangyangLiu >Priority: Minor > > Make parameters configurable in BlockManager class, such as max_attempts and > sleep_time -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16701) Make parameters configurable in BlockManager
[ https://issues.apache.org/jira/browse/SPARK-16701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16701: Assignee: Apache Spark > Make parameters configurable in BlockManager > > > Key: SPARK-16701 > URL: https://issues.apache.org/jira/browse/SPARK-16701 > Project: Spark > Issue Type: Improvement >Reporter: YangyangLiu >Assignee: Apache Spark >Priority: Minor > > Make parameters configurable in BlockManager class, such as max_attempts and > sleep_time -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16701) Make parameters configurable in BlockManager
YangyangLiu created SPARK-16701: --- Summary: Make parameters configurable in BlockManager Key: SPARK-16701 URL: https://issues.apache.org/jira/browse/SPARK-16701 Project: Spark Issue Type: Improvement Reporter: YangyangLiu Priority: Minor Make parameters configurable in BlockManager class, such as max_attempts and sleep_time -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
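As a rough illustration of what the requested change could look like, the hard-coded constants would be read from the SparkConf that BlockManager already holds. The configuration key names and default values below are hypothetical, made up for illustration, and are not actual Spark properties.
{code}
// Hypothetical sketch only -- key names and defaults are invented; `conf` is assumed to be
// the SparkConf instance available inside BlockManager.
private val maxAttempts: Int =
  conf.getInt("spark.blockManager.retry.maxAttempts", 10)          // previously a hard-coded constant
private val sleepTimeMs: Long =
  conf.getTimeAsMs("spark.blockManager.retry.sleepInterval", "5s") // previously a hard-coded constant
{code}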
[jira] [Updated] (SPARK-16700) StructType doesn't accept Python dicts anymore
[ https://issues.apache.org/jira/browse/SPARK-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sylvain Zimmer updated SPARK-16700: --- Component/s: (was: Spark Core) > StructType doesn't accept Python dicts anymore > -- > > Key: SPARK-16700 > URL: https://issues.apache.org/jira/browse/SPARK-16700 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Sylvain Zimmer > > Hello, > I found this issue while testing my codebase with 2.0.0-rc5 > StructType in Spark 1.6.2 accepts the Python type, which is very > handy. 2.0.0-rc5 does not and throws an error. > I don't know if this was intended but I'd advocate for this behaviour to > remain the same. MapType is probably wasteful when your key names never > change and switching to Python tuples would be cumbersome. > Here is a minimal script to reproduce the issue: > {code} > from pyspark import SparkContext > from pyspark.sql import types as SparkTypes > from pyspark.sql import SQLContext > sc = SparkContext() > sqlc = SQLContext(sc) > struct_schema = SparkTypes.StructType([ > SparkTypes.StructField("id", SparkTypes.LongType()) > ]) > rdd = sc.parallelize([{"id": 0}, {"id": 1}]) > df = sqlc.createDataFrame(rdd, struct_schema) > print df.collect() > # 1.6.2 prints [Row(id=0), Row(id=1)] > # 2.0.0-rc5 raises TypeError: StructType can not accept object {'id': 0} in > type > {code} > Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16700) StructType doesn't accept Python dicts anymore
[ https://issues.apache.org/jira/browse/SPARK-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sylvain Zimmer updated SPARK-16700: --- Description: Hello, I found this issue while testing my codebase with 2.0.0-rc5 StructType in Spark 1.6.2 accepts the Python type, which is very handy. 2.0.0-rc5 does not and throws an error. I don't know if this was intended but I'd advocate for this behaviour to remain the same. MapType is probably wasteful when your key names never change and switching to Python tuples would be cumbersome. Here is a minimal script to reproduce the issue: {code} from pyspark import SparkContext from pyspark.sql import types as SparkTypes from pyspark.sql import SQLContext sc = SparkContext() sqlc = SQLContext(sc) struct_schema = SparkTypes.StructType([ SparkTypes.StructField("id", SparkTypes.LongType()) ]) rdd = sc.parallelize([{"id": 0}, {"id": 1}]) df = sqlc.createDataFrame(rdd, struct_schema) print df.collect() # 1.6.2 prints [Row(id=0), Row(id=1)] # 2.0.0-rc5 raises TypeError: StructType can not accept object {'id': 0} in type {code} Thanks! was: Hello, I found this issue while testing my codebase with 2.0.0-rc5 StructType in Spark 1.6.2 accepts the Python type, which is very handy. 2.0.0-rc5 does not and throws an error. I don't know if this was intended but I'd advocate for this behaviour to remain the same. MapType is probably wasteful when your key names never change and switching to Python tuples would be cumbersome. Here is a minimal script to reproduce the issue: {code:python} from pyspark import SparkContext from pyspark.sql import types as SparkTypes from pyspark.sql import SQLContext sc = SparkContext() sqlc = SQLContext(sc) struct_schema = SparkTypes.StructType([ SparkTypes.StructField("id", SparkTypes.LongType()) ]) rdd = sc.parallelize([{"id": 0}, {"id": 1}]) df = sqlc.createDataFrame(rdd, struct_schema) print df.collect() # 1.6.2 prints [Row(id=0), Row(id=1)] # 2.0.0-rc5 raises TypeError: StructType can not accept object {'id': 0} in type {code} Thanks! > StructType doesn't accept Python dicts anymore > -- > > Key: SPARK-16700 > URL: https://issues.apache.org/jira/browse/SPARK-16700 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.0.0 >Reporter: Sylvain Zimmer > > Hello, > I found this issue while testing my codebase with 2.0.0-rc5 > StructType in Spark 1.6.2 accepts the Python type, which is very > handy. 2.0.0-rc5 does not and throws an error. > I don't know if this was intended but I'd advocate for this behaviour to > remain the same. MapType is probably wasteful when your key names never > change and switching to Python tuples would be cumbersome. > Here is a minimal script to reproduce the issue: > {code} > from pyspark import SparkContext > from pyspark.sql import types as SparkTypes > from pyspark.sql import SQLContext > sc = SparkContext() > sqlc = SQLContext(sc) > struct_schema = SparkTypes.StructType([ > SparkTypes.StructField("id", SparkTypes.LongType()) > ]) > rdd = sc.parallelize([{"id": 0}, {"id": 1}]) > df = sqlc.createDataFrame(rdd, struct_schema) > print df.collect() > # 1.6.2 prints [Row(id=0), Row(id=1)] > # 2.0.0-rc5 raises TypeError: StructType can not accept object {'id': 0} in > type > {code} > Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16700) StructType doesn't accept Python dicts anymore
[ https://issues.apache.org/jira/browse/SPARK-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sylvain Zimmer updated SPARK-16700: --- Component/s: PySpark > StructType doesn't accept Python dicts anymore > -- > > Key: SPARK-16700 > URL: https://issues.apache.org/jira/browse/SPARK-16700 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.0.0 >Reporter: Sylvain Zimmer > > Hello, > I found this issue while testing my codebase with 2.0.0-rc5 > StructType in Spark 1.6.2 accepts the Python type, which is very > handy. 2.0.0-rc5 does not and throws an error. > I don't know if this was intended but I'd advocate for this behaviour to > remain the same. MapType is probably wasteful when your key names never > change and switching to Python tuples would be cumbersome. > Here is a minimal script to reproduce the issue: > {code} > from pyspark import SparkContext > from pyspark.sql import types as SparkTypes > from pyspark.sql import SQLContext > sc = SparkContext() > sqlc = SQLContext(sc) > struct_schema = SparkTypes.StructType([ > SparkTypes.StructField("id", SparkTypes.LongType()) > ]) > rdd = sc.parallelize([{"id": 0}, {"id": 1}]) > df = sqlc.createDataFrame(rdd, struct_schema) > print df.collect() > # 1.6.2 prints [Row(id=0), Row(id=1)] > # 2.0.0-rc5 raises TypeError: StructType can not accept object {'id': 0} in > type > {code} > Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16700) StructType doesn't accept Python dicts anymore
Sylvain Zimmer created SPARK-16700: -- Summary: StructType doesn't accept Python dicts anymore Key: SPARK-16700 URL: https://issues.apache.org/jira/browse/SPARK-16700 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.0 Reporter: Sylvain Zimmer Hello, I found this issue while testing my codebase with 2.0.0-rc5 StructType in Spark 1.6.2 accepts the Python type, which is very handy. 2.0.0-rc5 does not and throws an error. I don't know if this was intended but I'd advocate for this behaviour to remain the same. MapType is probably wasteful when your key names never change and switching to Python tuples would be cumbersome. Here is a minimal script to reproduce the issue: {code:python} from pyspark import SparkContext from pyspark.sql import types as SparkTypes from pyspark.sql import SQLContext sc = SparkContext() sqlc = SQLContext(sc) struct_schema = SparkTypes.StructType([ SparkTypes.StructField("id", SparkTypes.LongType()) ]) rdd = sc.parallelize([{"id": 0}, {"id": 1}]) df = sqlc.createDataFrame(rdd, struct_schema) print df.collect() # 1.6.2 prints [Row(id=0), Row(id=1)] # 2.0.0-rc5 raises TypeError: StructType can not accept object {'id': 0} in type {code} Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16699) Fix performance bug in hash aggregate on long string keys
[ https://issues.apache.org/jira/browse/SPARK-16699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391239#comment-15391239 ] Dongjoon Hyun commented on SPARK-16699: --- Hi, [~qifan]. Nice catch! By the way, usually, only committers set `FIX VERSION` field. You had better leave it blank next time. :) > Fix performance bug in hash aggregate on long string keys > - > > Key: SPARK-16699 > URL: https://issues.apache.org/jira/browse/SPARK-16699 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Qifan Pu > Fix For: 2.0.0 > > > In the following code in `VectorizedHashMapGenerator.scala`: > ``` > def hashBytes(b: String): String = { > val hash = ctx.freshName("hash") > s""" > |int $result = 0; > |for (int i = 0; i < $b.length; i++) { > | ${genComputeHash(ctx, s"$b[i]", ByteType, hash)} > | $result = ($result ^ (0x9e3779b9)) + $hash + ($result << 6) + > ($result >>> 2); > |} >""".stripMargin > } > ``` > when b=input.getBytes(), the current 2.0 code results in getBytes() being > called n times, n being length of input. getBytes() involves memory copy is > thus expensive and causes a performance degradation. > Fix is to evaluate getBytes() before the for loop. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
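The description above ends with "evaluate getBytes() before the for loop"; the sketch below shows what that hoisting looks like. It mirrors the generated-code template quoted in the issue, with `ctx`, `result`, `ByteType` and `genComputeHash` coming from the surrounding VectorizedHashMapGenerator class, so it is a fragment rather than a standalone program.
{code}
def hashBytes(b: String): String = {
  val hash = ctx.freshName("hash")
  val bytes = ctx.freshName("bytes")
  s"""
     |int $result = 0;
     |// bind the byte array once, so that $b (e.g. input.getBytes()) is evaluated a single
     |// time instead of once per iteration of the generated Java loop
     |byte[] $bytes = $b;
     |for (int i = 0; i < $bytes.length; i++) {
     |  ${genComputeHash(ctx, s"$bytes[i]", ByteType, hash)}
     |  $result = ($result ^ (0x9e3779b9)) + $hash + ($result << 6) + ($result >>> 2);
     |}
   """.stripMargin
}
{code}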
[jira] [Resolved] (SPARK-16645) rename CatalogStorageFormat.serdeProperties to properties
[ https://issues.apache.org/jira/browse/SPARK-16645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-16645. - Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14283 [https://github.com/apache/spark/pull/14283] > rename CatalogStorageFormat.serdeProperties to properties > - > > Key: SPARK-16645 > URL: https://issues.apache.org/jira/browse/SPARK-16645 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Minor > Fix For: 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16698) json parsing regression - "." in keys
[ https://issues.apache.org/jira/browse/SPARK-16698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391231#comment-15391231 ] Hyukjin Kwon commented on SPARK-16698: -- It seems it does not work for all `FileFormat` data sources. The code below also does not work. {code} test("SPARK-16698 - csv parsing regression - "." in keys") { withTempPath { path => val csv =""" |a.b |123 """.stripMargin spark.sparkContext .parallelize(csv :: Nil) .saveAsTextFile(path.getAbsolutePath) spark.read .option("inferSchema", "true") .option("header", "true") .csv(path.getAbsolutePath).select("`a.b`").show() } } {code} This is related with https://github.com/apache/spark/blob/37f3be5d29192db0a54f6c4699237b149bd0ecae/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L88-L89 > json parsing regression - "." in keys > - > > Key: SPARK-16698 > URL: https://issues.apache.org/jira/browse/SPARK-16698 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: TobiasP > > The commit 83775bc78e183791f75a99cdfbcd68a67ca0d472 "\[SPARK-14158]\[SQL] > implement buildReader for json data source" breaks parsing of json files with > "." in keys. > E.g. the test input for spark-solr > https://github.com/lucidworks/spark-solr/blob/master/src/test/resources/test-data/events.json > {noformat} > scala> > sqlContext.read.json("src/test/resources/test-data/events.json").collectAsList > org.apache.spark.sql.AnalysisException: Unable to resolve params.title_s > given [_version_, count_l, doc_id_s, flag_s, id, params.title_s, > params.url_s, session_id_s, timestamp_tdt, type_s, tz_timestamp_txt, > user_id_s]; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:131) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:131) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:130) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:126) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) > at org.apache.spark.sql.types.StructType.map(StructType.scala:94) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:126) > at > org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:80) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54) > at > 
org.apache.spark.sql.execution.SparkStrategies$SpecialLimits$.apply(SparkStrategies.scala:53) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:52) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:50) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:57) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:57) > at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2321) > at org.apache.spark.sql.Dataset.collectAsList(Dataset.scala:2040) > ... 49 elided > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional comma
[jira] [Commented] (SPARK-16698) json parsing regression - "." in keys
[ https://issues.apache.org/jira/browse/SPARK-16698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391217#comment-15391217 ] Hyukjin Kwon commented on SPARK-16698: -- FYI, this does not happen when it is read from json RDD. Let me please leave a code to reproduce that self-contains the issue. This does not work (in `JsonSuite.scala`) {code} test("SPARK-16698 - json parsing regression - "." in keys") { withTempPath { path => val json =""" {"a.b":"data"}""" spark.sparkContext .parallelize(json :: Nil) .saveAsTextFile(path.getAbsolutePath) spark.read.json(path.getAbsolutePath).collect() } } {code} This works {code} test("SPARK-16698 - json parsing regression - "." in keys") { withTempPath { path => val json =""" {"a.b":"data"}""" val rdd = spark.sparkContext .parallelize(json :: Nil) spark.read.json(rdd).collect() } } {code} > json parsing regression - "." in keys > - > > Key: SPARK-16698 > URL: https://issues.apache.org/jira/browse/SPARK-16698 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: TobiasP > > The commit 83775bc78e183791f75a99cdfbcd68a67ca0d472 "\[SPARK-14158]\[SQL] > implement buildReader for json data source" breaks parsing of json files with > "." in keys. > E.g. the test input for spark-solr > https://github.com/lucidworks/spark-solr/blob/master/src/test/resources/test-data/events.json > {noformat} > scala> > sqlContext.read.json("src/test/resources/test-data/events.json").collectAsList > org.apache.spark.sql.AnalysisException: Unable to resolve params.title_s > given [_version_, count_l, doc_id_s, flag_s, id, params.title_s, > params.url_s, session_id_s, timestamp_tdt, type_s, tz_timestamp_txt, > user_id_s]; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:131) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:131) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:130) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:126) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) > at org.apache.spark.sql.types.StructType.map(StructType.scala:94) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:126) > at > org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:80) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54) > at > 
org.apache.spark.sql.execution.SparkStrategies$SpecialLimits$.apply(SparkStrategies.scala:53) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:52) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:50) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:57) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:57) > at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2321) > at org.apache.spark.sql.Dataset.collectAsList(Dataset.scala:2040) > ... 49 elided > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For add
[jira] [Assigned] (SPARK-16699) Fix performance bug in hash aggregate on long string keys
[ https://issues.apache.org/jira/browse/SPARK-16699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16699: Assignee: Apache Spark > Fix performance bug in hash aggregate on long string keys > - > > Key: SPARK-16699 > URL: https://issues.apache.org/jira/browse/SPARK-16699 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Qifan Pu >Assignee: Apache Spark > Fix For: 2.0.0 > > > In the following code in `VectorizedHashMapGenerator.scala`: > ``` > def hashBytes(b: String): String = { > val hash = ctx.freshName("hash") > s""" > |int $result = 0; > |for (int i = 0; i < $b.length; i++) { > | ${genComputeHash(ctx, s"$b[i]", ByteType, hash)} > | $result = ($result ^ (0x9e3779b9)) + $hash + ($result << 6) + > ($result >>> 2); > |} >""".stripMargin > } > ``` > when b=input.getBytes(), the current 2.0 code results in getBytes() being > called n times, n being length of input. getBytes() involves memory copy is > thus expensive and causes a performance degradation. > Fix is to evaluate getBytes() before the for loop. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16699) Fix performance bug in hash aggregate on long string keys
[ https://issues.apache.org/jira/browse/SPARK-16699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391191#comment-15391191 ] Apache Spark commented on SPARK-16699: -- User 'ooq' has created a pull request for this issue: https://github.com/apache/spark/pull/14337 > Fix performance bug in hash aggregate on long string keys > - > > Key: SPARK-16699 > URL: https://issues.apache.org/jira/browse/SPARK-16699 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Qifan Pu > Fix For: 2.0.0 > > > In the following code in `VectorizedHashMapGenerator.scala`: > ``` > def hashBytes(b: String): String = { > val hash = ctx.freshName("hash") > s""" > |int $result = 0; > |for (int i = 0; i < $b.length; i++) { > | ${genComputeHash(ctx, s"$b[i]", ByteType, hash)} > | $result = ($result ^ (0x9e3779b9)) + $hash + ($result << 6) + > ($result >>> 2); > |} >""".stripMargin > } > ``` > when b=input.getBytes(), the current 2.0 code results in getBytes() being > called n times, n being length of input. getBytes() involves memory copy is > thus expensive and causes a performance degradation. > Fix is to evaluate getBytes() before the for loop. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16699) Fix performance bug in hash aggregate on long string keys
[ https://issues.apache.org/jira/browse/SPARK-16699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16699: Assignee: (was: Apache Spark) > Fix performance bug in hash aggregate on long string keys > - > > Key: SPARK-16699 > URL: https://issues.apache.org/jira/browse/SPARK-16699 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Qifan Pu > Fix For: 2.0.0 > > > In the following code in `VectorizedHashMapGenerator.scala`: > ``` > def hashBytes(b: String): String = { > val hash = ctx.freshName("hash") > s""" > |int $result = 0; > |for (int i = 0; i < $b.length; i++) { > | ${genComputeHash(ctx, s"$b[i]", ByteType, hash)} > | $result = ($result ^ (0x9e3779b9)) + $hash + ($result << 6) + > ($result >>> 2); > |} >""".stripMargin > } > ``` > when b=input.getBytes(), the current 2.0 code results in getBytes() being > called n times, n being length of input. getBytes() involves memory copy is > thus expensive and causes a performance degradation. > Fix is to evaluate getBytes() before the for loop. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16699) Fix performance bug in hash aggregate on long string keys
[ https://issues.apache.org/jira/browse/SPARK-16699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qifan Pu updated SPARK-16699: - Description: In the following code in `VectorizedHashMapGenerator.scala`: ``` def hashBytes(b: String): String = { val hash = ctx.freshName("hash") s""" |int $result = 0; |for (int i = 0; i < $b.length; i++) { | ${genComputeHash(ctx, s"$b[i]", ByteType, hash)} | $result = ($result ^ (0x9e3779b9)) + $hash + ($result << 6) + ($result >>> 2); |} """.stripMargin } ``` when b=input.getBytes(), the current 2.0 code results in getBytes() being called n times, n being length of input. getBytes() involves memory copy is thus expensive and causes a performance degradation. Fix is to evaluate getBytes() before the for loop. was: In the following code in `VectorizedHashMapGenerator.scala`: ``` def hashBytes(b: String): String = { val hash = ctx.freshName("hash") val bytes = ctx.freshName("bytes") s""" |int $result = 0; |byte[] $bytes = $b; |for (int i = 0; i < $bytes.length; i++) { | ${genComputeHash(ctx, s"$bytes[i]", ByteType, hash)} | $result = ($result ^ (0x9e3779b9)) + $hash + ($result << 6) + ($result >>> 2); |} """.stripMargin } ``` when b=input.getBytes(), the current 2.0 code results in getBytes() being called n times, n being length of input. getBytes() involves memory copy is thus expensive and causes a performance degradation. Fix is to evaluate getBytes() before the for loop. > Fix performance bug in hash aggregate on long string keys > - > > Key: SPARK-16699 > URL: https://issues.apache.org/jira/browse/SPARK-16699 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Qifan Pu > Fix For: 2.0.0 > > > In the following code in `VectorizedHashMapGenerator.scala`: > ``` > def hashBytes(b: String): String = { > val hash = ctx.freshName("hash") > s""" > |int $result = 0; > |for (int i = 0; i < $b.length; i++) { > | ${genComputeHash(ctx, s"$b[i]", ByteType, hash)} > | $result = ($result ^ (0x9e3779b9)) + $hash + ($result << 6) + > ($result >>> 2); > |} >""".stripMargin > } > ``` > when b=input.getBytes(), the current 2.0 code results in getBytes() being > called n times, n being length of input. getBytes() involves memory copy is > thus expensive and causes a performance degradation. > Fix is to evaluate getBytes() before the for loop. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16699) Fix performance bug in hash aggregate on long string keys
Qifan Pu created SPARK-16699: Summary: Fix performance bug in hash aggregate on long string keys Key: SPARK-16699 URL: https://issues.apache.org/jira/browse/SPARK-16699 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.0 Reporter: Qifan Pu Fix For: 2.0.0 In the following code in `VectorizedHashMapGenerator.scala`: ``` def hashBytes(b: String): String = { val hash = ctx.freshName("hash") val bytes = ctx.freshName("bytes") s""" |int $result = 0; |byte[] $bytes = $b; |for (int i = 0; i < $bytes.length; i++) { | ${genComputeHash(ctx, s"$bytes[i]", ByteType, hash)} | $result = ($result ^ (0x9e3779b9)) + $hash + ($result << 6) + ($result >>> 2); |} """.stripMargin } ``` when b=input.getBytes(), the current 2.0 code results in getBytes() being called n times, n being length of input. getBytes() involves memory copy is thus expensive and causes a performance degradation. Fix is to evaluate getBytes() before the for loop. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16685) audit release docs are ambiguous
[ https://issues.apache.org/jira/browse/SPARK-16685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391168#comment-15391168 ] Patrick Wendell commented on SPARK-16685: - These scripts are pretty old and I'm not sure if anyone still uses them. I had written them a while back as sanity tests for some release builds. Today, those things are tested broadly by the community so I think this has become redundant. [~rxin] are these still used? If not, it might be good to remove them from the source repo. > audit release docs are ambiguous > > > Key: SPARK-16685 > URL: https://issues.apache.org/jira/browse/SPARK-16685 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 1.6.2 >Reporter: jay vyas >Priority: Minor > > The dev/audit-release tooling is ambiguous. > - should it run against a real cluster? if so when? > - what should be in the release repo? Just jars? tarballs? ( i assume jars > because its a .ivy, but not sure). > - > https://github.com/apache/spark/tree/master/dev/audit-release -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16698) json parsing regression - "." in keys
TobiasP created SPARK-16698: --- Summary: json parsing regression - "." in keys Key: SPARK-16698 URL: https://issues.apache.org/jira/browse/SPARK-16698 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: TobiasP The commit 83775bc78e183791f75a99cdfbcd68a67ca0d472 "\[SPARK-14158]\[SQL] implement buildReader for json data source" breaks parsing of json files with "." in keys. E.g. the test input for spark-solr https://github.com/lucidworks/spark-solr/blob/master/src/test/resources/test-data/events.json {noformat} scala> sqlContext.read.json("src/test/resources/test-data/events.json").collectAsList org.apache.spark.sql.AnalysisException: Unable to resolve params.title_s given [_version_, count_l, doc_id_s, flag_s, id, params.title_s, params.url_s, session_id_s, timestamp_tdt, type_s, tz_timestamp_txt, user_id_s]; at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:131) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:131) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:130) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:126) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.Iterator$class.foreach(Iterator.scala:742) at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94) at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) at org.apache.spark.sql.types.StructType.map(StructType.scala:94) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:126) at org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:80) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396) at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59) at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54) at org.apache.spark.sql.execution.SparkStrategies$SpecialLimits$.apply(SparkStrategies.scala:53) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396) at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59) at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:52) at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:50) at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:57) at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:57) at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2321) at org.apache.spark.sql.Dataset.collectAsList(Dataset.scala:2040) ... 
49 elided {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16589) Chained cartesian produces incorrect number of records
[ https://issues.apache.org/jira/browse/SPARK-16589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391108#comment-15391108 ] Maciej Szymkiewicz commented on SPARK-16589: [~holdenk] Makes sense. I was thinking more about design than other possible issues but it is probably better safe than sorry. It still should be fixed as fast as possible. It is really ugly bug and is easy to miss. I doubt there are many legitimate cases when one can do something like this though (I guess this is why it hasn't been reported before). > Chained cartesian produces incorrect number of records > -- > > Key: SPARK-16589 > URL: https://issues.apache.org/jira/browse/SPARK-16589 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.4.0, 1.5.0, 1.6.0, 2.0.0 >Reporter: Maciej Szymkiewicz > > Chaining cartesian calls in PySpark results in the number of records lower > than expected. It can be reproduced as follows: > {code} > rdd = sc.parallelize(range(10), 1) > rdd.cartesian(rdd).cartesian(rdd).count() > ## 355 > rdd.cartesian(rdd).cartesian(rdd).distinct().count() > ## 251 > {code} > It looks like it is related to serialization. If we reserialize after initial > cartesian: > {code} > rdd.cartesian(rdd)._reserialize(BatchedSerializer(PickleSerializer(), > 1)).cartesian(rdd).count() > ## 1000 > {code} > or insert identity map: > {code} > rdd.cartesian(rdd).map(lambda x: x).cartesian(rdd).count() > ## 1000 > {code} > it yields correct results. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12378) CREATE EXTERNAL TABLE AS SELECT EXPORT AWS S3 ERROR
[ https://issues.apache.org/jira/browse/SPARK-12378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15387889#comment-15387889 ] Chandana Sapparapu edited comment on SPARK-12378 at 7/24/16 3:46 PM: - UPDATE - For temporary table, I don't have this issue. For external table - the following syntax seems to work. FROM (select a.* ,b.* from TBL_A a , TBL_B b where a.id=b.id) INPUT insert overwrite TABLE MY_EXTERNAL_TABLE PARTITION (PARTITION_DT='2016-07-24' , SUB_PARTITION_DT='2016-07-24') select * where flg_1='N'; -- Environment: Spark 1.6.2 AWS EMR 4.7.2 Hive execution 1.2.1 Metastore: 1.0.0 Query : insert into table (managed) select a.*, b.* from external_table1 a LEFT OUTER JOIN external_table2 b on a.id=b.id; The statement succeeds and some data is written into this managed table, but it throws the following exception which stops the subsequent statements from getting executed. java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.sql.hive.client.Shim_v0_14.loadTable(HiveShim.scala:442) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadTable$1.apply$mcV$sp(ClientWrapper.scala:557) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadTable$1.apply(ClientWrapper.scala:557) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadTable$1.apply(ClientWrapper.scala:557) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:290) at org.apache.spark.sql.hive.client.ClientWrapper.liftedTree1$1(ClientWrapper.scala:237) at org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:236) at org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:279) at org.apache.spark.sql.hive.client.ClientWrapper.loadTable(ClientWrapper.scala:556) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:256) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:127) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:276) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) at org.apache.spark.sql.DataFrame.(DataFrame.scala:145) at org.apache.spark.sql.DataFrame.(DataFrame.scala:130) at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817) at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:311) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:226) at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to alter table. Invalid method name: 'alter_table_with_cascade' at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:50
[jira] [Commented] (SPARK-16695) compile spark with scala 2.11 is not working (with Sbt)
[ https://issues.apache.org/jira/browse/SPARK-16695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391088#comment-15391088 ] Sean Owen commented on SPARK-16695: --- `sbt -Dscala-2.11 package` works for 1.6.2 on my Mac. So does `dependencyTree`. The plugin is not in namespace net.virtualvoid but net.virtual-void. I imagine you have some other problem that's not Spark related? or at least, you haven't reported an error here. > compile spark with scala 2.11 is not working (with Sbt) > --- > > Key: SPARK-16695 > URL: https://issues.apache.org/jira/browse/SPARK-16695 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.6.2 > Environment: OS: Mac >Reporter: Eran Mizrahi > > tried to compile spark 1.6.2 with sbt for Scala 2.11, but got compilation > error. > it seems that there is a wrong import in file Project/SparkBuild.scala. > instead of "net.virtualvoid.sbt.graph.Plugin.graphSettings", > it should be "net.virtualvoid.sbt.graph. > DependencyGraphSettings.graphSettings". > Checkout source code of "sbt-dependency-graph" plugin: > https://github.com/jrudolph/sbt-dependency-graph/blob/master/src/main/scala/net/virtualvoid/sbt/graph/DependencyGraphSettings.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3246) Support weighted SVMWithSGD for classification of unbalanced dataset
[ https://issues.apache.org/jira/browse/SPARK-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391082#comment-15391082 ] Mohamed Baddar commented on SPARK-3246: --- [~sheridanrawlins] Working on it soon, most probably on 1st of August > Support weighted SVMWithSGD for classification of unbalanced dataset > > > Key: SPARK-3246 > URL: https://issues.apache.org/jira/browse/SPARK-3246 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 0.9.0, 1.0.2 >Reporter: mahesh bhole > > Please support weighted SVMWithSGD for binary classification of unbalanced > dataset.Though other options like undersampling or oversampling can be > used,It will be good if we can have a way to assign weights to minority > class. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16696) unused broadcast variables should call destroy instead of unpersist
[ https://issues.apache.org/jira/browse/SPARK-16696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-16696: --- Issue Type: Improvement (was: Bug) > unused broadcast variables should call destroy instead of unpersist > --- > > Key: SPARK-16696 > URL: https://issues.apache.org/jira/browse/SPARK-16696 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.0.1, 2.1.0 >Reporter: Weichen Xu > Original Estimate: 1m > Remaining Estimate: 1m > > Unused broadcast variables should call destroy() instead of unpersist() so > that the memory can released in time, even in driver-side. > currently, several algorithm in ML, such as KMeans, Word2Vec, has this > problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16697) redundant RDD computation in LDAOptimizer
[ https://issues.apache.org/jira/browse/SPARK-16697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16697: Assignee: Apache Spark > redundant RDD computation in LDAOptimizer > - > > Key: SPARK-16697 > URL: https://issues.apache.org/jira/browse/SPARK-16697 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.0.1, 2.1.0 >Reporter: Weichen Xu >Assignee: Apache Spark > Original Estimate: 24h > Remaining Estimate: 24h > > In mllib.clustering.LDAOptimizer > the submitMiniBatch method, > the stats: RDD do not persist but the following code will use it twice. > so it cause redundant computation on it. > and there is another problem, > the expElogbetaBc broadcast variable is unpersist too early, > and the next statement > ` > val gammat: BDM[Double] = breeze.linalg.DenseMatrix.vertcat(val gammat: > BDM[Double] = breeze.linalg.DenseMatrix.vertcat( >stats.map(_._2).flatMap(list => list).collect().map(_.toDenseMatrix): > _*) > ` > will re-compute the stats RDD, it will use expElogbetaBc broadcast variable > again, > so the expElogbetaBc broadcast variable will be broadcast again. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
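A sketch of the reordering the issue asks for, using the local names from LDAOptimizer.submitMiniBatch (`stats`, `expElogbetaBc`, `gammat`) as they appear in the description; the storage level and the exact placement of the calls are illustrative assumptions, not the actual patch.
{code}
import breeze.linalg.{DenseMatrix => BDM}
import org.apache.spark.storage.StorageLevel

// Persist stats before the first job that consumes it, so the second traversal below does
// not recompute the lineage (which would also re-broadcast expElogbetaBc).
stats.persist(StorageLevel.MEMORY_AND_DISK)

// ... the first use of stats (an aggregation over its first component) happens here ...

val gammat: BDM[Double] = breeze.linalg.DenseMatrix.vertcat(
  stats.map(_._2).flatMap(list => list).collect().map(_.toDenseMatrix): _*)

// Release resources only after the last job that reads them.
stats.unpersist()
expElogbetaBc.unpersist()
{code}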
[jira] [Commented] (SPARK-16697) redundant RDD computation in LDAOptimizer
[ https://issues.apache.org/jira/browse/SPARK-16697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391067#comment-15391067 ] Apache Spark commented on SPARK-16697: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/14335 > redundant RDD computation in LDAOptimizer > - > > Key: SPARK-16697 > URL: https://issues.apache.org/jira/browse/SPARK-16697 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.0.1, 2.1.0 >Reporter: Weichen Xu > Original Estimate: 24h > Remaining Estimate: 24h > > In mllib.clustering.LDAOptimizer > the submitMiniBatch method, > the stats: RDD do not persist but the following code will use it twice. > so it cause redundant computation on it. > and there is another problem, > the expElogbetaBc broadcast variable is unpersist too early, > and the next statement > ` > val gammat: BDM[Double] = breeze.linalg.DenseMatrix.vertcat(val gammat: > BDM[Double] = breeze.linalg.DenseMatrix.vertcat( >stats.map(_._2).flatMap(list => list).collect().map(_.toDenseMatrix): > _*) > ` > will re-compute the stats RDD, it will use expElogbetaBc broadcast variable > again, > so the expElogbetaBc broadcast variable will be broadcast again. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16697) redundant RDD computation in LDAOptimizer
[ https://issues.apache.org/jira/browse/SPARK-16697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16697: Assignee: (was: Apache Spark) > redundant RDD computation in LDAOptimizer > - > > Key: SPARK-16697 > URL: https://issues.apache.org/jira/browse/SPARK-16697 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.0.1, 2.1.0 >Reporter: Weichen Xu > Original Estimate: 24h > Remaining Estimate: 24h > > In mllib.clustering.LDAOptimizer > the submitMiniBatch method, > the stats: RDD do not persist but the following code will use it twice. > so it cause redundant computation on it. > and there is another problem, > the expElogbetaBc broadcast variable is unpersist too early, > and the next statement > ` > val gammat: BDM[Double] = breeze.linalg.DenseMatrix.vertcat(val gammat: > BDM[Double] = breeze.linalg.DenseMatrix.vertcat( >stats.map(_._2).flatMap(list => list).collect().map(_.toDenseMatrix): > _*) > ` > will re-compute the stats RDD, it will use expElogbetaBc broadcast variable > again, > so the expElogbetaBc broadcast variable will be broadcast again. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16697) redundant RDD computation in LDAOptimizer
[ https://issues.apache.org/jira/browse/SPARK-16697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-16697: --- Description: In mllib.clustering.LDAOptimizer the submitMiniBatch method, the stats: RDD do not persist but the following code will use it twice. so it cause redundant computation on it. and there is another problem, the expElogbetaBc broadcast variable is unpersist too early, and the next statement ` val gammat: BDM[Double] = breeze.linalg.DenseMatrix.vertcat(val gammat: BDM[Double] = breeze.linalg.DenseMatrix.vertcat( stats.map(_._2).flatMap(list => list).collect().map(_.toDenseMatrix): _*) ` will re-compute the stats RDD, it will use expElogbetaBc broadcast variable again, so the expElogbetaBc broadcast variable will be broadcast again. was: In mllib.clustering.LDAOptimizer the submitMiniBatch method, the stats: RDD do not persist but the following code will use it twice. so it cause redundant computation on it. > redundant RDD computation in LDAOptimizer > - > > Key: SPARK-16697 > URL: https://issues.apache.org/jira/browse/SPARK-16697 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.0.1, 2.1.0 >Reporter: Weichen Xu > Original Estimate: 24h > Remaining Estimate: 24h > > In mllib.clustering.LDAOptimizer > the submitMiniBatch method, > the stats: RDD do not persist but the following code will use it twice. > so it cause redundant computation on it. > and there is another problem, > the expElogbetaBc broadcast variable is unpersist too early, > and the next statement > ` > val gammat: BDM[Double] = breeze.linalg.DenseMatrix.vertcat(val gammat: > BDM[Double] = breeze.linalg.DenseMatrix.vertcat( >stats.map(_._2).flatMap(list => list).collect().map(_.toDenseMatrix): > _*) > ` > will re-compute the stats RDD, it will use expElogbetaBc broadcast variable > again, > so the expElogbetaBc broadcast variable will be broadcast again. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16697) redundant RDD computation in LDAOptimizer
Weichen Xu created SPARK-16697: -- Summary: redundant RDD computation in LDAOptimizer Key: SPARK-16697 URL: https://issues.apache.org/jira/browse/SPARK-16697 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 2.0.1, 2.1.0 Reporter: Weichen Xu In the submitMiniBatch method of mllib.clustering.LDAOptimizer, the stats RDD is not persisted but the following code uses it twice, so it causes redundant computation on it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-16696) unused broadcast variables should call destroy instead of unpersist
[ https://issues.apache.org/jira/browse/SPARK-16696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-16696: --- Description: Unused broadcast variables should call destroy() instead of unpersist() so that the memory can released in time, even in driver-side. currently, several algorithm in ML, such as KMeans, Word2Vec, has this problem. was: Unused broadcast variables should call destroy() instead of unpersist() so that the memory can released in time, even in driver-side. currently, several algorithm in ML, such as KMeans, LDA, Word2Vec, has this problem. Several places unused broadcast vars even neither call unpersit() nor call destroy() > unused broadcast variables should call destroy instead of unpersist > --- > > Key: SPARK-16696 > URL: https://issues.apache.org/jira/browse/SPARK-16696 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.1, 2.1.0 >Reporter: Weichen Xu > Original Estimate: 1m > Remaining Estimate: 1m > > Unused broadcast variables should call destroy() instead of unpersist() so > that the memory can released in time, even in driver-side. > currently, several algorithm in ML, such as KMeans, Word2Vec, has this > problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16696) unused broadcast variables should call destroy instead of unpersist
[ https://issues.apache.org/jira/browse/SPARK-16696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391050#comment-15391050 ] Apache Spark commented on SPARK-16696: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/14333 > unused broadcast variables should call destroy instead of unpersist > --- > > Key: SPARK-16696 > URL: https://issues.apache.org/jira/browse/SPARK-16696 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.1, 2.1.0 >Reporter: Weichen Xu > Original Estimate: 1m > Remaining Estimate: 1m > > Unused broadcast variables should call destroy() instead of unpersist() so > that the memory can released in time, even in driver-side. > currently, several algorithm in ML, such as KMeans, LDA, Word2Vec, has this > problem. Several places unused broadcast vars even neither call unpersit() > nor call destroy() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16696) unused broadcast variables should call destroy instead of unpersist
[ https://issues.apache.org/jira/browse/SPARK-16696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16696: Assignee: (was: Apache Spark) > unused broadcast variables should call destroy instead of unpersist > --- > > Key: SPARK-16696 > URL: https://issues.apache.org/jira/browse/SPARK-16696 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.1, 2.1.0 >Reporter: Weichen Xu > Original Estimate: 1m > Remaining Estimate: 1m > > Unused broadcast variables should call destroy() instead of unpersist() so > that the memory can released in time, even in driver-side. > currently, several algorithm in ML, such as KMeans, LDA, Word2Vec, has this > problem. Several places unused broadcast vars even neither call unpersit() > nor call destroy() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16696) unused broadcast variables should call destroy instead of unpersist
[ https://issues.apache.org/jira/browse/SPARK-16696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16696: Assignee: Apache Spark > unused broadcast variables should call destroy instead of unpersist > --- > > Key: SPARK-16696 > URL: https://issues.apache.org/jira/browse/SPARK-16696 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.1, 2.1.0 >Reporter: Weichen Xu >Assignee: Apache Spark > Original Estimate: 1m > Remaining Estimate: 1m > > Unused broadcast variables should call destroy() instead of unpersist() so > that the memory can released in time, even in driver-side. > currently, several algorithm in ML, such as KMeans, LDA, Word2Vec, has this > problem. Several places unused broadcast vars even neither call unpersit() > nor call destroy() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16696) unused broadcast variables should call destroy instead of unpersist
Weichen Xu created SPARK-16696: -- Summary: unused broadcast variables should call destroy instead of unpersist Key: SPARK-16696 URL: https://issues.apache.org/jira/browse/SPARK-16696 Project: Spark Issue Type: Bug Components: ML, MLlib Affects Versions: 2.0.1, 2.1.0 Reporter: Weichen Xu Unused broadcast variables should call destroy() instead of unpersist() so that the memory can be released in time, even on the driver side. Currently, several algorithms in ML, such as KMeans, LDA, and Word2Vec, have this problem. In several places, unused broadcast variables call neither unpersist() nor destroy(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
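For context on the distinction the issue draws, here is a small self-contained sketch; the app name and sample data are made up. unpersist() only drops the executor-side copies and the broadcast can still be re-sent by a later job, while destroy() also releases the driver-side copy and makes further use of the variable an error, which is why it frees memory sooner.
{code}
import org.apache.spark.sql.SparkSession

object BroadcastCleanupSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("broadcast-cleanup-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Broadcast some small lookup data (here: cluster centers) to the executors.
    val centers = sc.broadcast(Array(1.0, 2.0, 3.0))

    val assignments = sc.parallelize(Seq(0.9, 2.1, 3.2))
      .map(x => centers.value.minBy(c => math.abs(c - x)))
      .collect()
    println(assignments.mkString(", "))

    // centers.unpersist() would only drop the executor-side copies; the variable could still
    // be re-broadcast by a later job. Once no future job reads it, destroy() also frees the
    // driver-side copy, which is the behaviour the issue asks for.
    centers.destroy()

    spark.stop()
  }
}
{code}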
[jira] [Created] (SPARK-16695) compile spark with scala 2.11 is not working (with Sbt)
Eran Mizrahi created SPARK-16695: Summary: compile spark with scala 2.11 is not working (with Sbt) Key: SPARK-16695 URL: https://issues.apache.org/jira/browse/SPARK-16695 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.6.2 Environment: OS: Mac Reporter: Eran Mizrahi Tried to compile Spark 1.6.2 with sbt for Scala 2.11, but got a compilation error. It seems that there is a wrong import in project/SparkBuild.scala: instead of "net.virtualvoid.sbt.graph.Plugin.graphSettings", it should be "net.virtualvoid.sbt.graph.DependencyGraphSettings.graphSettings". Check out the source code of the "sbt-dependency-graph" plugin: https://github.com/jrudolph/sbt-dependency-graph/blob/master/src/main/scala/net/virtualvoid/sbt/graph/DependencyGraphSettings.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
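A hedged sketch of the change being suggested (not a verified patch; the surrounding build definitions are omitted):
{code}
// Hypothetical excerpt from project/SparkBuild.scala:

// before -- compiles only against older sbt-dependency-graph releases:
// import net.virtualvoid.sbt.graph.Plugin.graphSettings

// after -- the object was moved in newer plugin versions (see the link above):
import net.virtualvoid.sbt.graph.DependencyGraphSettings.graphSettings
{code}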
[jira] [Updated] (SPARK-16416) Logging in shutdown hook does not work properly with Log4j 2.x
[ https://issues.apache.org/jira/browse/SPARK-16416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-16416: -- Assignee: Mikael Ståldal > Logging in shutdown hook does not work properly with Log4j 2.x > -- > > Key: SPARK-16416 > URL: https://issues.apache.org/jira/browse/SPARK-16416 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2 >Reporter: Mikael Ståldal >Assignee: Mikael Ståldal >Priority: Minor > Fix For: 2.1.0 > > > Spark registers some shutdown hooks, and they log messages during shutdown: > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/ShutdownHookManager.scala#L58 > Since the {{Logging}} trait creates SLF4J loggers lazily: > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala#L47 > a SLF4J logger is created during the execution of the shutdown hook. > This does not work when Log4j 2.x is used as SLF4J implementation: > https://issues.apache.org/jira/browse/LOG4J2-1222 > Even though Log4j 2.6 handles this more gracefully than before, it still does > emit a warning and will not be able to process the log message properly. > Proposed solution: make sure to eagerly create the SLF4J logger to be used in > shutdown hooks when registering the hook. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16416) Logging in shutdown hook does not work properly with Log4j 2.x
[ https://issues.apache.org/jira/browse/SPARK-16416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16416. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14320 [https://github.com/apache/spark/pull/14320] > Logging in shutdown hook does not work properly with Log4j 2.x > -- > > Key: SPARK-16416 > URL: https://issues.apache.org/jira/browse/SPARK-16416 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2 >Reporter: Mikael Ståldal >Priority: Minor > Fix For: 2.1.0 > > > Spark registers some shutdown hooks, and they log messages during shutdown: > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/ShutdownHookManager.scala#L58 > Since the {{Logging}} trait creates SLF4J loggers lazily: > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala#L47 > a SLF4J logger is created during the execution of the shutdown hook. > This does not work when Log4j 2.x is used as SLF4J implementation: > https://issues.apache.org/jira/browse/LOG4J2-1222 > Even though Log4j 2.6 handles this more gracefully than before, it still does > emit a warning and will not be able to process the log message properly. > Proposed solution: make sure to eagerly create the SLF4J logger to be used in > shutdown hooks when registering the hook. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
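For illustration, a minimal sketch of the "eager logger" idea described above; the names here are placeholders rather than the actual Spark code:
{code}
import org.slf4j.{Logger, LoggerFactory}

object ShutdownHookLoggingSketch {
  // Created eagerly at initialization time, instead of lazily inside the hook.
  private val log: Logger = LoggerFactory.getLogger(getClass)

  def registerHook(): Unit = {
    log.debug("logger initialized before the JVM starts shutting down")
    Runtime.getRuntime.addShutdownHook(new Thread(new Runnable {
      // By the time this runs, the logger already exists, so Log4j 2.x does not
      // have to create it during shutdown.
      override def run(): Unit = log.info("Shutdown hook called")
    }))
  }
}
{code}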
[jira] [Updated] (SPARK-16667) Spark driver executor dont release unused memory
[ https://issues.apache.org/jira/browse/SPARK-16667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-16667: -- Target Version/s: (was: 1.6.0) > Spark driver executor dont release unused memory > > > Key: SPARK-16667 > URL: https://issues.apache.org/jira/browse/SPARK-16667 > Project: Spark > Issue Type: Bug > Components: GraphX, Spark Core >Affects Versions: 1.6.0 > Environment: Ubuntu wily 64 bits > java 1.8 > 3 slaves (4GB) and 1 master (2GB) virtual machines in VMware over an i7 4th generation > with 16 GB RAM >Reporter: Luis Angel Hernández Acosta > > I'm running a Spark app in a standalone cluster. My app creates a SparkContext and > performs many calculations with GraphX over time. To calculate, my app creates a new > Java thread and waits for its completion signal. Between calculations, memory grows > by 50-100 MB. I use a thread to make sure that any object created for a calculation > is destroyed after the calculation ends, but memory keeps growing. I tried stopping > the SparkContext: all executor memory allocated by the app is freed, but my driver's > memory still grows by the same 50-100 MB. > My graph calculation includes HDFS serialization of RDDs and loading the graph from > HDFS. > Spark env: > export SPARK_MASTER_IP=master > export SPARK_WORKER_CORES=4 > export SPARK_WORKER_MEMORY=2919m > export SPARK_WORKER_INSTANCES=1 > export SPARK_DAEMON_MEMORY=256m > export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true > -Dspark.worker.cleanup.interval=10" > Those are my only configurations -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16692) multilabel classification to DataFrame, ML
[ https://issues.apache.org/jira/browse/SPARK-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-16692: -- Target Version/s: (was: 1.6.0) > multilabel classification to DataFrame, ML > --- > > Key: SPARK-16692 > URL: https://issues.apache.org/jira/browse/SPARK-16692 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Weizhi Li >Priority: Minor > Original Estimate: 1h > Remaining Estimate: 1h > > For multilabel evaluation, there is a method in MLlib named > MultilabelMetrics. A multilabel classification problem involves mapping each > sample in a dataset to a set of class labels. In this type of classification > problem, the labels are not mutually exclusive. For example, when classifying > a set of news articles into topics, a single article might be both science > and politics. > Add this method to support DataFrames in ML. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
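For reference, a small sketch of the existing RDD-based API that this issue proposes to expose for DataFrames in ML; the data is made up and an existing SparkContext {{sc}} is assumed:
{code}
import org.apache.spark.mllib.evaluation.MultilabelMetrics

// Each record is (predicted labels, true labels); values are illustrative only.
val predictionAndLabels = sc.parallelize(Seq(
  (Array(0.0, 1.0), Array(0.0, 2.0)),
  (Array(0.0, 2.0), Array(0.0, 1.0)),
  (Array(0.0),      Array(0.0))
))

val metrics = new MultilabelMetrics(predictionAndLabels)
println(s"Micro F1:        ${metrics.microF1Measure}")
println(s"Hamming loss:    ${metrics.hammingLoss}")
println(s"Subset accuracy: ${metrics.subsetAccuracy}")
{code}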
[jira] [Updated] (SPARK-16463) Support `truncate` option in Overwrite mode for JDBC DataFrameWriter
[ https://issues.apache.org/jira/browse/SPARK-16463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-16463: -- Assignee: Dongjoon Hyun > Support `truncate` option in Overwrite mode for JDBC DataFrameWriter > - > > Key: SPARK-16463 > URL: https://issues.apache.org/jira/browse/SPARK-16463 > Project: Spark > Issue Type: Improvement >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.1.0 > > > This issue adds a boolean option, `truncate`, for SaveMode.Overwrite of the JDBC > DataFrameWriter. If this option is `true`, it uses `TRUNCATE TABLE` instead of > `DROP TABLE`. > - Without CREATE/DROP privileges, we can save a dataframe to the database; sometimes > these privileges are not granted for security reasons. > - It will keep the existing table information, so users can add and keep some > additional CONSTRAINTs for the table. > - Sometimes, TRUNCATE is faster than the combination of DROP/CREATE. > This issue is different from SPARK-16410, which aims to use `TRUNCATE` only > for JDBC sources. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16410) DataFrameWriter's jdbc method drops table in overwrite mode
[ https://issues.apache.org/jira/browse/SPARK-16410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16410. --- Resolution: Duplicate > DataFrameWriter's jdbc method drops table in overwrite mode > --- > > Key: SPARK-16410 > URL: https://issues.apache.org/jira/browse/SPARK-16410 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1, 1.6.2 >Reporter: Ian Hellstrom > > According to the [API > documentation|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter], > the write mode {{overwrite}} should _overwrite the existing data_, which > suggests that the data is removed, i.e. the table is truncated. > However, that is not what happens in the [source > code|https://github.com/apache/spark/blob/0ad6ce7e54b1d8f5946dde652fa5341d15059158/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L421]: > {code} > if (mode == SaveMode.Overwrite && tableExists) { > JdbcUtils.dropTable(conn, table) > tableExists = false > } > {code} > This clearly shows that the table is first dropped and then recreated. This > causes two major issues: > * Existing indexes, partitioning schemes, etc. are completely lost. > * The case of identifiers may be changed without the user understanding why. > In my opinion, the table should be truncated, not dropped. Overwriting data > is a DML operation and should not cause DDL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16463) Support `truncate` option in Overwrite mode for JDBC DataFrameWriter
[ https://issues.apache.org/jira/browse/SPARK-16463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16463. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14086 [https://github.com/apache/spark/pull/14086] > Support `truncate` option in Overwrite mode for JDBC DataFrameWriter > - > > Key: SPARK-16463 > URL: https://issues.apache.org/jira/browse/SPARK-16463 > Project: Spark > Issue Type: Improvement >Reporter: Dongjoon Hyun >Priority: Minor > Fix For: 2.1.0 > > > This issue adds a boolean option, `truncate`, for SaveMode.Overwrite of the JDBC > DataFrameWriter. If this option is `true`, it uses `TRUNCATE TABLE` instead of > `DROP TABLE`. > - Without CREATE/DROP privileges, we can save a dataframe to the database; sometimes > these privileges are not granted for security reasons. > - It will keep the existing table information, so users can add and keep some > additional CONSTRAINTs for the table. > - Sometimes, TRUNCATE is faster than the combination of DROP/CREATE. > This issue is different from SPARK-16410, which aims to use `TRUNCATE` only > for JDBC sources. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
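A hedged usage sketch of the new option; the connection details and the {{df}} DataFrame are made up for illustration:
{code}
import java.util.Properties
import org.apache.spark.sql.SaveMode

val props = new Properties()
props.setProperty("user", "spark")
props.setProperty("password", "secret")

// With truncate=true, Overwrite issues TRUNCATE TABLE and reuses the existing
// table definition instead of DROP TABLE followed by CREATE TABLE.
df.write
  .mode(SaveMode.Overwrite)
  .option("truncate", "true")
  .jdbc("jdbc:postgresql://dbhost:5432/testdb", "public.events", props)
{code}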
[jira] [Commented] (SPARK-16685) audit release docs are ambiguous
[ https://issues.apache.org/jira/browse/SPARK-16685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15390990#comment-15390990 ] Sean Owen commented on SPARK-16685: --- [~pwendell] I think you put this in place; do you know if {{audit-release}} is used anymore? It isn't invoked by any of the Spark machinery as far as I can tell, but maybe it is something Jenkins runs or that you had run in the past with a release. [~rxin] has touched this recently. > audit release docs are ambiguous > > > Key: SPARK-16685 > URL: https://issues.apache.org/jira/browse/SPARK-16685 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 1.6.2 >Reporter: jay vyas >Priority: Minor > > The dev/audit-release tooling is ambiguous. > - Should it run against a real cluster? If so, when? > - What should be in the release repo? Just jars? Tarballs? (I assume jars > because it's a .ivy, but not sure.) > - > https://github.com/apache/spark/tree/master/dev/audit-release -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16541) SparkTC application could not shutdown successfully
[ https://issues.apache.org/jira/browse/SPARK-16541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16541. --- Resolution: Cannot Reproduce > SparkTC application could not shutdown successfully > > > Key: SPARK-16541 > URL: https://issues.apache.org/jira/browse/SPARK-16541 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Yesha Vora > > SparkTC application in yarn-client mode was stuck at 10% progress. > {code} spark-submit --class org.apache.spark.examples.SparkTC --master > yarn-client spark-examples-assembly_*.jar {code} > It seems like SparkTC application tasks finished and printed "TC has 6254 > edges.". after that while shutting down, spark application kept getting > "ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event > SparkListenerExecutorMetricsUpdate" > {code} > 16/07/13 08:43:37 INFO DAGScheduler: ResultStage 283 (count at > SparkTC.scala:71) finished in 42.357 s > 16/07/13 08:43:37 INFO DAGScheduler: Job 13 finished: count at > SparkTC.scala:71, took 43.137408 s > TC has 6254 edges. > 16/07/13 08:43:37 INFO ServerConnector: Stopped > ServerConnector@5e0054a2{HTTP/1.1}{0.0.0.0:4040} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@7350a22{/stages/stage/kill,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@54d56a49{/api,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@2fd52e57{/,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@7737ff3{/static,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@499d9067{/executors/threadDump/json,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@40d0c2af{/executors/threadDump,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@44ce4013{/executors/json,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@59c9a28a{/executors,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@2e784443{/environment/json,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@10240ba4{/environment,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@4ee2dd22{/storage/rdd/json,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@79ab14cd{/storage/rdd,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@731d1285{/storage/json,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@72e46ea8{/storage,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@266dcdd5{/stages/pool/json,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@17ee6dd9{/stages/pool,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@717867ea{/stages/stage/json,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@74aaadcc{/stages/stage,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@14f35a42{/stages/json,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO 
ContextHandler: Stopped > o.s.j.s.ServletContextHandler@27ec74f8{/stages,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@148ad9f9{/jobs/job/json,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@14445e4c{/jobs/job,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@6d1557ff{/jobs/json,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@aca62b1{/jobs,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO SparkUI: Stopped Spark web UI at http://xx.xx.xx:4040 > 16/07/13 08:43:37 ERROR LiveListenerBus: SparkListenerBus has already > stopped! Dropping event SparkListenerExecutorMetricsUpdate(2,WrappedArray()) > 16/07/13 08:43:37 ERROR LiveListenerBus: SparkListenerBus has already > stopped! Dropping event SparkListenerExecutorMetricsUpdate(1,WrappedArray()) > 16/07/13 08:43:47 ERROR LiveListenerBus: SparkListenerBus has already > stopped! Dropping event SparkListenerExecutorMetricsUpdate(2,WrappedArray()) > 16/07/13 08:43:47 ERROR LiveListenerBus: SparkListenerBus has already > stopped! Dropping event SparkListenerExecutorMetricsUpdate(1,WrappedArray()) >
[jira] [Updated] (SPARK-16664) Spark 1.6.2 - Persist call on Data frames with more than 200 columns is wiping out the data.
[ https://issues.apache.org/jira/browse/SPARK-16664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-16664: -- Target Version/s: 2.0.1 > Spark 1.6.2 - Persist call on Data frames with more than 200 columns is > wiping out the data. > > > Key: SPARK-16664 > URL: https://issues.apache.org/jira/browse/SPARK-16664 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2 >Reporter: Satish Kolli >Priority: Blocker > > Calling persist on a data frame with more than 200 columns is removing the > data from the data frame. This is an issue in Spark 1.6.2. Works with out any > issues in Spark 1.6.1 > Following test case demonstrates problem. Please let me know if you need any > additional information. Thanks. > {code} > import org.apache.spark._ > import org.apache.spark.rdd.RDD > import org.apache.spark.sql.types._ > import org.apache.spark.sql.{Row, SQLContext} > import org.scalatest.FunSuite > class TestSpark162_1 extends FunSuite { > test("test data frame with 200 columns") { > val sparkConfig = new SparkConf() > val parallelism = 5 > sparkConfig.set("spark.default.parallelism", s"$parallelism") > sparkConfig.set("spark.sql.shuffle.partitions", s"$parallelism") > val sc = new SparkContext(s"local[3]", "TestNestedJson", sparkConfig) > val sqlContext = new SQLContext(sc) > // create dataframe with 200 columns and fake 200 values > val size = 200 > val rdd: RDD[Seq[Long]] = sc.parallelize(Seq(Seq.range(0, size))) > val rowRdd: RDD[Row] = rdd.map(d => Row.fromSeq(d)) > val schemas = List.range(0, size).map(a => StructField("name"+ a, > LongType, true)) > val testSchema = StructType(schemas) > val testDf = sqlContext.createDataFrame(rowRdd, testSchema) > // test value > assert(testDf.persist.take(1).apply(0).toSeq(100).asInstanceOf[Long] == > 100) > sc.stop() > } > test("test data frame with 201 columns") { > val sparkConfig = new SparkConf() > val parallelism = 5 > sparkConfig.set("spark.default.parallelism", s"$parallelism") > sparkConfig.set("spark.sql.shuffle.partitions", s"$parallelism") > val sc = new SparkContext(s"local[3]", "TestNestedJson", sparkConfig) > val sqlContext = new SQLContext(sc) > // create dataframe with 201 columns and fake 201 values > val size = 201 > val rdd: RDD[Seq[Long]] = sc.parallelize(Seq(Seq.range(0, size))) > val rowRdd: RDD[Row] = rdd.map(d => Row.fromSeq(d)) > val schemas = List.range(0, size).map(a => StructField("name"+ a, > LongType, true)) > val testSchema = StructType(schemas) > val testDf = sqlContext.createDataFrame(rowRdd, testSchema) > // test value > assert(testDf.persist.take(1).apply(0).toSeq(100).asInstanceOf[Long] == > 100) > sc.stop() > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16676) Spark jobs stay in pending
[ https://issues.apache.org/jira/browse/SPARK-16676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16676. --- Resolution: Not A Problem If your executors aren't starting, then you have another earlier problem. You'd have to look at your logs to see why. For example, maybe you don't have enough resources available. > Spark jobs stay in pending > -- > > Key: SPARK-16676 > URL: https://issues.apache.org/jira/browse/SPARK-16676 > Project: Spark > Issue Type: Bug > Components: MLlib, Spark Shell >Affects Versions: 1.5.2 > Environment: Mac OS X Yosemite, Terminal, Spark-shell standalone >Reporter: Joe Chong > Attachments: Spark UI stays @ pending.png > > > I've been having issues executing certain Scala statements within the > Spark-Shell. These statements are obtained through tutorial/blog written by > Carol McDonald in MapR. > The import statements, reading text files into DataFrames are OK. However, > when I try to do df.show(), the execution hits a road block. Checking the > Spark UI job, I see that the Stage's active, however, 1 of its dependent job > stays in Pending without any movement. The logs are as below. > scala> fltCountsql.show() > 16/07/22 11:40:16 INFO spark.SparkContext: Starting job: show at :46 > 16/07/22 11:40:16 INFO scheduler.DAGScheduler: Registering RDD 31 (show at > :46) > 16/07/22 11:40:16 INFO scheduler.DAGScheduler: Got job 4 (show at > :46) with 200 output partitions > 16/07/22 11:40:16 INFO scheduler.DAGScheduler: Final stage: ResultStage > 8(show at :46) > 16/07/22 11:40:16 INFO scheduler.DAGScheduler: Parents of final stage: > List(ShuffleMapStage 7) > 16/07/22 11:40:16 INFO scheduler.DAGScheduler: Missing parents: > List(ShuffleMapStage 7) > 16/07/22 11:40:16 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 7 > (MapPartitionsRDD[31] at show at :46), which has no missing parents > 16/07/22 11:40:16 INFO storage.MemoryStore: ensureFreeSpace(18128) called > with curMem=115755879, maxMem=2778495713 > 16/07/22 11:40:16 INFO storage.MemoryStore: Block broadcast_5 stored as > values in memory (estimated size 17.7 KB, free 2.5 GB) > 16/07/22 11:40:16 INFO storage.MemoryStore: ensureFreeSpace(7527) called with > curMem=115774007, maxMem=2778495713 > 16/07/22 11:40:16 INFO storage.MemoryStore: Block broadcast_5_piece0 stored > as bytes in memory (estimated size 7.4 KB, free 2.5 GB) > 16/07/22 11:40:16 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in > memory on localhost:61408 (size: 7.4 KB, free: 2.5 GB) > 16/07/22 11:40:16 INFO spark.SparkContext: Created broadcast 5 from broadcast > at DAGScheduler.scala:861 > 16/07/22 11:40:16 INFO scheduler.DAGScheduler: Submitting 2 missing tasks > from ShuffleMapStage 7 (MapPartitionsRDD[31] at show at :46) > 16/07/22 11:40:16 INFO scheduler.TaskSchedulerImpl: Adding task set 7.0 with > 2 tasks > 16/07/22 11:40:16 INFO scheduler.TaskSetManager: Starting task 0.0 in stage > 7.0 (TID 4, localhost, PROCESS_LOCAL, 2156 bytes) > 16/07/22 11:40:16 INFO executor.Executor: Running task 0.0 in stage 7.0 (TID > 4) > 16/07/22 11:40:16 INFO storage.BlockManager: Found block rdd_2_0 locally > 16/07/22 11:40:17 INFO executor.Executor: Finished task 0.0 in stage 7.0 (TID > 4). 
2738 bytes result sent to driver > 16/07/22 11:40:17 INFO scheduler.TaskSetManager: Finished task 0.0 in stage > 7.0 (TID 4) in 920 ms on localhost (1/2) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16573) executor stderr processing tools
[ https://issues.apache.org/jira/browse/SPARK-16573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16573. --- Resolution: Won't Fix > executor stderr processing tools > - > > Key: SPARK-16573 > URL: https://issues.apache.org/jira/browse/SPARK-16573 > Project: Spark > Issue Type: Improvement > Components: Input/Output > Environment: spark 1.6.0 >Reporter: Norman He > Original Estimate: 672h > Remaining Estimate: 672h > > 1) From each executor, I can view its stderr. > 2) I would like to filter the stderr of every executor (say I am interested in > the #18 output of each executor) and sum up the numbers in them by grepping a > certain field. > 3) TensorBoard has this kind of capability. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16574) Distribute computing to each node based on certain hints
[ https://issues.apache.org/jira/browse/SPARK-16574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16574. --- Resolution: Not A Problem > Distribute computing to each node based on certain hints > > > Key: SPARK-16574 > URL: https://issues.apache.org/jira/browse/SPARK-16574 > Project: Spark > Issue Type: Wish >Reporter: Norman He > > 1) I have a gpuWorkers RDD like the following (each node has 2 GPUs): > val nodes = 10 > val gpuCount = 2 > val cross: Array[(Int, Int)] = for( x <- Array.range(0, nodes); y <- > Array.range(0, gpuCount) ) yield (x, y) > var gpuWorkers: RDD[(Int, Int)] = sc.parallelize(cross, nodes * gpuCount) > 2) When the executors run, I would somehow like to distribute code to each node > based on the GPU index (y) from cross so that both GPUs on each machine can be > used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
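For what it's worth, a rough sketch of how the quoted (node, gpuIndex) pairs could drive per-task GPU selection. Spark gives no guarantee about which host a partition runs on unless locality hints are supplied, and {{selectGpu}} below is a hypothetical placeholder, not a Spark API:
{code}
// One partition per (node, gpuIndex) pair, so each task sees exactly one pair.
val results = gpuWorkers.map { case (node, gpuIndex) =>
  // selectGpu(gpuIndex)  // hypothetical: bind this task to local GPU #gpuIndex
  (node, gpuIndex, java.net.InetAddress.getLocalHost.getHostName)
}
results.collect().foreach(println)
{code}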
[jira] [Resolved] (SPARK-16601) Spark2.0 fail in creating table using sql statement "create table `db.tableName` xxx" while spark1.6 supports
[ https://issues.apache.org/jira/browse/SPARK-16601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16601. --- Resolution: Not A Problem > Spark2.0 fail in creating table using sql statement "create table > `db.tableName` xxx" while spark1.6 supports > - > > Key: SPARK-16601 > URL: https://issues.apache.org/jira/browse/SPARK-16601 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: marymwu >Priority: Minor > Attachments: error log.png > > > Spark2.0 fail in creating table using sql statement "create table > `db.tableName` xxx" while spark1.6 supports. > error log is attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
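For context, a hedged illustration of the behavior difference (assuming an existing SparkSession {{spark}}): in Spark 2.0 the backticks quote a single identifier, so `db.tableName` becomes one name containing a dot, whereas quoting the parts separately expresses the usual database-plus-table intent:
{code}
spark.sql("CREATE TABLE `db`.`tableName` (id INT)")
// spark.sql("CREATE TABLE `db.tableName` (id INT)")  // parsed differently in 2.0 than in 1.6
{code}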
[jira] [Assigned] (SPARK-16694) Use for/foreach rather than map for Unit expressions whose side effects are required
[ https://issues.apache.org/jira/browse/SPARK-16694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16694: Assignee: Apache Spark (was: Sean Owen) > Use for/foreach rather than map for Unit expressions whose side effects are > required > > > Key: SPARK-16694 > URL: https://issues.apache.org/jira/browse/SPARK-16694 > Project: Spark > Issue Type: Improvement > Components: Examples, MLlib, Spark Core, SQL, Streaming >Reporter: Sean Owen >Assignee: Apache Spark >Priority: Minor > > {{map}} is misused in many places where {{foreach}} is intended. This caused > a bug in https://issues.apache.org/jira/browse/SPARK-16664 and might be a > latent bug elsewhere; it's also easy to find with IJ inspections. Worth > patching up. > To illustrate the general problem, {{map}} happens to work in Scala where the > collection isn't lazy, but will fail to execute the code when it is. {{map}} > also causes a collection of {{Unit}} to be created pointlessly. > {code} > scala> val foo = Seq(1,2,3) > foo: Seq[Int] = List(1, 2, 3) > scala> foo.map(println) > 1 > 2 > 3 > res0: Seq[Unit] = List((), (), ()) > scala> foo.view.map(println) > res1: scala.collection.SeqView[Unit,Seq[_]] = SeqViewM(...) > scala> foo.view.foreach(println) > 1 > 2 > 3 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16694) Use for/foreach rather than map for Unit expressions whose side effects are required
[ https://issues.apache.org/jira/browse/SPARK-16694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16694: Assignee: Sean Owen (was: Apache Spark) > Use for/foreach rather than map for Unit expressions whose side effects are > required > > > Key: SPARK-16694 > URL: https://issues.apache.org/jira/browse/SPARK-16694 > Project: Spark > Issue Type: Improvement > Components: Examples, MLlib, Spark Core, SQL, Streaming >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > > {{map}} is misused in many places where {{foreach}} is intended. This caused > a bug in https://issues.apache.org/jira/browse/SPARK-16664 and might be a > latent bug elsewhere; it's also easy to find with IJ inspections. Worth > patching up. > To illustrate the general problem, {{map}} happens to work in Scala where the > collection isn't lazy, but will fail to execute the code when it is. {{map}} > also causes a collection of {{Unit}} to be created pointlessly. > {code} > scala> val foo = Seq(1,2,3) > foo: Seq[Int] = List(1, 2, 3) > scala> foo.map(println) > 1 > 2 > 3 > res0: Seq[Unit] = List((), (), ()) > scala> foo.view.map(println) > res1: scala.collection.SeqView[Unit,Seq[_]] = SeqViewM(...) > scala> foo.view.foreach(println) > 1 > 2 > 3 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16694) Use for/foreach rather than map for Unit expressions whose side effects are required
[ https://issues.apache.org/jira/browse/SPARK-16694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15390982#comment-15390982 ] Apache Spark commented on SPARK-16694: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/14332 > Use for/foreach rather than map for Unit expressions whose side effects are > required > > > Key: SPARK-16694 > URL: https://issues.apache.org/jira/browse/SPARK-16694 > Project: Spark > Issue Type: Improvement > Components: Examples, MLlib, Spark Core, SQL, Streaming >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > > {{map}} is misused in many places where {{foreach}} is intended. This caused > a bug in https://issues.apache.org/jira/browse/SPARK-16664 and might be a > latent bug elsewhere; it's also easy to find with IJ inspections. Worth > patching up. > To illustrate the general problem, {{map}} happens to work in Scala where the > collection isn't lazy, but will fail to execute the code when it is. {{map}} > also causes a collection of {{Unit}} to be created pointlessly. > {code} > scala> val foo = Seq(1,2,3) > foo: Seq[Int] = List(1, 2, 3) > scala> foo.map(println) > 1 > 2 > 3 > res0: Seq[Unit] = List((), (), ()) > scala> foo.view.map(println) > res1: scala.collection.SeqView[Unit,Seq[_]] = SeqViewM(...) > scala> foo.view.foreach(println) > 1 > 2 > 3 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16694) Use for/foreach rather than map for Unit expressions whose side effects are required
Sean Owen created SPARK-16694: - Summary: Use for/foreach rather than map for Unit expressions whose side effects are required Key: SPARK-16694 URL: https://issues.apache.org/jira/browse/SPARK-16694 Project: Spark Issue Type: Improvement Components: Examples, MLlib, Spark Core, SQL, Streaming Reporter: Sean Owen Assignee: Sean Owen Priority: Minor {{map}} is misused in many places where {{foreach}} is intended. This caused a bug in https://issues.apache.org/jira/browse/SPARK-16664 and might be a latent bug elsewhere; it's also easy to find with IJ inspections. Worth patching up. To illustrate the general problem, {{map}} happens to work in Scala where the collection isn't lazy, but will fail to execute the code when it is. {{map}} also causes a collection of {{Unit}} to be created pointlessly. {code} scala> val foo = Seq(1,2,3) foo: Seq[Int] = List(1, 2, 3) scala> foo.map(println) 1 2 3 res0: Seq[Unit] = List((), (), ()) scala> foo.view.map(println) res1: scala.collection.SeqView[Unit,Seq[_]] = SeqViewM(...) scala> foo.view.foreach(println) 1 2 3 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org