[jira] [Resolved] (SPARK-16173) Can't join describe() of DataFrame in Scala 2.10

2016-06-24 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-16173.

   Resolution: Fixed
Fix Version/s: 1.6.2

Issue resolved by pull request 13902
[https://github.com/apache/spark/pull/13902]

> Can't join describe() of DataFrame in Scala 2.10
> 
>
> Key: SPARK-16173
> URL: https://issues.apache.org/jira/browse/SPARK-16173
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.1, 2.0.0
>Reporter: Davies Liu
> Fix For: 1.6.2
>
>
> describe() of DataFrame uses Seq() (it's actually an Iterator) to create 
> another DataFrame, which cannot be serialized in Scala 2.10.
> {code}
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2060)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:707)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:706)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
>   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:706)
>   at 
> org.apache.spark.sql.execution.ConvertToUnsafe.doExecute(rowFormatConverters.scala:38)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1$$anonfun$apply$1.apply(BroadcastHashJoin.scala:82)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1$$anonfun$apply$1.apply(BroadcastHashJoin.scala:79)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:100)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1.apply(BroadcastHashJoin.scala:79)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1.apply(BroadcastHashJoin.scala:79)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.NotSerializableException: 
> scala.collection.Iterator$$anon$11
> Serialization stack:
>   - object not serializable (class: scala.collection.Iterator$$anon$11, 
> value: empty iterator)
>   - field (class: scala.collection.Iterator$$anonfun$toStream$1, name: 
> $outer, type: interface scala.collection.Iterator)
>   - object (class scala.collection.Iterator$$anonfun$toStream$1, 
> )
>   - field (class: scala.collection.immutable.Stream$Cons, name: tl, type: 
> interface scala.Function0)
>   - object (class scala.collection.immutable.Stream$Cons, 
> Stream(WrappedArray(1), WrappedArray(2.0), WrappedArray(NaN), 
> WrappedArray(2), WrappedArray(2)))
>   - field (class: scala.collection.immutable.Stream$$anonfun$zip$1, name: 
> $outer, type: class scala.collection.immutable.Stream)
>   - object (class scala.collection.immutable.Stream$$anonfun$zip$1, 
> )
>   - field (class: scala.collection.immutable.Stream$Cons, name: tl, type: 
> interface scala.Function0)
>   - object (class scala.collection.immutable.Stream$Cons, 
> Stream((WrappedArray(1),(count,)), 
> (WrappedArray(2.0),(mean,)), 
> (WrappedArray(NaN),(stddev,)), 
> (WrappedArray(2),(min,)), (WrappedArray(2),(max,
>   - field (class: scala.collection.immutable.Stream$$anonfun$map$1, name: 
> $outer, type: class scala.collection.immutable.Stream)
>   - object (class scala.collection.immutable.Stream$$anonfun$map$1, 
> )
>   - field (class: scala.collection.immutable.Stream$Cons, name: tl, type: 
> interface scala.Function0)
>   - object (class scala.collection.immutable.Stream$Cons, 
> 

[jira] [Resolved] (SPARK-16186) Support partition batch pruning with `IN` predicate in InMemoryTableScanExec

2016-06-24 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-16186.

   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 13887
[https://github.com/apache/spark/pull/13887]

> Support partition batch pruning with `IN` predicate in InMemoryTableScanExec
> 
>
> Key: SPARK-16186
> URL: https://issues.apache.org/jira/browse/SPARK-16186
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dongjoon Hyun
> Fix For: 2.1.0
>
>
> One of the most frequent usage patterns for Spark SQL is using **cached 
> tables**.
> This issue improves `InMemoryTableScanExec` to handle `IN` predicates 
> efficiently by pruning partition batches.
> Of course, the performance improvement varies across queries and datasets. For 
> the following simple query, the query duration in the Spark UI goes from 9 
> seconds to 50~90 ms, which is over 100 times faster.
> {code}
> $ bin/spark-shell --driver-memory 6G
> scala> val df = spark.range(20)
> scala> df.createOrReplaceTempView("t")
> scala> spark.catalog.cacheTable("t")
> scala> sql("select id from t where id = 1").collect()// About 2 mins
> scala> sql("select id from t where id = 1").collect()// less than 90ms
> scala> sql("select id from t where id in (1,2,3)").collect()  // 9 seconds 
> (Before)
> scala> sql("select id from t where id in (1,2,3)").collect() // less than 
> 90ms (After)
> {code}
> This issue impacts over 35 TPC-DS queries when the tables are cached.
> Note that this optimization applies to IN. To apply it to an IN predicate with 
> more than 10 items, the *spark.sql.optimizer.inSetConversionThreshold* option 
> should be increased.
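As a rough illustration of the batch-pruning idea above (a hypothetical standalone 
Scala sketch, not the actual InMemoryTableScanExec code): each cached batch keeps 
min/max statistics per column, and an IN list lets the scan skip any batch whose 
range contains none of the listed values.

{code}
// Hypothetical sketch; BatchStats and mayContainAny are illustrative names,
// not Spark classes.
case class BatchStats(min: Long, max: Long)

def mayContainAny(stats: BatchStats, inList: Seq[Long]): Boolean =
  inList.exists(v => stats.min <= v && v <= stats.max)

// A batch holding ids 100..199 can be skipped for WHERE id IN (1, 2, 3):
assert(!mayContainAny(BatchStats(100L, 199L), Seq(1L, 2L, 3L)))
{code}

Raising the threshold (for example, spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", 20) 
in a 2.x spark-shell) keeps longer IN lists planned as {{In}} rather than {{InSet}}, 
so they remain eligible for this pruning.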






[jira] [Resolved] (SPARK-16077) Python UDF may fail because of six

2016-06-24 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-16077.

   Resolution: Fixed
Fix Version/s: 2.0.1
   1.6.3

Issue resolved by pull request 13788
[https://github.com/apache/spark/pull/13788]

> Python UDF may fail because of six
> --
>
> Key: SPARK-16077
> URL: https://issues.apache.org/jira/browse/SPARK-16077
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Davies Liu
> Fix For: 1.6.3, 2.0.1
>
>
> six or another package may break pickle.whichmodule() in pickle:
> https://bitbucket.org/gutworth/six/issues/63/importing-six-breaks-pickling






[jira] [Assigned] (SPARK-16077) Python UDF may fail because of six

2016-06-24 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-16077:
--

Assignee: Davies Liu

> Python UDF may fail because of six
> --
>
> Key: SPARK-16077
> URL: https://issues.apache.org/jira/browse/SPARK-16077
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 1.6.3, 2.0.1
>
>
> six or another package may break pickle.whichmodule() in pickle:
> https://bitbucket.org/gutworth/six/issues/63/importing-six-breaks-pickling






[jira] [Assigned] (SPARK-16179) UDF explosion yielding empty dataframe fails

2016-06-23 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-16179:
--

Assignee: Davies Liu

> UDF explosion yielding empty dataframe fails
> 
>
> Key: SPARK-16179
> URL: https://issues.apache.org/jira/browse/SPARK-16179
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Vladimir Feinberg
>Assignee: Davies Liu
>
> Command to replicate 
> https://gist.github.com/vlad17/cff2bab81929f44556a364ee90981ac0
> Resulting failure
> https://gist.github.com/vlad17/964c0a93510d79cb130c33700f6139b7






[jira] [Updated] (SPARK-16179) UDF explosion yielding empty dataframe fails

2016-06-23 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-16179:
---
Affects Version/s: 2.0.0

> UDF explosion yielding empty dataframe fails
> 
>
> Key: SPARK-16179
> URL: https://issues.apache.org/jira/browse/SPARK-16179
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Vladimir Feinberg
>
> Command to replicate 
> https://gist.github.com/vlad17/cff2bab81929f44556a364ee90981ac0
> Resulting failure
> https://gist.github.com/vlad17/964c0a93510d79cb130c33700f6139b7






[jira] [Updated] (SPARK-16180) Task hang on fetching blocks (cached RDD)

2016-06-23 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-16180:
---
Description: 
Here is the stack dump of an executor:

{code}
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
scala.concurrent.Await$.result(package.scala:107)
org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:102)
org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:588)
org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:585)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:585)
org.apache.spark.storage.BlockManager.getRemote(BlockManager.scala:570)
org.apache.spark.storage.BlockManager.get(BlockManager.scala:630)
org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:46)
org.apache.spark.scheduler.Task.run(Task.scala:96)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:222)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)

{code}

  was:
{code}
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
scala.concurrent.Await$.result(package.scala:107)
org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:102)
org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:588)
org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:585)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:585)
org.apache.spark.storage.BlockManager.getRemote(BlockManager.scala:570)
org.apache.spark.storage.BlockManager.get(BlockManager.scala:630)

[jira] [Updated] (SPARK-16180) Task hang on fetching blocks (cached RDD)

2016-06-23 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-16180:
---
Affects Version/s: 1.6.1

> Task hang on fetching blocks (cached RDD)
> -
>
> Key: SPARK-16180
> URL: https://issues.apache.org/jira/browse/SPARK-16180
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.6.1
>Reporter: Davies Liu
>
> {code}
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> scala.concurrent.Await$.result(package.scala:107)
> org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:102)
> org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:588)
> org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:585)
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:585)
> org.apache.spark.storage.BlockManager.getRemote(BlockManager.scala:570)
> org.apache.spark.storage.BlockManager.get(BlockManager.scala:630)
> org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:46)
> org.apache.spark.scheduler.Task.run(Task.scala:96)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:222)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> java.lang.Thread.run(Thread.java:745)
> {code}






[jira] [Created] (SPARK-16180) Task hang on fetching blocks (cached RDD)

2016-06-23 Thread Davies Liu (JIRA)
Davies Liu created SPARK-16180:
--

 Summary: Task hang on fetching blocks (cached RDD)
 Key: SPARK-16180
 URL: https://issues.apache.org/jira/browse/SPARK-16180
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu


{code}
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
scala.concurrent.Await$.result(package.scala:107)
org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:102)
org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:588)
org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:585)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:585)
org.apache.spark.storage.BlockManager.getRemote(BlockManager.scala:570)
org.apache.spark.storage.BlockManager.get(BlockManager.scala:630)
org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:46)
org.apache.spark.scheduler.Task.run(Task.scala:96)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:222)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)

{code}






[jira] [Updated] (SPARK-16175) Handle None for all Python UDT

2016-06-23 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-16175:
---
Affects Version/s: 2.0.0
   1.6.1

> Handle None for all Python UDT
> --
>
> Key: SPARK-16175
> URL: https://issues.apache.org/jira/browse/SPARK-16175
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Davies Liu
> Attachments: nullvector.dbc
>
>
> For Scala UDTs, we do not call serialize()/deserialize() for null values; we 
> should also do that in Python.






[jira] [Created] (SPARK-16175) Handle None for all Python UDT

2016-06-23 Thread Davies Liu (JIRA)
Davies Liu created SPARK-16175:
--

 Summary: Handle None for all Python UDT
 Key: SPARK-16175
 URL: https://issues.apache.org/jira/browse/SPARK-16175
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Reporter: Davies Liu


For Scala UDTs, we do not call serialize()/deserialize() for null values; we 
should also do that in Python.
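A minimal Scala sketch of the guard being referred to (illustrative only; 
{{nullSafe}} is a hypothetical helper, not Spark code): the converter is simply 
never invoked when the value is null, and the Python UDT path should behave the 
same way.

{code}
// Hypothetical illustration of the null short-circuit around a UDT's
// serialize()/deserialize(); not taken from the Spark source.
def nullSafe[A >: Null <: AnyRef, B >: Null <: AnyRef](convert: A => B)(value: A): B =
  if (value == null) null else convert(value)

// Usage sketch: nullSafe(myUdt.serialize _)(obj), nullSafe(myUdt.deserialize _)(datum)
{code}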






[jira] [Commented] (SPARK-16163) Statistics of logical plan is super slow on large query

2016-06-23 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347136#comment-15347136
 ] 

Davies Liu commented on SPARK-16163:


[~srowen] yes, thanks for correcting it.

> Statistics of logical plan is super slow on large query
> ---
>
> Key: SPARK-16163
> URL: https://issues.apache.org/jira/browse/SPARK-16163
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Davies Liu
> Fix For: 2.0.1
>
>
> It took several minutes to plan TPC-DS Q64, because canBroadcast() is super 
> slow on a large plan.
> Right now we consider the schema in statistics(), so it is no longer trivial; 
> we should cache the result (using a lazy val rather than a def).






[jira] [Created] (SPARK-16173) Can't join describe() of DataFrame in Scala 2.10

2016-06-23 Thread Davies Liu (JIRA)
Davies Liu created SPARK-16173:
--

 Summary: Can't join describe() of DataFrame in Scala 2.10
 Key: SPARK-16173
 URL: https://issues.apache.org/jira/browse/SPARK-16173
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.1, 1.5.2, 2.0.0
Reporter: Davies Liu


describe() of DataFrame uses Seq() (it's actually an Iterator) to create another 
DataFrame, which cannot be serialized in Scala 2.10.
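A hypothetical minimal reproduction (illustrative, not taken verbatim from the 
ticket), assuming a Scala 2.10 build of Spark 1.5/1.6: joining the summary 
DataFrame returned by describe() with another DataFrame forces the plan to be 
serialized and fails with a "Task not serializable" error like the one below.

{code}
// Hypothetical repro sketch for a Scala 2.10 build; d1 and d2 are illustrative names.
val d1 = sqlContext.range(10).toDF("a").describe("a")   // columns: summary, a
val d2 = sqlContext.range(10).toDF("b").describe("b")   // columns: summary, b
d1.join(d2, "summary").count()                          // fails: Task not serializable
{code}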

{code}
org.apache.spark.SparkException: Task not serializable
at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2060)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:707)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:706)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:706)
at 
org.apache.spark.sql.execution.ConvertToUnsafe.doExecute(rowFormatConverters.scala:38)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1$$anonfun$apply$1.apply(BroadcastHashJoin.scala:82)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1$$anonfun$apply$1.apply(BroadcastHashJoin.scala:79)
at 
org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:100)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1.apply(BroadcastHashJoin.scala:79)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1.apply(BroadcastHashJoin.scala:79)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.NotSerializableException: scala.collection.Iterator$$anon$11
Serialization stack:
- object not serializable (class: scala.collection.Iterator$$anon$11, 
value: empty iterator)
- field (class: scala.collection.Iterator$$anonfun$toStream$1, name: 
$outer, type: interface scala.collection.Iterator)
- object (class scala.collection.Iterator$$anonfun$toStream$1, 
)
- field (class: scala.collection.immutable.Stream$Cons, name: tl, type: 
interface scala.Function0)
- object (class scala.collection.immutable.Stream$Cons, 
Stream(WrappedArray(1), WrappedArray(2.0), WrappedArray(NaN), WrappedArray(2), 
WrappedArray(2)))
- field (class: scala.collection.immutable.Stream$$anonfun$zip$1, name: 
$outer, type: class scala.collection.immutable.Stream)
- object (class scala.collection.immutable.Stream$$anonfun$zip$1, 
)
- field (class: scala.collection.immutable.Stream$Cons, name: tl, type: 
interface scala.Function0)
- object (class scala.collection.immutable.Stream$Cons, 
Stream((WrappedArray(1),(count,)), 
(WrappedArray(2.0),(mean,)), 
(WrappedArray(NaN),(stddev,)), (WrappedArray(2),(min,)), 
(WrappedArray(2),(max,
- field (class: scala.collection.immutable.Stream$$anonfun$map$1, name: 
$outer, type: class scala.collection.immutable.Stream)
- object (class scala.collection.immutable.Stream$$anonfun$map$1, 
)
- field (class: scala.collection.immutable.Stream$Cons, name: tl, type: 
interface scala.Function0)
- object (class scala.collection.immutable.Stream$Cons, 
Stream([count,1], [mean,2.0], [stddev,NaN], [min,2], [max,2]))
- field (class: scala.collection.immutable.Stream$$anonfun$map$1, name: 
$outer, type: class scala.collection.immutable.Stream)
- object (class scala.collection.immutable.Stream$$anonfun$map$1, 
)
- field (class: scala.collection.immutable.Stream$Cons, name: tl, type: 
interface scala.Function0)
- object (class 

[jira] [Resolved] (SPARK-16163) Statistics of logical plan is super slow on large query

2016-06-23 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-16163.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13871
[https://github.com/apache/spark/pull/13871]

> Statistics of logical plan is super slow on large query
> ---
>
> Key: SPARK-16163
> URL: https://issues.apache.org/jira/browse/SPARK-16163
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Davies Liu
> Fix For: 2.0.0
>
>
> It took several minutes to plan TPC-DS Q64, because canBroadcast() is super 
> slow on a large plan.
> Right now we consider the schema in statistics(), so it is no longer trivial; 
> we should cache the result (using a lazy val rather than a def).






[jira] [Updated] (SPARK-16163) Statistics of logical plan is super slow on large query

2016-06-22 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-16163:
---
Affects Version/s: 2.0.0

> Statistics of logical plan is super slow on large query
> ---
>
> Key: SPARK-16163
> URL: https://issues.apache.org/jira/browse/SPARK-16163
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Davies Liu
>
> It took several minutes to plan TPC-DS Q64, because canBroadcast() is super 
> slow on a large plan.
> Right now we consider the schema in statistics(), so it is no longer trivial; 
> we should cache the result (using a lazy val rather than a def).






[jira] [Created] (SPARK-16163) Statistics of logical plan is super slow on large query

2016-06-22 Thread Davies Liu (JIRA)
Davies Liu created SPARK-16163:
--

 Summary: Statistics of logical plan is super slow on large query
 Key: SPARK-16163
 URL: https://issues.apache.org/jira/browse/SPARK-16163
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Davies Liu


It took several minutes to plan TPC-DS Q64, because canBroadcast() is super 
slow on a large plan.

Right now we consider the schema in statistics(), so it is no longer trivial; 
we should cache the result (using a lazy val rather than a def).
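A toy Scala sketch of the def-vs-lazy-val point (hypothetical code with a 
simplified Long standing in for the real Statistics object): with a def, every 
canBroadcast()-style check re-derives statistics for the whole subtree; with a 
lazy val, each plan node computes and caches them once.

{code}
// Hypothetical illustration only; PlanNode is not a Spark class.
abstract class PlanNode {
  def children: Seq[PlanNode]

  // With `def statistics`, every call walks and recomputes the whole subtree:
  // def statistics: Long = children.map(_.statistics).sum + 1L

  // With `lazy val`, the result is computed once per node and then cached:
  lazy val statistics: Long = children.map(_.statistics).sum + 1L
}
{code}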






[jira] [Resolved] (SPARK-16003) SerializationDebugger runs into infinite loop

2016-06-22 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-16003.

   Resolution: Fixed
Fix Version/s: 2.0.0

> SerializationDebugger runs into infinite loop
> 
>
> Key: SPARK-16003
> URL: https://issues.apache.org/jira/browse/SPARK-16003
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Davies Liu
>Assignee: Eric Liang
>Priority: Critical
> Fix For: 2.0.0
>
>
> This is observed while debugging 
> https://issues.apache.org/jira/browse/SPARK-15811
> {code}
> sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source) 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> java.lang.reflect.Method.invoke(Method.java:497) 
> java.io.ObjectStreamClass.invokeWriteReplace(ObjectStreamClass.java:1118) 
> sun.reflect.GeneratedMethodAccessor84.invoke(Unknown Source) 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> java.lang.reflect.Method.invoke(Method.java:497) 
> org.apache.spark.serializer.SerializationDebugger$ObjectStreamClassMethods$.invokeWriteReplace$extension(SerializationDebugger.scala:347)
>  
> org.apache.spark.serializer.SerializationDebugger$.org$apache$spark$serializer$SerializationDebugger$$findObjectAndDescriptor(SerializationDebugger.scala:269)
>  
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:154)
>  
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108)
>  
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:206)
>  
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108)
>  
> org.apache.spark.serializer.SerializationDebugger$.find(SerializationDebugger.scala:67)
>  
> org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
>  
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
>  
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
>  
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
>  
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
>  
> org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108) 
> org.apache.spark.SparkContext.clean(SparkContext.scala:2038) 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:789)
>  
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:788)
>  
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  org.apache.spark.rdd.RDD.withScope(RDD.scala:357) 
> org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:788) 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:355)
>  
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>  
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>  
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>  
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:240) 
> org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:323) 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39)
>  
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2163)
>  
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>  org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2489) 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2162)
>  
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2169)
>  org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1905) 
> org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1904) 
> org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2519) 
> org.apache.spark.sql.Dataset.head(Dataset.scala:1904) 
> org.apache.spark.sql.Dataset.take(Dataset.scala:2119) 
> 

[jira] [Updated] (SPARK-16003) SerializationDebugger runs into infinite loop

2016-06-22 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-16003:
---
Assignee: Eric Liang

> SerializationDebugger runs into infinite loop
> 
>
> Key: SPARK-16003
> URL: https://issues.apache.org/jira/browse/SPARK-16003
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Davies Liu
>Assignee: Eric Liang
>Priority: Critical
> Fix For: 2.0.0
>
>
> This is observed while debugging 
> https://issues.apache.org/jira/browse/SPARK-15811
> {code}
> sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source) 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> java.lang.reflect.Method.invoke(Method.java:497) 
> java.io.ObjectStreamClass.invokeWriteReplace(ObjectStreamClass.java:1118) 
> sun.reflect.GeneratedMethodAccessor84.invoke(Unknown Source) 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> java.lang.reflect.Method.invoke(Method.java:497) 
> org.apache.spark.serializer.SerializationDebugger$ObjectStreamClassMethods$.invokeWriteReplace$extension(SerializationDebugger.scala:347)
>  
> org.apache.spark.serializer.SerializationDebugger$.org$apache$spark$serializer$SerializationDebugger$$findObjectAndDescriptor(SerializationDebugger.scala:269)
>  
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:154)
>  
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108)
>  
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:206)
>  
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108)
>  
> org.apache.spark.serializer.SerializationDebugger$.find(SerializationDebugger.scala:67)
>  
> org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
>  
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
>  
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
>  
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
>  
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
>  
> org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108) 
> org.apache.spark.SparkContext.clean(SparkContext.scala:2038) 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:789)
>  
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:788)
>  
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  org.apache.spark.rdd.RDD.withScope(RDD.scala:357) 
> org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:788) 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:355)
>  
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>  
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>  
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>  
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:240) 
> org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:323) 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39)
>  
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2163)
>  
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>  org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2489) 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2162)
>  
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2169)
>  org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1905) 
> org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1904) 
> org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2519) 
> org.apache.spark.sql.Dataset.head(Dataset.scala:1904) 
> org.apache.spark.sql.Dataset.take(Dataset.scala:2119) 
> com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:80)
>  
> 

[jira] [Updated] (SPARK-16104) Do not create CSV writer object for every flush when writing

2016-06-21 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-16104:
---
Assignee: Hyukjin Kwon

> Do not create CSV writer object for every flush when writing
> -
>
> Key: SPARK-16104
> URL: https://issues.apache.org/jira/browse/SPARK-16104
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
> Fix For: 2.1.0
>
>
> Initially, the CSV data source created a {{CsvWriter}} for each record, but that 
> was fixed in SPARK-14031.
> However, it still creates a writer for each flush in {{LineCsvWriter}}. This is 
> not necessary; it would be better to reuse a single {{CsvWriter}}.
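A rough Scala sketch of the reuse pattern being suggested (hypothetical standalone 
code built on java.io.PrintWriter, not the actual {{LineCsvWriter}} code): the 
underlying writer is constructed once and only flushed when the buffered rows are 
written out.

{code}
// Hypothetical illustration of "one writer, many flushes"; BufferedCsvOutput is
// not a Spark class, and CSV quoting/escaping is omitted.
class BufferedCsvOutput(out: java.io.Writer) {
  private val writer = new java.io.PrintWriter(out)   // created once, reused across flushes
  private val buffer = scala.collection.mutable.ArrayBuffer.empty[Seq[String]]

  def writeRow(row: Seq[String]): Unit = buffer += row

  def flush(): Unit = {
    buffer.foreach(row => writer.println(row.mkString(",")))
    buffer.clear()
    writer.flush()                                     // no new writer object per flush
  }
}
{code}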






[jira] [Resolved] (SPARK-16104) Do not create CSV writer object for every flush when writing

2016-06-21 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-16104.

   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 13809
[https://github.com/apache/spark/pull/13809]

> Do not create CSV writer object for every flush when writing
> -
>
> Key: SPARK-16104
> URL: https://issues.apache.org/jira/browse/SPARK-16104
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Hyukjin Kwon
> Fix For: 2.1.0
>
>
> Initially, the CSV data source created a {{CsvWriter}} for each record, but that 
> was fixed in SPARK-14031.
> However, it still creates a writer for each flush in {{LineCsvWriter}}. This is 
> not necessary; it would be better to reuse a single {{CsvWriter}}.






[jira] [Resolved] (SPARK-16086) Python UDF failed when there are no arguments

2016-06-21 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-16086.

   Resolution: Fixed
Fix Version/s: (was: 1.6.2)
   (was: 1.5.3)
   2.0.0

Issue resolved by pull request 13812
[https://github.com/apache/spark/pull/13812]

> Python UDF failed when there are no arguments
> 
>
> Key: SPARK-16086
> URL: https://issues.apache.org/jira/browse/SPARK-16086
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.2, 1.6.1
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> {code}
> >>> sqlContext.registerFunction("f", lambda : "a")
> >>> sqlContext.sql("select f()").show()
> {code}
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 171.0 failed 4 times, most recent failure: Lost task 0.3 in stage 171.0 
> (TID 6226, ip-10-0-243-36.us-west-2.compute.internal): 
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File "/databricks/spark/python/pyspark/worker.py", line 111, in main
> process()
>   File "/databricks/spark/python/pyspark/worker.py", line 106, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/databricks/spark/python/pyspark/serializers.py", line 263, in 
> dump_stream
> vs = list(itertools.islice(iterator, batch))
>   File "/databricks/spark/python/pyspark/serializers.py", line 139, in 
> load_stream
> yield self._read_with_length(stream)
>   File "/databricks/spark/python/pyspark/serializers.py", line 164, in 
> _read_with_length
> return self.loads(obj)
>   File "/databricks/spark/python/pyspark/serializers.py", line 422, in loads
> return pickle.loads(obj)
>   File "/databricks/spark/python/pyspark/sql/types.py", line 1159, in 
> return lambda *a: dataType.fromInternal(a)
>   File "/databricks/spark/python/pyspark/sql/types.py", line 568, in 
> fromInternal
> return _create_row(self.names, values)
>   File "/databricks/spark/python/pyspark/sql/types.py", line 1163, in 
> _create_row
> row = Row(*values)
>   File "/databricks/spark/python/pyspark/sql/types.py", line 1210, in __new__
> raise ValueError("No args or kwargs")
> ValueError: (ValueError('No args or kwargs',),  at 
> 0x7f3bbc463320>, ())
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:207)
>   at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
>   at 
> org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1.apply(python.scala:405)
>   at 
> org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1.apply(python.scala:370)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:72)
>   at org.apache.spark.scheduler.Task.run(Task.scala:96)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:222)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}






[jira] [Commented] (SPARK-16077) Python UDF may fail because of six

2016-06-21 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342262#comment-15342262
 ] 

Davies Liu commented on SPARK-16077:


[~bill_chambers] We fixed it for some cases; it could still fail in other corner 
cases.

> Python UDF may fail because of six
> --
>
> Key: SPARK-16077
> URL: https://issues.apache.org/jira/browse/SPARK-16077
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Davies Liu
>
> six or another package may break pickle.whichmodule() in pickle:
> https://bitbucket.org/gutworth/six/issues/63/importing-six-breaks-pickling






[jira] [Resolved] (SPARK-16086) Python UDF failed when there are no arguments

2016-06-20 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-16086.

   Resolution: Fixed
Fix Version/s: 1.6.2
   1.5.3
   2.0.0

Issue resolved by pull request 13793
[https://github.com/apache/spark/pull/13793]

> Python UDF failed when there are no arguments
> 
>
> Key: SPARK-16086
> URL: https://issues.apache.org/jira/browse/SPARK-16086
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.2, 1.6.1
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0, 1.5.3, 1.6.2
>
>
> {code}
> >>> sqlContext.registerFunction("f", lambda : "a")
> >>> sqlContext.sql("select f()").show()
> {code}
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 171.0 failed 4 times, most recent failure: Lost task 0.3 in stage 171.0 
> (TID 6226, ip-10-0-243-36.us-west-2.compute.internal): 
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File "/databricks/spark/python/pyspark/worker.py", line 111, in main
> process()
>   File "/databricks/spark/python/pyspark/worker.py", line 106, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/databricks/spark/python/pyspark/serializers.py", line 263, in 
> dump_stream
> vs = list(itertools.islice(iterator, batch))
>   File "/databricks/spark/python/pyspark/serializers.py", line 139, in 
> load_stream
> yield self._read_with_length(stream)
>   File "/databricks/spark/python/pyspark/serializers.py", line 164, in 
> _read_with_length
> return self.loads(obj)
>   File "/databricks/spark/python/pyspark/serializers.py", line 422, in loads
> return pickle.loads(obj)
>   File "/databricks/spark/python/pyspark/sql/types.py", line 1159, in 
> return lambda *a: dataType.fromInternal(a)
>   File "/databricks/spark/python/pyspark/sql/types.py", line 568, in 
> fromInternal
> return _create_row(self.names, values)
>   File "/databricks/spark/python/pyspark/sql/types.py", line 1163, in 
> _create_row
> row = Row(*values)
>   File "/databricks/spark/python/pyspark/sql/types.py", line 1210, in __new__
> raise ValueError("No args or kwargs")
> ValueError: (ValueError('No args or kwargs',),  at 
> 0x7f3bbc463320>, ())
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:207)
>   at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
>   at 
> org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1.apply(python.scala:405)
>   at 
> org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1.apply(python.scala:370)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:72)
>   at org.apache.spark.scheduler.Task.run(Task.scala:96)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:222)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}






[jira] [Updated] (SPARK-16086) Python UDF failed when there are no arguments

2016-06-20 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-16086:
---
Description: 
{code}
>>> sqlContext.registerFunction("f", lambda : "a")
>>> sqlContext.sql("select f()").show()
{code}

{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 171.0 failed 4 times, most recent failure: Lost task 0.3 in stage 171.0 
(TID 6226, ip-10-0-243-36.us-west-2.compute.internal): 
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/worker.py", line 111, in main
process()
  File "/databricks/spark/python/pyspark/worker.py", line 106, in process
serializer.dump_stream(func(split_index, iterator), outfile)
  File "/databricks/spark/python/pyspark/serializers.py", line 263, in 
dump_stream
vs = list(itertools.islice(iterator, batch))
  File "/databricks/spark/python/pyspark/serializers.py", line 139, in 
load_stream
yield self._read_with_length(stream)
  File "/databricks/spark/python/pyspark/serializers.py", line 164, in 
_read_with_length
return self.loads(obj)
  File "/databricks/spark/python/pyspark/serializers.py", line 422, in loads
return pickle.loads(obj)
  File "/databricks/spark/python/pyspark/sql/types.py", line 1159, in 
return lambda *a: dataType.fromInternal(a)
  File "/databricks/spark/python/pyspark/sql/types.py", line 568, in 
fromInternal
return _create_row(self.names, values)
  File "/databricks/spark/python/pyspark/sql/types.py", line 1163, in 
_create_row
row = Row(*values)
  File "/databricks/spark/python/pyspark/sql/types.py", line 1210, in __new__
raise ValueError("No args or kwargs")
ValueError: (ValueError('No args or kwargs',), <function <lambda> at 
0x7f3bbc463320>, ())

at 
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at 
org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1.apply(python.scala:405)
at 
org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1.apply(python.scala:370)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:72)
at org.apache.spark.scheduler.Task.run(Task.scala:96)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:222)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

{code}

  was:
{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 171.0 failed 4 times, most recent failure: Lost task 0.3 in stage 171.0 
(TID 6226, ip-10-0-243-36.us-west-2.compute.internal): 
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/worker.py", line 111, in main
process()
  File "/databricks/spark/python/pyspark/worker.py", line 106, in process
serializer.dump_stream(func(split_index, iterator), outfile)
  File "/databricks/spark/python/pyspark/serializers.py", line 263, in 
dump_stream
vs = list(itertools.islice(iterator, batch))
  File "/databricks/spark/python/pyspark/serializers.py", line 139, in 
load_stream
yield self._read_with_length(stream)
  File "/databricks/spark/python/pyspark/serializers.py", line 164, in 
_read_with_length
return self.loads(obj)
  File "/databricks/spark/python/pyspark/serializers.py", line 422, in loads
return pickle.loads(obj)
  File "/databricks/spark/python/pyspark/sql/types.py", line 1159, in 
return lambda *a: 

[jira] [Updated] (SPARK-16086) Python UDF failed when there are no arguments

2016-06-20 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-16086:
---
Affects Version/s: 1.5.2

> Python UDF failed when there are no arguments
> 
>
> Key: SPARK-16086
> URL: https://issues.apache.org/jira/browse/SPARK-16086
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.2, 1.6.1
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 171.0 failed 4 times, most recent failure: Lost task 0.3 in stage 171.0 
> (TID 6226, ip-10-0-243-36.us-west-2.compute.internal): 
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File "/databricks/spark/python/pyspark/worker.py", line 111, in main
> process()
>   File "/databricks/spark/python/pyspark/worker.py", line 106, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/databricks/spark/python/pyspark/serializers.py", line 263, in 
> dump_stream
> vs = list(itertools.islice(iterator, batch))
>   File "/databricks/spark/python/pyspark/serializers.py", line 139, in 
> load_stream
> yield self._read_with_length(stream)
>   File "/databricks/spark/python/pyspark/serializers.py", line 164, in 
> _read_with_length
> return self.loads(obj)
>   File "/databricks/spark/python/pyspark/serializers.py", line 422, in loads
> return pickle.loads(obj)
>   File "/databricks/spark/python/pyspark/sql/types.py", line 1159, in <lambda>
> return lambda *a: dataType.fromInternal(a)
>   File "/databricks/spark/python/pyspark/sql/types.py", line 568, in 
> fromInternal
> return _create_row(self.names, values)
>   File "/databricks/spark/python/pyspark/sql/types.py", line 1163, in 
> _create_row
> row = Row(*values)
>   File "/databricks/spark/python/pyspark/sql/types.py", line 1210, in __new__
> raise ValueError("No args or kwargs")
> ValueError: (ValueError('No args or kwargs',), <function <lambda> at 
> 0x7f3bbc463320>, ())
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
>   at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
>   at 
> org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1.apply(python.scala:405)
>   at 
> org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1.apply(python.scala:370)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:72)
>   at org.apache.spark.scheduler.Task.run(Task.scala:96)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:222)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16086) Python UDF failed when there are no arguments

2016-06-20 Thread Davies Liu (JIRA)
Davies Liu created SPARK-16086:
--

 Summary: Python UDF failed when there are no arguments
 Key: SPARK-16086
 URL: https://issues.apache.org/jira/browse/SPARK-16086
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.6.1
Reporter: Davies Liu
Assignee: Davies Liu


{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 171.0 failed 4 times, most recent failure: Lost task 0.3 in stage 171.0 
(TID 6226, ip-10-0-243-36.us-west-2.compute.internal): 
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/worker.py", line 111, in main
process()
  File "/databricks/spark/python/pyspark/worker.py", line 106, in process
serializer.dump_stream(func(split_index, iterator), outfile)
  File "/databricks/spark/python/pyspark/serializers.py", line 263, in 
dump_stream
vs = list(itertools.islice(iterator, batch))
  File "/databricks/spark/python/pyspark/serializers.py", line 139, in 
load_stream
yield self._read_with_length(stream)
  File "/databricks/spark/python/pyspark/serializers.py", line 164, in 
_read_with_length
return self.loads(obj)
  File "/databricks/spark/python/pyspark/serializers.py", line 422, in loads
return pickle.loads(obj)
  File "/databricks/spark/python/pyspark/sql/types.py", line 1159, in <lambda>
return lambda *a: dataType.fromInternal(a)
  File "/databricks/spark/python/pyspark/sql/types.py", line 568, in 
fromInternal
return _create_row(self.names, values)
  File "/databricks/spark/python/pyspark/sql/types.py", line 1163, in 
_create_row
row = Row(*values)
  File "/databricks/spark/python/pyspark/sql/types.py", line 1210, in __new__
raise ValueError("No args or kwargs")
ValueError: (ValueError('No args or kwargs',), <function <lambda> at 
0x7f3bbc463320>, ())

at 
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at 
org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1.apply(python.scala:405)
at 
org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1.apply(python.scala:370)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:72)
at org.apache.spark.scheduler.Task.run(Task.scala:96)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:222)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

{code}
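
A hedged, untested workaround sketch (not from the ticket): if the failure comes 
from rebuilding a zero-column input row for the Python worker, giving the UDF a 
dummy argument avoids the empty Row. The session name {{sqlContext}} is assumed.

{code}
# Hedged workaround sketch (untested): pass a dummy argument so the worker never
# has to reconstruct a zero-field Row. Assumes an existing SQLContext `sqlContext`.
sqlContext.registerFunction("f", lambda ignored: "a")
sqlContext.sql("SELECT f(1)").show()
{code}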



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16078) from_utc_timestamp/to_utc_timestamp may give different result in different timezone

2016-06-20 Thread Davies Liu (JIRA)
Davies Liu created SPARK-16078:
--

 Summary: from_utc_timestamp/to_utc_timestamp may give different 
result in different timezone
 Key: SPARK-16078
 URL: https://issues.apache.org/jira/browse/SPARK-16078
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.1, 2.0.0
Reporter: Davies Liu
Assignee: Davies Liu


from_utc_timestamp/to_utc_timestamp should return a deterministic result in any 
timezone (system default).
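
For illustration, a hedged sketch (assuming an existing SparkSession named 
{{spark}}): the same expression evaluated under different JVM default timezones 
(e.g. -Duser.timezone=...) should produce the same value once these functions are 
deterministic.

{code}
# Hedged sketch: run this with different JVM default timezones; the output should
# not change once from_utc_timestamp/to_utc_timestamp are deterministic.
# Assumes an existing SparkSession named `spark`.
spark.sql("""
    SELECT from_utc_timestamp(CAST('2016-06-20 00:00:00' AS TIMESTAMP), 'America/Los_Angeles') AS local_ts,
           to_utc_timestamp(CAST('2016-06-20 00:00:00' AS TIMESTAMP), 'America/Los_Angeles') AS utc_ts
""").show()
{code}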



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16077) Python UDF may fail because of six

2016-06-20 Thread Davies Liu (JIRA)
Davies Liu created SPARK-16077:
--

 Summary: Python UDF may fail because of six
 Key: SPARK-16077
 URL: https://issues.apache.org/jira/browse/SPARK-16077
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: Davies Liu


six or other packages may break pickle.whichmodule() in pickle:

https://bitbucket.org/gutworth/six/issues/63/importing-six-breaks-pickling
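
A hedged sketch of the failure mode (assuming six is installed): 
pickle.whichmodule() walks sys.modules and calls getattr() on each module, and 
six's lazy six.moves modules can trigger imports or raise during that scan, which 
is what can break pickling of UDFs in PySpark workers.

{code}
# Hedged sketch (assumes six is installed): merely importing six registers the
# lazy six.moves modules in sys.modules; pickle.whichmodule() then getattr()s
# every module while searching for the function, and that attribute access can
# trigger imports or raise, breaking pickling.
import pickle
import six

def f(x):
    return x + 1

# In an affected environment this lookup may raise instead of returning '__main__'.
print(pickle.whichmodule(f, "f"))
{code}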



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15613) Incorrect days to millis conversion

2016-06-19 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-15613.

   Resolution: Fixed
Fix Version/s: 1.6.2
   2.0.0

Issue resolved by pull request 13652
[https://github.com/apache/spark/pull/13652]

> Incorrect days to millis conversion 
> 
>
> Key: SPARK-15613
> URL: https://issues.apache.org/jira/browse/SPARK-15613
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
> Environment: java version "1.8.0_91"
>Reporter: Dmitry Bushev
>Assignee: Apache Spark
>Priority: Critical
> Fix For: 2.0.0, 1.6.2
>
>
> There is an issue with the {{DateTimeUtils.daysToMillis}} implementation. It 
> affects {{DateTimeUtils.toJavaDate}} and ultimately CatalystTypeConverter, 
> i.e. the conversion of a date stored as {{Int}} days from epoch in InternalRow 
> to the {{java.sql.Date}} of the Row returned to the user.
>  
> The issue can be reproduced with this test (all the following tests are in my 
> default timezone, Europe/Moscow):
> {code}
> $ sbt -Duser.timezone=Europe/Moscow catalyst/console
> scala> java.util.Calendar.getInstance().getTimeZone
> res0: java.util.TimeZone = 
> sun.util.calendar.ZoneInfo[id="Europe/Moscow",offset=1080,dstSavings=0,useDaylight=false,transitions=79,lastRule=null]
> scala> import org.apache.spark.sql.catalyst.util.DateTimeUtils._
> import org.apache.spark.sql.catalyst.util.DateTimeUtils._
> scala> for (days <- 0 to 2 if millisToDays(daysToMillis(days)) != days) 
> yield days
> res23: scala.collection.immutable.IndexedSeq[Int] = Vector(4108, 4473, 4838, 
> 5204, 5568, 5932, 6296, 6660, 7024, 7388, 8053, 8487, 8851, 9215, 9586, 9950, 
> 10314, 10678, 11042, 11406, 11777, 12141, 12505, 12869, 13233, 13597, 13968, 
> 14332, 14696, 15060)
> {code}
> For example, for day {{4108}} of the epoch, the correct date should be 
> {{1981-04-01}}:
> {code}
> scala> DateTimeUtils.toJavaDate(4107)
> res25: java.sql.Date = 1981-03-31
> scala> DateTimeUtils.toJavaDate(4108)
> res26: java.sql.Date = 1981-03-31
> scala> DateTimeUtils.toJavaDate(4109)
> res27: java.sql.Date = 1981-04-02
> {code}
> There was a previous unsuccessful attempt to work around the problem in 
> SPARK-11415. It seems that the issue involves flaws in the Java date 
> implementation, and I don't see how it can be fixed without third-party libraries.
> I was not able to identify the library of choice for Spark. The following 
> implementation uses [JSR-310|http://www.threeten.org/]:
> {code}
> def millisToDays(millisUtc: Long): SQLDate = {
>   val instant = Instant.ofEpochMilli(millisUtc)
>   val zonedDateTime = instant.atZone(ZoneId.systemDefault)
>   zonedDateTime.toLocalDate.toEpochDay.toInt
> }
> def daysToMillis(days: SQLDate): Long = {
>   val localDate = LocalDate.ofEpochDay(days)
>   val zonedDateTime = localDate.atStartOfDay(ZoneId.systemDefault)
>   zonedDateTime.toInstant.toEpochMilli
> }
> {code}
> that produces correct results:
> {code}
> scala> for (days <- 0 to 2 if millisToDays(daysToMillis(days)) != days) 
> yield days
> res37: scala.collection.immutable.IndexedSeq[Int] = Vector()
> scala> new java.sql.Date(daysToMillis(4108))
> res36: java.sql.Date = 1981-04-01
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15803) Support with statement syntax for SparkSession

2016-06-17 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-15803:
---
Assignee: Jeff Zhang

> Support with statement syntax for SparkSession
> --
>
> Key: SPARK-15803
> URL: https://issues.apache.org/jira/browse/SPARK-15803
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
>Priority: Minor
> Fix For: 2.0.0
>
>
> It would be nice to support the with-statement syntax for SparkSession, like 
> the following:
> {code}
> with SparkSession.builder.(...).getOrCreate() as session:
>   session.sql("show tables").show()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15803) Support with statement syntax for SparkSession

2016-06-17 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-15803.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13541
[https://github.com/apache/spark/pull/13541]

> Support with statement syntax for SparkSession
> --
>
> Key: SPARK-15803
> URL: https://issues.apache.org/jira/browse/SPARK-15803
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>Priority: Minor
> Fix For: 2.0.0
>
>
> It would be nice to support the with-statement syntax for SparkSession, like 
> the following:
> {code}
> with SparkSession.builder.(...).getOrCreate() as session:
>   session.sql("show tables").show()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13753) Column nullable is derived incorrectly

2016-06-17 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15336567#comment-15336567
 ] 

Davies Liu commented on SPARK-13753:


After discussing with [~cloud_fan]: we do have a runtime check to make sure that 
the key of a Map cannot be null, but we do not have a check on the schema. So the 
query could fail, but it should not return incorrect results (see the sketch below).

Could you provide more details on the expected and the actual results?
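
For reference, a hedged sketch of the runtime check described above; 
{{sqlContext}} is an assumed existing SQLContext and the exact error message may 
differ by version.

{code}
# Hedged sketch (assumes an existing SQLContext named `sqlContext`): building a
# map with a null key is expected to fail at runtime rather than silently return
# a wrong result, matching the runtime check mentioned in the comment above.
sqlContext.sql("SELECT map(CAST(NULL AS STRING), 1) AS m").collect()
{code}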

> Column nullable is derived incorrectly
> --
>
> Key: SPARK-13753
> URL: https://issues.apache.org/jira/browse/SPARK-13753
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Jingwei Lu
>Priority: Critical
>
> There is a problem in Spark SQL where a column's nullability is derived 
> incorrectly and then used in optimization. In the following query:
> {code}
> select concat("perf.realtime.web", b.tags[1]) as metric, b.value, b.tags[0]
>   from (
> select explode(map(a.frontend[0], 
> ARRAY(concat("metric:frontend", ",controller:", COALESCE(controller, "null"), 
> ",action:", COALESCE(action, "null")), ".p50"),
>  a.frontend[1], 
> ARRAY(concat("metric:frontend", ",controller:", COALESCE(controller, "null"), 
> ",action:", COALESCE(action, "null")), ".p90"),
>  a.backend[0], ARRAY(concat("metric:backend", 
> ",controller:", COALESCE(controller, "null"), ",action:", COALESCE(action, 
> "null")), ".p50"),
>  a.backend[1], ARRAY(concat("metric:backend", 
> ",controller:", COALESCE(controller, "null"), ",action:", COALESCE(action, 
> "null")), ".p90"),
>  a.render[0], ARRAY(concat("metric:render", 
> ",controller:", COALESCE(controller, "null"), ",action:", COALESCE(action, 
> "null")), ".p50"),
>  a.render[1], ARRAY(concat("metric:render", 
> ",controller:", COALESCE(controller, "null"), ",action:", COALESCE(action, 
> "null")), ".p90"),
>  a.page_load_time[0], 
> ARRAY(concat("metric:page_load_time", ",controller:", COALESCE(controller, 
> "null"), ",action:", COALESCE(action, "null")), ".p50"),
>  a.page_load_time[1], 
> ARRAY(concat("metric:page_load_time", ",controller:", COALESCE(controller, 
> "null"), ",action:", COALESCE(action, "null")), ".p90"),
>  a.total_load_time[0], 
> ARRAY(concat("metric:total_load_time", ",controller:", COALESCE(controller, 
> "null"), ",action:", COALESCE(action, "null")), ".p50"),
>  a.total_load_time[1], 
> ARRAY(concat("metric:total_load_time", ",controller:", COALESCE(controller, 
> "null"), ",action:", COALESCE(action, "null")), ".p90"))) as (value, tags)
> from (
>   select  data.controller as controller, data.action as 
> action,
>   percentile(data.frontend, array(0.5, 0.9)) as 
> frontend,
>   percentile(data.backend, array(0.5, 0.9)) as 
> backend,
>   percentile(data.render, array(0.5, 0.9)) as render,
>   percentile(data.page_load_time, array(0.5, 0.9)) as 
> page_load_time,
>   percentile(data.total_load_time, array(0.5, 0.9)) 
> as total_load_time
>   from air_events_rt
>   where type='air_events' and data.event_name='pageload'
>   group by data.controller, data.action
> ) a
>   ) b
>   where b.value is not null
> {code}
> b.value is incorrectly derived as not nullable. The "b.value is not null" 
> predicate is then ignored by the optimizer, which causes the query to return an 
> incorrect result.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16011) SQL metrics include duplicated attempts

2016-06-16 Thread Davies Liu (JIRA)
Davies Liu created SPARK-16011:
--

 Summary: SQL metrics include duplicated attempts
 Key: SPARK-16011
 URL: https://issues.apache.org/jira/browse/SPARK-16011
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Reporter: Davies Liu
Assignee: Wenchen Fan
Priority: Blocker


When I ran a simple scan-and-aggregate query, the number of rows reported for the 
scan could differ from run to run. The actually scanned result is correct, but the 
SQL metrics are wrong (they should not include duplicated attempts). This is a 
regression since 1.6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15822) segmentation violation in o.a.s.unsafe.types.UTF8String

2016-06-16 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-15822.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13723
[https://github.com/apache/spark/pull/13723]

> segmentation violation in o.a.s.unsafe.types.UTF8String 
> 
>
> Key: SPARK-15822
> URL: https://issues.apache.org/jira/browse/SPARK-15822
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: linux amd64
> openjdk version "1.8.0_91"
> OpenJDK Runtime Environment (build 1.8.0_91-b14)
> OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)
>Reporter: Pete Robbins
>Assignee: Herman van Hovell
>Priority: Blocker
> Fix For: 2.0.0
>
>
> Executors fail with a segmentation violation while running an application with
> spark.memory.offHeap.enabled true
> spark.memory.offHeap.size 512m
> Also now reproduced with 
> spark.memory.offHeap.enabled false
> {noformat}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7f4559b4d4bd, pid=14182, tid=139935319750400
> #
> # JRE version: OpenJDK Runtime Environment (8.0_91-b14) (build 1.8.0_91-b14)
> # Java VM: OpenJDK 64-Bit Server VM (25.91-b14 mixed mode linux-amd64 
> compressed oops)
> # Problematic frame:
> # J 4816 C2 
> org.apache.spark.unsafe.types.UTF8String.compareTo(Lorg/apache/spark/unsafe/types/UTF8String;)I
>  (64 bytes) @ 0x7f4559b4d4bd [0x7f4559b4d460+0x5d]
> {noformat}
> We initially saw this with IBM Java on a PowerPC box, but it is recreatable on 
> Linux with OpenJDK. On Linux with IBM Java 8 we see a null pointer exception at 
> the same code point:
> {noformat}
> 16/06/08 11:14:58 ERROR Executor: Exception in task 1.0 in stage 5.0 (TID 48)
> java.lang.NullPointerException
>   at 
> org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:831)
>   at org.apache.spark.unsafe.types.UTF8String.compare(UTF8String.java:844)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$doExecute$2$$anon$2.hasNext(WholeStageCodegenExec.scala:377)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
>   at org.spark_project.guava.collect.Ordering.leastOf(Ordering.java:664)
>   at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1365)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1362)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.lang.Thread.run(Thread.java:785)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16003) SerializationDebugger runs into an infinite loop

2016-06-16 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-16003:
---
Description: 
This is observed while debugging 
https://issues.apache.org/jira/browse/SPARK-15811

{code}
sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source) 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 
java.lang.reflect.Method.invoke(Method.java:497) 
java.io.ObjectStreamClass.invokeWriteReplace(ObjectStreamClass.java:1118) 
sun.reflect.GeneratedMethodAccessor84.invoke(Unknown Source) 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 
java.lang.reflect.Method.invoke(Method.java:497) 
org.apache.spark.serializer.SerializationDebugger$ObjectStreamClassMethods$.invokeWriteReplace$extension(SerializationDebugger.scala:347)
 
org.apache.spark.serializer.SerializationDebugger$.org$apache$spark$serializer$SerializationDebugger$$findObjectAndDescriptor(SerializationDebugger.scala:269)
 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:154)
 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108)
 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:206)
 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108)
 
org.apache.spark.serializer.SerializationDebugger$.find(SerializationDebugger.scala:67)
 
org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
 
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
 
org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108) 
org.apache.spark.SparkContext.clean(SparkContext.scala:2038) 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:789) 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:788) 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) 
org.apache.spark.rdd.RDD.withScope(RDD.scala:357) 
org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:788) 
org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:355)
 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
 org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) 
org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) 
org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:240) 
org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:323) 
org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39) 
org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2163)
 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
 org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2489) 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2162)
 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2169)
 org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1905) 
org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1904) 
org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2519) 
org.apache.spark.sql.Dataset.head(Dataset.scala:1904) 
org.apache.spark.sql.Dataset.take(Dataset.scala:2119) 
com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:80)
 
com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:42)
 
com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$2.apply(ScalaDriverLocal.scala:196)
 
com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$2.apply(ScalaDriverLocal.scala:188)
 scala.Option.map(Option.scala:145) 
com.databricks.backend.daemon.driver.ScalaDriverLocal.repl(ScalaDriverLocal.scala:188)
 
com.databricks.backend.daemon.driver.DriverLocal$$anonfun$execute$3.apply(DriverLocal.scala:169)
 

[jira] [Updated] (SPARK-16003) SerializationDebugger runs into an infinite loop

2016-06-16 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-16003:
---
Description: 
This is observed while debugging 
https://issues.apache.org/jira/browse/SPARK-15811

{code}
sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source) 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 java.lang.reflect.Method.invoke(Method.java:497) 
java.io.ObjectStreamClass.invokeWriteReplace(ObjectStreamClass.java:1118) 
sun.reflect.GeneratedMethodAccessor84.invoke(Unknown Source) 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 java.lang.reflect.Method.invoke(Method.java:497) 
org.apache.spark.serializer.SerializationDebugger$ObjectStreamClassMethods$.invokeWriteReplace$extension(SerializationDebugger.scala:347)
 
org.apache.spark.serializer.SerializationDebugger$.org$apache$spark$serializer$SerializationDebugger$$findObjectAndDescriptor(SerializationDebugger.scala:269)
 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:154)
 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108)
 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:206)
 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108)
 
org.apache.spark.serializer.SerializationDebugger$.find(SerializationDebugger.scala:67)
 
org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
 
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
 org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108) 
org.apache.spark.SparkContext.clean(SparkContext.scala:2038) 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:789) 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:788) 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) 
org.apache.spark.rdd.RDD.withScope(RDD.scala:357) 
org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:788) 
org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:355)
 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
 org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) 
org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) 
org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:240) 
org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:323) 
org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39) 
org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2163)
 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
 org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2489) 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2162)
 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2169)
 org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1905) 
org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1904) 
org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2519) 
org.apache.spark.sql.Dataset.head(Dataset.scala:1904) 
org.apache.spark.sql.Dataset.take(Dataset.scala:2119) 
com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:80)
 
com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:42)
 
com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$2.apply(ScalaDriverLocal.scala:196)
 
com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$2.apply(ScalaDriverLocal.scala:188)
 scala.Option.map(Option.scala:145) 
com.databricks.backend.daemon.driver.ScalaDriverLocal.repl(ScalaDriverLocal.scala:188)
 
com.databricks.backend.daemon.driver.DriverLocal$$anonfun$execute$3.apply(DriverLocal.scala:169)
 

[jira] [Created] (SPARK-16003) SerializationDebugger runs into an infinite loop

2016-06-16 Thread Davies Liu (JIRA)
Davies Liu created SPARK-16003:
--

 Summary: SerializationDebugger runs into an infinite loop
 Key: SPARK-16003
 URL: https://issues.apache.org/jira/browse/SPARK-16003
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Davies Liu
Priority: Critical


This is observed while debugging 
https://issues.apache.org/jira/browse/SPARK-15811

We should fix it or disable it by default.
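
One possible mitigation sketch (not a fix), assuming the 
spark.serializer.extraDebugInfo flag that gates SerializationDebugger inside 
JavaSerializer: turning it off lets the underlying serialization error surface 
instead of looping in the debugger.

{code}
# Hedged mitigation sketch: disable the extra serialization debug info so the
# original NotSerializableException is reported directly instead of entering
# SerializationDebugger. Assumes the spark.serializer.extraDebugInfo flag.
from pyspark import SparkConf, SparkContext

conf = SparkConf().set("spark.serializer.extraDebugInfo", "false")
sc = SparkContext(conf=conf)
{code}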



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15811) Python UDFs do not work in Spark 2.0-preview built with scala 2.10

2016-06-16 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-15811:
---
Description: 
I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following

{code}
./dev/change-version-to-2.10.sh
./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 
-Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6  -Pyarn -Phive
{code}

and then ran the following code in a pyspark shell

{code}
from pyspark.sql import SparkSession 
from pyspark.sql.types import IntegerType, StructField, StructType
from pyspark.sql.functions import udf
from pyspark.sql.types import Row
spark = SparkSession.builder.master('local[4]').appName('2.0 DF').getOrCreate()
add_one = udf(lambda x: x + 1, IntegerType())
schema = StructType([StructField('a', IntegerType(), False)])
df = sqlContext.createDataFrame([Row(a=1),Row(a=2)], schema)
df.select(add_one(df.a).alias('incremented')).collect()
{code}

This never returns with a result. 


  was:
I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following

{code}
./dev/change-version-to-2.10.sh
./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 
-Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6  -Pyarn -Phive
{code}

and then ran the following code in a pyspark shell

{code}
from pyspark.sql.types import IntegerType, StructField, StructType
from pyspark.sql.functions import udf
from pyspark.sql.types import Row
spark = SparkSession.builder.master('local[4]').appName('2.0 DF').getOrCreate()
add_one = udf(lambda x: x + 1, IntegerType())
schema = StructType([StructField('a', IntegerType(), False)])
df = sqlContext.createDataFrame([Row(a=1),Row(a=2)], schema)
df.select(add_one(df.a).alias('incremented')).collect()
{code}

This never returns with a result. 



> Python UDFs do not work in Spark 2.0-preview built with scala 2.10
> --
>
> Key: SPARK-15811
> URL: https://issues.apache.org/jira/browse/SPARK-15811
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Franklyn Dsouza
>Assignee: Davies Liu
>Priority: Blocker
>
> I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following
> {code}
> ./dev/change-version-to-2.10.sh
> ./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 
> -Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6  -Pyarn -Phive
> {code}
> and then ran the following code in a pyspark shell
> {code}
> from pyspark.sql import SparkSession 
> from pyspark.sql.types import IntegerType, StructField, StructType
> from pyspark.sql.functions import udf
> from pyspark.sql.types import Row
> spark = SparkSession.builder.master('local[4]').appName('2.0 
> DF').getOrCreate()
> add_one = udf(lambda x: x + 1, IntegerType())
> schema = StructType([StructField('a', IntegerType(), False)])
> df = sqlContext.createDataFrame([Row(a=1),Row(a=2)], schema)
> df.select(add_one(df.a).alias('incremented')).collect()
> {code}
> This never returns with a result. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15934) Return binary mode in ThriftServer

2016-06-15 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-15934.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13667
[https://github.com/apache/spark/pull/13667]

> Return binary mode in ThriftServer
> --
>
> Key: SPARK-15934
> URL: https://issues.apache.org/jira/browse/SPARK-15934
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>Priority: Critical
> Fix For: 2.0.0
>
>
> In the spark-2.0.0 preview, binary mode was turned off (SPARK-15095).
> That was a very irresponsible step: in 1.6.1 binary mode was the default, and it 
> was turned off in 2.0.0.
> Just to describe the magnitude of harm that not fixing this bug would do in my 
> organization:
> * Tableau works only through the Thrift Server and only with the binary format. 
> Tableau would not work with spark-2.0.0 at all!
> * I have a bunch of analysts in my organization with configured SQL clients 
> (DataGrip and Squirrel). I would need to go one by one to change the connection 
> string for them (DataGrip). Squirrel simply does not work with HTTP - some jar 
> hell in my case.
> * Let me not mention all the other things that connect to our data 
> infrastructure through the ThriftServer as a gateway.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15888) Python UDF over aggregate fails

2016-06-15 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-15888.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13682
[https://github.com/apache/spark/pull/13682]

> Python UDF over aggregate fails
> ---
>
> Key: SPARK-15888
> URL: https://issues.apache.org/jira/browse/SPARK-15888
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Vladimir Feinberg
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 2.0.0
>
>
> This looks like a regression from 1.6.1.
> The following notebook runs without error in a Spark 1.6.1 cluster, but fails 
> in 2.0.0:
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6001574963454425/3194562079278586/1653464426712019/latest.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15811) Python UDFs do not work in Spark 2.0-preview built with scala 2.10

2016-06-15 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-15811:
---
Description: 
I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following

{code}
./dev/change-version-to-2.10.sh
./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 
-Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6  -Pyarn -Phive
{code}

and then ran the following code in a pyspark shell

{code}
from pyspark.sql.types import IntegerType, StructField, StructType
from pyspark.sql.functions import udf
from pyspark.sql.types import Row
spark = SparkSession.builder.master('local[4]').appName('2.0 DF').getOrCreate()
add_one = udf(lambda x: x + 1, IntegerType())
schema = StructType([StructField('a', IntegerType(), False)])
df = sqlContext.createDataFrame([Row(a=1),Row(a=2)], schema)
df.select(add_one(df.a).alias('incremented')).collect()
{code}

This never returns with a result. 


  was:
I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following

{code}
./dev/change-version-to-2.10.sh
./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 
-Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6  -Pyarn -Phive
{code}

and then ran the following code in a pyspark shell

{code}
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StructField, StructType
from pyspark.sql.functions import udf
from pyspark.sql.types import Row
spark = SparkSession.builder.master('local[4]').appName('2.0 DF').getOrCreate()
add_one = udf(lambda x: x + 1, IntegerType())
schema = StructType([StructField('a', IntegerType(), False)])
df = spark.createDataFrame([Row(a=1),Row(a=2)], schema)
df.select(add_one(df.a).alias('incremented')).collect()
{code}

This never returns with a result. 



> Python UDFs do not work in Spark 2.0-preview built with scala 2.10
> --
>
> Key: SPARK-15811
> URL: https://issues.apache.org/jira/browse/SPARK-15811
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Franklyn Dsouza
>Assignee: Davies Liu
>Priority: Blocker
>
> I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following
> {code}
> ./dev/change-version-to-2.10.sh
> ./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 
> -Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6  -Pyarn -Phive
> {code}
> and then ran the following code in a pyspark shell
> {code}
> from pyspark.sql.types import IntegerType, StructField, StructType
> from pyspark.sql.functions import udf
> from pyspark.sql.types import Row
> spark = SparkSession.builder.master('local[4]').appName('2.0 
> DF').getOrCreate()
> add_one = udf(lambda x: x + 1, IntegerType())
> schema = StructType([StructField('a', IntegerType(), False)])
> df = sqlContext.createDataFrame([Row(a=1),Row(a=2)], schema)
> df.select(add_one(df.a).alias('incremented')).collect()
> {code}
> This never returns with a result. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15811) Python UDFs do not work in Spark 2.0-preview built with scala 2.10

2016-06-15 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-15811:
---
Summary: Python UDFs do not work in Spark 2.0-preview built with scala 2.10 
 (was: UDFs do not work in Spark 2.0-preview built with scala 2.10)

> Python UDFs do not work in Spark 2.0-preview built with scala 2.10
> --
>
> Key: SPARK-15811
> URL: https://issues.apache.org/jira/browse/SPARK-15811
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Franklyn Dsouza
>Assignee: Davies Liu
>Priority: Blocker
>
> I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following
> {code}
> ./dev/change-version-to-2.10.sh
> ./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 
> -Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6  -Pyarn -Phive
> {code}
> and then ran the following code in a pyspark shell
> {code}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import IntegerType, StructField, StructType
> from pyspark.sql.functions import udf
> from pyspark.sql.types import Row
> spark = SparkSession.builder.master('local[4]').appName('2.0 
> DF').getOrCreate()
> add_one = udf(lambda x: x + 1, IntegerType())
> schema = StructType([StructField('a', IntegerType(), False)])
> df = spark.createDataFrame([Row(a=1),Row(a=2)], schema)
> df.select(add_one(df.a).alias('incremented')).collect()
> {code}
> This never returns with a result. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15811) UDFs do not work in Spark 2.0-preview built with scala 2.10

2016-06-15 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-15811:
--

Assignee: Davies Liu

> UDFs do not work in Spark 2.0-preview built with scala 2.10
> ---
>
> Key: SPARK-15811
> URL: https://issues.apache.org/jira/browse/SPARK-15811
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Franklyn Dsouza
>Assignee: Davies Liu
>Priority: Blocker
>
> I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following
> {code}
> ./dev/change-version-to-2.10.sh
> ./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 
> -Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6  -Pyarn -Phive
> {code}
> and then ran the following code in a pyspark shell
> {code}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import IntegerType, StructField, StructType
> from pyspark.sql.functions import udf
> from pyspark.sql.types import Row
> spark = SparkSession.builder.master('local[4]').appName('2.0 
> DF').getOrCreate()
> add_one = udf(lambda x: x + 1, IntegerType())
> schema = StructType([StructField('a', IntegerType(), False)])
> df = spark.createDataFrame([Row(a=1),Row(a=2)], schema)
> df.select(add_one(df.a).alias('incremented')).collect()
> {code}
> This never returns with a result. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15811) UDFs do not work in Spark 2.0-preview built with scala 2.10

2016-06-15 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-15811:
---
Priority: Blocker  (was: Critical)

> UDFs do not work in Spark 2.0-preview built with scala 2.10
> ---
>
> Key: SPARK-15811
> URL: https://issues.apache.org/jira/browse/SPARK-15811
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Franklyn Dsouza
>Priority: Blocker
>
> I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following
> {code}
> ./dev/change-version-to-2.10.sh
> ./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 
> -Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6  -Pyarn -Phive
> {code}
> and then ran the following code in a pyspark shell
> {code}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import IntegerType, StructField, StructType
> from pyspark.sql.functions import udf
> from pyspark.sql.types import Row
> spark = SparkSession.builder.master('local[4]').appName('2.0 
> DF').getOrCreate()
> add_one = udf(lambda x: x + 1, IntegerType())
> schema = StructType([StructField('a', IntegerType(), False)])
> df = spark.createDataFrame([Row(a=1),Row(a=2)], schema)
> df.select(add_one(df.a).alias('incremented')).collect()
> {code}
> This never returns with a result. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15888) Python UDF over aggregate fails

2016-06-14 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-15888:
---
Priority: Blocker  (was: Major)

> Python UDF over aggregate fails
> ---
>
> Key: SPARK-15888
> URL: https://issues.apache.org/jira/browse/SPARK-15888
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Vladimir Feinberg
>Priority: Blocker
>
> This looks like a regression from 1.6.1.
> The following notebook runs without error in a Spark 1.6.1 cluster, but fails 
> in 2.0.0:
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6001574963454425/3194562079278586/1653464426712019/latest.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15888) Python UDF over aggregate fails

2016-06-14 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-15888:
--

Assignee: Davies Liu

> Python UDF over aggregate fails
> ---
>
> Key: SPARK-15888
> URL: https://issues.apache.org/jira/browse/SPARK-15888
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Vladimir Feinberg
>Assignee: Davies Liu
>Priority: Blocker
>
> This looks like a regression from 1.6.1.
> The following notebook runs without error in a Spark 1.6.1 cluster, but fails 
> in 2.0.0:
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6001574963454425/3194562079278586/1653464426712019/latest.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15613) Incorrect days to millis conversion

2016-06-13 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-15613:
--

Assignee: Davies Liu

> Incorrect days to millis conversion 
> 
>
> Key: SPARK-15613
> URL: https://issues.apache.org/jira/browse/SPARK-15613
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
> Environment: java version "1.8.0_91"
>Reporter: Dmitry Bushev
>Assignee: Davies Liu
>Priority: Critical
>
> There is an issue with the {{DateTimeUtils.daysToMillis}} implementation. It 
> affects {{DateTimeUtils.toJavaDate}} and ultimately CatalystTypeConverter, 
> i.e. the conversion of a date stored as {{Int}} days from epoch in InternalRow 
> to the {{java.sql.Date}} of the Row returned to the user.
>  
> The issue can be reproduced with this test (all the following tests are in my 
> default timezone, Europe/Moscow):
> {code}
> $ sbt -Duser.timezone=Europe/Moscow catalyst/console
> scala> java.util.Calendar.getInstance().getTimeZone
> res0: java.util.TimeZone = 
> sun.util.calendar.ZoneInfo[id="Europe/Moscow",offset=1080,dstSavings=0,useDaylight=false,transitions=79,lastRule=null]
> scala> import org.apache.spark.sql.catalyst.util.DateTimeUtils._
> import org.apache.spark.sql.catalyst.util.DateTimeUtils._
> scala> for (days <- 0 to 2 if millisToDays(daysToMillis(days)) != days) 
> yield days
> res23: scala.collection.immutable.IndexedSeq[Int] = Vector(4108, 4473, 4838, 
> 5204, 5568, 5932, 6296, 6660, 7024, 7388, 8053, 8487, 8851, 9215, 9586, 9950, 
> 10314, 10678, 11042, 11406, 11777, 12141, 12505, 12869, 13233, 13597, 13968, 
> 14332, 14696, 15060)
> {code}
> For example, for day {{4108}} of the epoch, the correct date should be 
> {{1981-04-01}}:
> {code}
> scala> DateTimeUtils.toJavaDate(4107)
> res25: java.sql.Date = 1981-03-31
> scala> DateTimeUtils.toJavaDate(4108)
> res26: java.sql.Date = 1981-03-31
> scala> DateTimeUtils.toJavaDate(4109)
> res27: java.sql.Date = 1981-04-02
> {code}
> There was a previous unsuccessful attempt to work around the problem in 
> SPARK-11415. It seems that the issue involves flaws in the Java date 
> implementation, and I don't see how it can be fixed without third-party libraries.
> I was not able to identify the library of choice for Spark. The following 
> implementation uses [JSR-310|http://www.threeten.org/]:
> {code}
> def millisToDays(millisUtc: Long): SQLDate = {
>   val instant = Instant.ofEpochMilli(millisUtc)
>   val zonedDateTime = instant.atZone(ZoneId.systemDefault)
>   zonedDateTime.toLocalDate.toEpochDay.toInt
> }
> def daysToMillis(days: SQLDate): Long = {
>   val localDate = LocalDate.ofEpochDay(days)
>   val zonedDateTime = localDate.atStartOfDay(ZoneId.systemDefault)
>   zonedDateTime.toInstant.toEpochMilli
> }
> {code}
> that produces correct results:
> {code}
> scala> for (days <- 0 to 2 if millisToDays(daysToMillis(days)) != days) 
> yield days
> res37: scala.collection.immutable.IndexedSeq[Int] = Vector()
> scala> new java.sql.Date(daysToMillis(4108))
> res36: java.sql.Date = 1981-04-01
> {code}
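> As an aside (not part of the original report, and relying only on the JDK's 
> java.time API): the failing days appear to be the ones whose local midnight 
> does not exist because of a DST transition. For day {{4108}} ({{1981-04-01}}), 
> Europe/Moscow moved the clocks forward at midnight, so start-of-day resolves 
> to 01:00 and the round trip lands on the previous day. A small check, assuming 
> the JDK's bundled tz data:
> {code}
> import java.time._
>
> // Europe/Moscow switched to DST at 00:00 on 1981-04-01 (per the standard tz database),
> // so the local midnight of that date does not exist.
> val zone = ZoneId.of("Europe/Moscow")
> val gap = zone.getRules.getTransition(LocalDateTime.of(1981, 4, 1, 0, 0))
> println(gap)                                          // expected: a gap transition at 1981-04-01T00:00
> println(LocalDate.of(1981, 4, 1).atStartOfDay(zone))  // expected: 1981-04-01T01:00+04:00[Europe/Moscow]
> {code}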



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15896) Clean shuffle files after finish the SQL query

2016-06-11 Thread Davies Liu (JIRA)
Davies Liu created SPARK-15896:
--

 Summary: Clean shuffle files after finish the SQL query 
 Key: SPARK-15896
 URL: https://issues.apache.org/jira/browse/SPARK-15896
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu


The ShuffleRDD in a SQL query cannot be reused later, so we could remove the 
shuffle files once the query finishes in order to free the disk space as soon as 
possible.
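
To make the motivation concrete, a minimal sketch (assuming a local SparkSession 
named {{spark}}): any query whose plan contains an Exchange writes shuffle files 
under {{spark.local.dir}}, and today those files stay on disk until the 
corresponding shuffle is garbage-collected on the driver.

{code}
// Illustrative only: the shuffle introduced by groupBy leaves files under spark.local.dir
// after the query has finished, even though the ShuffledRowRDD will never be reused.
val df = spark.range(1000000L).selectExpr("id % 100 as k", "id as v")
df.groupBy("k").count().collect()   // forces an Exchange (shuffle)
// At this point the shuffle files are still on disk; they are only removed once the
// driver GCs the shuffle dependency and ContextCleaner asks the executors to delete them.
{code}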



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15888) Python UDF over aggregate fails

2016-06-10 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325733#comment-15325733
 ] 

Davies Liu commented on SPARK-15888:


After some investigation, it turned out that a Python UDF over the result of an 
aggregate function cannot be extracted and inserted BEFORE the aggregate; it 
should be inserted AFTER the aggregate.

A logical aggregate becomes multiple physical aggregates, so it may be better to 
add another rule for the logical plan (and keep the current rule for the 
physical plan).
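
For readers following along, the failing shape is a UDF applied to the output of 
an aggregation. A minimal Scala analogue of that shape is sketched below; the 
regression itself only affects Python UDFs, and the data and column names here 
are illustrative (it assumes a Spark 2.0 SparkSession named {{spark}}), not 
taken from the linked notebook:

{code}
import org.apache.spark.sql.functions.{col, sum, udf}

// Hypothetical data; the point is only the structure: a UDF over an aggregate result.
val df = spark.range(100).selectExpr("id % 10 as k", "id as v")
val double = udf((x: Long) => x * 2)

df.groupBy(col("k"))
  .agg(sum(col("v")).as("total"))
  .select(col("k"), double(col("total")))   // the UDF must be evaluated AFTER the aggregate
  .show()
{code}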

> Python UDF over aggregate fails
> ---
>
> Key: SPARK-15888
> URL: https://issues.apache.org/jira/browse/SPARK-15888
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Vladimir Feinberg
>
> This looks like a regression from 1.6.1.
> The following notebook runs without error in a Spark 1.6.1 cluster, but fails 
> in 2.0.0:
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6001574963454425/3194562079278586/1653464426712019/latest.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15888) Python UDF over aggregate fails

2016-06-10 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-15888:
---
Summary: Python UDF over aggregate fails  (was: UDF fails in Python)

> Python UDF over aggregate fails
> ---
>
> Key: SPARK-15888
> URL: https://issues.apache.org/jira/browse/SPARK-15888
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Vladimir Feinberg
>
> This looks like a regression from 1.6.1.
> The following notebook runs without error in a Spark 1.6.1 cluster, but fails 
> in 2.0.0:
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6001574963454425/3194562079278586/1653464426712019/latest.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15759) Fallback to non-codegen if fail to compile generated code

2016-06-10 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-15759.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13501
[https://github.com/apache/spark/pull/13501]

> Fallback to non-codegen if fail to compile generated code
> -
>
> Key: SPARK-15759
> URL: https://issues.apache.org/jira/browse/SPARK-15759
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> If anything goes wrong in whole-stage codegen, we should temporarily disable it 
> for that part of the query to make sure the query can still run.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15678) Not use cache on appends and overwrites

2016-06-10 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-15678:
---
Assignee: Sameer Agarwal

> Not use cache on appends and overwrites
> ---
>
> Key: SPARK-15678
> URL: https://issues.apache.org/jira/browse/SPARK-15678
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Sameer Agarwal
>Assignee: Sameer Agarwal
> Fix For: 2.0.0
>
>
> SparkSQL currently doesn't drop caches if the underlying data is overwritten.
> {code}
> val dir = "/tmp/test"
> sqlContext.range(1000).write.mode("overwrite").parquet(dir)
> val df = sqlContext.read.parquet(dir).cache()
> df.count() // outputs 1000
> sqlContext.range(10).write.mode("overwrite").parquet(dir)
> sqlContext.read.parquet(dir).count() // outputs 1000 instead of 10 <-- we 
> are still using the cached dataset
> {code}
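> A hedged workaround sketch until this is fixed (these are standard SQLContext 
> methods, though whether they fully cover every variant of this case is an 
> assumption):
> {code}
> // Drop the stale cache before re-reading the overwritten path.
> df.unpersist()                // or: sqlContext.clearCache() to drop all cached data
> val fresh = sqlContext.read.parquet(dir)
> fresh.count()                 // should now reflect the overwritten data
> {code}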



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15678) Not use cache on appends and overwrites

2016-06-10 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-15678.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13566
[https://github.com/apache/spark/pull/13566]

> Not use cache on appends and overwrites
> ---
>
> Key: SPARK-15678
> URL: https://issues.apache.org/jira/browse/SPARK-15678
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Sameer Agarwal
> Fix For: 2.0.0
>
>
> SparkSQL currently doesn't drop caches if the underlying data is overwritten.
> {code}
> val dir = "/tmp/test"
> sqlContext.range(1000).write.mode("overwrite").parquet(dir)
> val df = sqlContext.read.parquet(dir).cache()
> df.count() // outputs 1000
> sqlContext.range(10).write.mode("overwrite").parquet(dir)
> sqlContext.read.parquet(dir).count() // outputs 1000 instead of 10 <-- we 
> are still using the cached dataset
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15822) segmentation violation in o.a.s.unsafe.types.UTF8String

2016-06-10 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325417#comment-15325417
 ] 

Davies Liu commented on SPARK-15822:


The latest stack trace is different from the previous one; it seems that the 
UnsafeRow in the aggregate hash map is corrupted.

How large is your dataset? It would be great if we could reproduce it (no luck 
so far).

> segmentation violation in o.a.s.unsafe.types.UTF8String 
> 
>
> Key: SPARK-15822
> URL: https://issues.apache.org/jira/browse/SPARK-15822
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: linux amd64
> openjdk version "1.8.0_91"
> OpenJDK Runtime Environment (build 1.8.0_91-b14)
> OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)
>Reporter: Pete Robbins
>Assignee: Herman van Hovell
>Priority: Blocker
>
> Executors fail with segmentation violation while running application with
> spark.memory.offHeap.enabled true
> spark.memory.offHeap.size 512m
> Also now reproduced with 
> spark.memory.offHeap.enabled false
> {noformat}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7f4559b4d4bd, pid=14182, tid=139935319750400
> #
> # JRE version: OpenJDK Runtime Environment (8.0_91-b14) (build 1.8.0_91-b14)
> # Java VM: OpenJDK 64-Bit Server VM (25.91-b14 mixed mode linux-amd64 
> compressed oops)
> # Problematic frame:
> # J 4816 C2 
> org.apache.spark.unsafe.types.UTF8String.compareTo(Lorg/apache/spark/unsafe/types/UTF8String;)I
>  (64 bytes) @ 0x7f4559b4d4bd [0x7f4559b4d460+0x5d]
> {noformat}
> We initially saw this with IBM Java on a PowerPC box, but it is recreatable on 
> Linux with OpenJDK. On Linux with IBM Java 8 we see a NullPointerException at 
> the same code point:
> {noformat}
> 16/06/08 11:14:58 ERROR Executor: Exception in task 1.0 in stage 5.0 (TID 48)
> java.lang.NullPointerException
>   at 
> org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:831)
>   at org.apache.spark.unsafe.types.UTF8String.compare(UTF8String.java:844)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$doExecute$2$$anon$2.hasNext(WholeStageCodegenExec.scala:377)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
>   at org.spark_project.guava.collect.Ordering.leastOf(Ordering.java:664)
>   at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1365)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1362)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.lang.Thread.run(Thread.java:785)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15654) Reading gzipped files results in duplicate rows

2016-06-10 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-15654.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13531
[https://github.com/apache/spark/pull/13531]

> Reading gzipped files results in duplicate rows
> ---
>
> Key: SPARK-15654
> URL: https://issues.apache.org/jira/browse/SPARK-15654
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jurriaan Pruis
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 2.0.0
>
>
> When gzipped files are larger than {{spark.sql.files.maxPartitionBytes}}, 
> reading the file will result in duplicate rows in the DataFrame.
> Given an example gzipped wordlist (of 740K bytes):
> {code}
> $ gzcat words.gz |wc -l
> 235886
> {code}
> Reading it using spark results in the following output:
> {code}
> >>> sqlContext.setConf('spark.sql.files.maxPartitionBytes', '1000')
> >>> sqlContext.read.text("/Users/jurriaanpruis/spark/words.gz").count()
> 81244093
> >>> sqlContext.setConf('spark.sql.files.maxPartitionBytes', '1')
> >>> sqlContext.read.text("/Users/jurriaanpruis/spark/words.gz").count()
> 8348566
> >>> sqlContext.setConf('spark.sql.files.maxPartitionBytes', '10')
> >>> sqlContext.read.text("/Users/jurriaanpruis/spark/words.gz").count()
> 1051469
> >>> sqlContext.setConf('spark.sql.files.maxPartitionBytes', '100')
> >>> sqlContext.read.text("/Users/jurriaanpruis/spark/words.gz").count()
> 235886
> {code}
> You can clearly see how the number of rows scales with the number of 
> partitions.
> Somehow the data is duplicated whenever there is more than one partition 
> (and, as seen above, the duplication scales roughly with the number of 
> partitions, i.e. inversely with the partition size).
> When using distinct you'll get the correct answer:
> {code}
> >>> sqlContext.setConf('spark.sql.files.maxPartitionBytes', '1')
> >>> sqlContext.read.text("/Users/jurriaanpruis/spark/words.gz").distinct().count()
> 235886
> {code}
> This looks like a pretty serious bug.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15825) sort-merge-join gives invalid results when joining on a tupled key

2016-06-10 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-15825.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13589
[https://github.com/apache/spark/pull/13589]

> sort-merge-join gives invalid results when joining on a tupled key
> --
>
> Key: SPARK-15825
> URL: https://issues.apache.org/jira/browse/SPARK-15825
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: spark 2.0.0-SNAPSHOT
>Reporter: Andres Perez
>Assignee: Herman van Hovell
> Fix For: 2.0.0
>
>
> {noformat}
>   import org.apache.spark.sql.functions
>   val left = List("0", "1", "2").toDS()
> .map{ k => ((k, 0), "l") }
>   val right = List("0", "1", "2").toDS()
> .map{ k => ((k, 0), "r") }
>   val result = left.toDF("k", "v").as[((String, Int), String)].alias("left")
> .joinWith(right.toDF("k", "v").as[((String, Int), 
> String)].alias("right"), functions.col("left.k") === 
> functions.col("right.k"), "inner")
> .as[(((String, Int), String), ((String, Int), String))]
> {noformat}
> When broadcast joins are enabled, we get the expected output:
> {noformat}
> (((0,0),l),((0,0),r))
> (((1,0),l),((1,0),r))
> (((2,0),l),((2,0),r))
> {noformat}
> However, when broadcast joins are disabled (i.e. setting 
> spark.sql.autoBroadcastJoinThreshold to -1), the result is incorrect:
> {noformat}
> (((2,0),l),((2,-1),))
> (((0,0),l),((0,-313907893),))
> (((1,0),l),((null,-313907893),))
> {noformat}
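> (For anyone reproducing this, a minimal way to force the sort-merge-join path; 
> the setting name is the standard SQL conf key and this assumes a 2.0 
> SparkSession named {{spark}}:)
> {code}
> // Disable broadcast joins so the join above goes through SortMergeJoin.
> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
> // re-run the joinWith(...) snippet from the description, then restore the 10 MB default:
> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)
> {code}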



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15822) segmentation violation in o.a.s.unsafe.types.UTF8String

2016-06-10 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15324907#comment-15324907
 ] 

Davies Liu commented on SPARK-15822:


SortMergeJoin assumes that the keys do not contain nulls (we insert an 
IsNotNull predicate for inner joins), so it seems something is going wrong 
here. Could you post the query plan?
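
(A quick way to capture that, assuming two DataFrames {{dfL}} and {{dfR}} joined 
on a column {{k}}; the names are placeholders:)

{code}
// Print the parsed/analyzed/optimized/physical plans; for an inner sort-merge join
// the optimizer should have pushed a Filter isnotnull(k#...) below the join on both sides.
val joined = dfL.join(dfR, dfL("k") === dfR("k"), "inner")
joined.explain(true)
{code}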

> segmentation violation in o.a.s.unsafe.types.UTF8String 
> 
>
> Key: SPARK-15822
> URL: https://issues.apache.org/jira/browse/SPARK-15822
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: linux amd64
> openjdk version "1.8.0_91"
> OpenJDK Runtime Environment (build 1.8.0_91-b14)
> OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)
>Reporter: Pete Robbins
>Assignee: Herman van Hovell
>Priority: Blocker
>
> Executors fail with segmentation violation while running application with
> spark.memory.offHeap.enabled true
> spark.memory.offHeap.size 512m
> Also now reproduced with 
> spark.memory.offHeap.enabled false
> {noformat}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7f4559b4d4bd, pid=14182, tid=139935319750400
> #
> # JRE version: OpenJDK Runtime Environment (8.0_91-b14) (build 1.8.0_91-b14)
> # Java VM: OpenJDK 64-Bit Server VM (25.91-b14 mixed mode linux-amd64 
> compressed oops)
> # Problematic frame:
> # J 4816 C2 
> org.apache.spark.unsafe.types.UTF8String.compareTo(Lorg/apache/spark/unsafe/types/UTF8String;)I
>  (64 bytes) @ 0x7f4559b4d4bd [0x7f4559b4d460+0x5d]
> {noformat}
> We initially saw this with IBM Java on a PowerPC box, but it is recreatable on 
> Linux with OpenJDK. On Linux with IBM Java 8 we see a NullPointerException at 
> the same code point:
> {noformat}
> 16/06/08 11:14:58 ERROR Executor: Exception in task 1.0 in stage 5.0 (TID 48)
> java.lang.NullPointerException
>   at 
> org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:831)
>   at org.apache.spark.unsafe.types.UTF8String.compare(UTF8String.java:844)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$doExecute$2$$anon$2.hasNext(WholeStageCodegenExec.scala:377)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
>   at org.spark_project.guava.collect.Ordering.leastOf(Ordering.java:664)
>   at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1365)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1362)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.lang.Thread.run(Thread.java:785)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15822) segmentation violation in o.a.s.unsafe.types.UTF8String

2016-06-10 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15324875#comment-15324875
 ] 

Davies Liu commented on SPARK-15822:


Could you try disabling whole-stage codegen to see whether this is related to 
it or not?
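
(For reference, a minimal way to do that in a spark-shell or notebook session, 
assuming a 2.0 SparkSession named {{spark}}:)

{code}
// Turn whole-stage codegen off for the session, re-run the failing query, then re-enable it.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
// ... re-run the query that crashed ...
spark.conf.set("spark.sql.codegen.wholeStage", "true")
{code}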

> segmentation violation in o.a.s.unsafe.types.UTF8String 
> 
>
> Key: SPARK-15822
> URL: https://issues.apache.org/jira/browse/SPARK-15822
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: linux amd64
> openjdk version "1.8.0_91"
> OpenJDK Runtime Environment (build 1.8.0_91-b14)
> OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)
>Reporter: Pete Robbins
>Assignee: Herman van Hovell
>Priority: Blocker
>
> Executors fail with segmentation violation while running application with
> spark.memory.offHeap.enabled true
> spark.memory.offHeap.size 512m
> Also now reproduced with 
> spark.memory.offHeap.enabled false
> {noformat}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7f4559b4d4bd, pid=14182, tid=139935319750400
> #
> # JRE version: OpenJDK Runtime Environment (8.0_91-b14) (build 1.8.0_91-b14)
> # Java VM: OpenJDK 64-Bit Server VM (25.91-b14 mixed mode linux-amd64 
> compressed oops)
> # Problematic frame:
> # J 4816 C2 
> org.apache.spark.unsafe.types.UTF8String.compareTo(Lorg/apache/spark/unsafe/types/UTF8String;)I
>  (64 bytes) @ 0x7f4559b4d4bd [0x7f4559b4d460+0x5d]
> {noformat}
> We initially saw this with IBM Java on a PowerPC box, but it is recreatable on 
> Linux with OpenJDK. On Linux with IBM Java 8 we see a NullPointerException at 
> the same code point:
> {noformat}
> 16/06/08 11:14:58 ERROR Executor: Exception in task 1.0 in stage 5.0 (TID 48)
> java.lang.NullPointerException
>   at 
> org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:831)
>   at org.apache.spark.unsafe.types.UTF8String.compare(UTF8String.java:844)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$doExecute$2$$anon$2.hasNext(WholeStageCodegenExec.scala:377)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
>   at org.spark_project.guava.collect.Ordering.leastOf(Ordering.java:664)
>   at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1365)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1362)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.lang.Thread.run(Thread.java:785)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15433) PySpark core test should not use SerDe from PythonMLLibAPI

2016-06-09 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-15433:
---
Assignee: Liang-Chi Hsieh

> PySpark core test should not use SerDe from PythonMLLibAPI
> --
>
> Key: SPARK-15433
> URL: https://issues.apache.org/jira/browse/SPARK-15433
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Minor
> Fix For: 2.1.0
>
>
> Currently the PySpark core tests use the SerDe from PythonMLLibAPI, which pulls 
> in many MLlib dependencies. They should use SerDeUtil instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14670) Allow updating SQLMetrics on driver

2016-06-08 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-14670:
---
Assignee: Wenchen Fan  (was: Andrew Or)

> Allow updating SQLMetrics on driver
> ---
>
> Key: SPARK-14670
> URL: https://issues.apache.org/jira/browse/SPARK-14670
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>
> On the SparkUI right now we have this SQLTab that displays accumulator values 
> per operator. However, it only displays metrics updated on the executors, not 
> on the driver. It is useful to also include driver metrics, e.g. broadcast 
> time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14670) Allow updating SQLMetrics on driver

2016-06-08 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-14670.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13189
[https://github.com/apache/spark/pull/13189]

> Allow updating SQLMetrics on driver
> ---
>
> Key: SPARK-14670
> URL: https://issues.apache.org/jira/browse/SPARK-14670
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>
> On the SparkUI right now we have this SQLTab that displays accumulator values 
> per operator. However, it only displays metrics updated on the executors, not 
> on the driver. It is useful to also include driver metrics, e.g. broadcast 
> time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15791) NPE in ScalarSubquery

2016-06-06 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-15791:
---
Assignee: Eric Liang  (was: Davies Liu)

> NPE in ScalarSubquery
> -
>
> Key: SPARK-15791
> URL: https://issues.apache.org/jira/browse/SPARK-15791
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Eric Liang
>
> {code}
> Job aborted due to stage failure: Task 0 in stage 146.0 failed 4 times, most 
> recent failure: Lost task 0.3 in stage 146.0 (TID 48828, 10.0.206.208): 
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.ScalarSubquery.dataType(subquery.scala:45)
>   at 
> org.apache.spark.sql.catalyst.expressions.CaseWhenBase.dataType(conditionalExpressions.scala:103)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:165)
>   at 
> org.apache.spark.sql.execution.ProjectExec$$anonfun$output$1.apply(basicPhysicalOperators.scala:33)
>   at 
> org.apache.spark.sql.execution.ProjectExec$$anonfun$output$1.apply(basicPhysicalOperators.scala:33)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.execution.ProjectExec.output(basicPhysicalOperators.scala:33)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.output(WholeStageCodegenExec.scala:291)
>   at 
> org.apache.spark.sql.execution.DeserializeToObjectExec$$anonfun$2.apply(objects.scala:85)
>   at 
> org.apache.spark.sql.execution.DeserializeToObjectExec$$anonfun$2.apply(objects.scala:84)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:775)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:775)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15791) NPE in ScalarSubquery

2016-06-06 Thread Davies Liu (JIRA)
Davies Liu created SPARK-15791:
--

 Summary: NPE in ScalarSubquery
 Key: SPARK-15791
 URL: https://issues.apache.org/jira/browse/SPARK-15791
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu


{code}
Job aborted due to stage failure: Task 0 in stage 146.0 failed 4 times, most 
recent failure: Lost task 0.3 in stage 146.0 (TID 48828, 10.0.206.208): 
java.lang.NullPointerException
at 
org.apache.spark.sql.execution.ScalarSubquery.dataType(subquery.scala:45)
at 
org.apache.spark.sql.catalyst.expressions.CaseWhenBase.dataType(conditionalExpressions.scala:103)
at 
org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:165)
at 
org.apache.spark.sql.execution.ProjectExec$$anonfun$output$1.apply(basicPhysicalOperators.scala:33)
at 
org.apache.spark.sql.execution.ProjectExec$$anonfun$output$1.apply(basicPhysicalOperators.scala:33)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.execution.ProjectExec.output(basicPhysicalOperators.scala:33)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec.output(WholeStageCodegenExec.scala:291)
at 
org.apache.spark.sql.execution.DeserializeToObjectExec$$anonfun$2.apply(objects.scala:85)
at 
org.apache.spark.sql.execution.DeserializeToObjectExec$$anonfun$2.apply(objects.scala:84)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:775)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:775)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15654) Reading gzipped files results in duplicate rows

2016-06-06 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-15654:
--

Assignee: Davies Liu  (was: Takeshi Yamamuro)

> Reading gzipped files results in duplicate rows
> ---
>
> Key: SPARK-15654
> URL: https://issues.apache.org/jira/browse/SPARK-15654
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jurriaan Pruis
>Assignee: Davies Liu
>Priority: Blocker
>
> When gzipped files are larger than {{spark.sql.files.maxPartitionBytes}}, 
> reading the file will result in duplicate rows in the DataFrame.
> Given an example gzipped wordlist (of 740K bytes):
> {code}
> $ gzcat words.gz |wc -l
> 235886
> {code}
> Reading it using spark results in the following output:
> {code}
> >>> sqlContext.setConf('spark.sql.files.maxPartitionBytes', '1000')
> >>> sqlContext.read.text("/Users/jurriaanpruis/spark/words.gz").count()
> 81244093
> >>> sqlContext.setConf('spark.sql.files.maxPartitionBytes', '1')
> >>> sqlContext.read.text("/Users/jurriaanpruis/spark/words.gz").count()
> 8348566
> >>> sqlContext.setConf('spark.sql.files.maxPartitionBytes', '10')
> >>> sqlContext.read.text("/Users/jurriaanpruis/spark/words.gz").count()
> 1051469
> >>> sqlContext.setConf('spark.sql.files.maxPartitionBytes', '100')
> >>> sqlContext.read.text("/Users/jurriaanpruis/spark/words.gz").count()
> 235886
> {code}
> You can clearly see how the number of rows scales with the number of 
> partitions.
> Somehow the data is duplicated whenever there is more than one partition 
> (and, as seen above, the duplication scales roughly with the number of 
> partitions, i.e. inversely with the partition size).
> When using distinct you'll get the correct answer:
> {code}
> >>> sqlContext.setConf('spark.sql.files.maxPartitionBytes', '1')
> >>> sqlContext.read.text("/Users/jurriaanpruis/spark/words.gz").distinct().count()
> 235886
> {code}
> This looks like a pretty serious bug.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15391) Spark executor OOM during TimSort

2016-06-03 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-15391.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13318
[https://github.com/apache/spark/pull/13318]

> Spark executor OOM during TimSort
> -
>
> Key: SPARK-15391
> URL: https://issues.apache.org/jira/browse/SPARK-15391
> Project: Spark
>  Issue Type: Bug
>Reporter: Sital Kedia
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> While running a query, we are seeing a lot of executor OOM while doing 
> TimSort.
> Stack trace - 
> {code}
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSortDataFormat.allocate(UnsafeSortDataFormat.java:86)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSortDataFormat.allocate(UnsafeSortDataFormat.java:32)
>   at 
> org.apache.spark.util.collection.TimSort$SortState.ensureCapacity(TimSort.java:951)
>   at 
> org.apache.spark.util.collection.TimSort$SortState.mergeLo(TimSort.java:699)
>   at 
> org.apache.spark.util.collection.TimSort$SortState.mergeAt(TimSort.java:525)
>   at 
> org.apache.spark.util.collection.TimSort$SortState.mergeCollapse(TimSort.java:453)
>   at 
> org.apache.spark.util.collection.TimSort$SortState.access$200(TimSort.java:325)
>   at org.apache.spark.util.collection.TimSort.sort(TimSort.java:153)
>   at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:235)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:198)
>   at org.apache.spark.memory.MemoryConsumer.spill(MemoryConsumer.java:58)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:356)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:91)
> {code}
> Out of the total 32g available to the executors, we are allocating 24g on-heap 
> and 8g off-heap memory. Looking at the code 
> (https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSortDataFormat.java#L87),
> we see that during TimSort the memory buffer is allocated on-heap irrespective 
> of the memory mode, which is the reason for the executor OOM.
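> (A hedged sketch of the direction a fix could take, not the actual patch: route 
> the sort scratch buffer allocation through a {{MemoryConsumer}} so it honours 
> the configured memory mode instead of always building an on-heap {{long[]}}. 
> The helper name below is hypothetical.)
> {code}
> import org.apache.spark.memory.MemoryConsumer
> import org.apache.spark.unsafe.array.LongArray
>
> // Hypothetical helper: allocate the TimSort scratch buffer through the task's
> // MemoryConsumer so the allocation is tracked and respects on-heap/off-heap mode.
> def allocateScratch(consumer: MemoryConsumer, numRecords: Int): LongArray =
>   consumer.allocateArray(numRecords.toLong * 2)   // 2 long slots per record, as in the sorter
> {code}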



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15759) Fallback to non-codegen if fail to compile generated code

2016-06-03 Thread Davies Liu (JIRA)
Davies Liu created SPARK-15759:
--

 Summary: Fallback to non-codegen if fail to compile generated code
 Key: SPARK-15759
 URL: https://issues.apache.org/jira/browse/SPARK-15759
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Davies Liu
Assignee: Davies Liu


If anything goes wrong in whole-stage codegen, we should temporarily disable it for 
that part of the query to make sure the query can still run.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15671) performance regression CoalesceRDD large # partitions

2016-06-01 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-15671.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13443
[https://github.com/apache/spark/pull/13443]

> performance regression CoalesceRDD large # partitions
> -
>
> Key: SPARK-15671
> URL: https://issues.apache.org/jira/browse/SPARK-15671
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Thomas Graves
>Priority: Critical
> Fix For: 2.0.0
>
>
> I was running a 15TB join job with 202000 partitions. It looks like the 
> changes I made to CoalesceRDD in pickBin() are really slow with that many 
> partitions. The array filter over that many elements just takes too long: 
> it took about an hour to pick bins for all the partitions.
> original change:
> https://github.com/apache/spark/commit/83ee92f60345f016a390d61a82f1d924f64ddf90
> Just reverting the pickBin code back to get currpreflocs fixes the issue
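> (Rough arithmetic for the slowdown, under the assumption that the filter over 
> the full partition array runs once per partition being coalesced:)
> {code}
> val numPartitions = 202000L
> val elementScans = numPartitions * numPartitions   // ~4.1e10 scans, i.e. quadratic in the partition count
> {code}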



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15671) performance regression CoalesceRDD large # partitions

2016-06-01 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-15671:
---
Assignee: Thomas Graves

> performance regression CoalesceRDD large # partitions
> -
>
> Key: SPARK-15671
> URL: https://issues.apache.org/jira/browse/SPARK-15671
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Critical
> Fix For: 2.0.0
>
>
> I was running a 15TB join job with 202000 partitions. It looks like the 
> changes I made to CoalesceRDD in pickBin() are really slow with that many 
> partitions. The array filter over that many elements just takes too long: 
> it took about an hour to pick bins for all the partitions.
> original change:
> https://github.com/apache/spark/commit/83ee92f60345f016a390d61a82f1d924f64ddf90
> Just reverting the pickBin code back to get currpreflocs fixes the issue



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15557) expression ((cast(99 as decimal) + '3') * '2.3' ) return null

2016-05-31 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-15557:
---
Assignee: Dilip Biswal

> expression ((cast(99 as decimal) + '3') * '2.3' ) return null
> -
>
> Key: SPARK-15557
> URL: https://issues.apache.org/jira/browse/SPARK-15557
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
> Environment: spark1.6.1
>Reporter: cen yuhai
>Assignee: Dilip Biswal
> Fix For: 2.0.0
>
>
> expression "select  (cast(99 as decimal(19,6))+ '3')*'2.3' " will return null
> expression "select  (cast(40 as decimal(19,6))+ '3')*'2.3' "  is OK
> I find that maybe it will be null if the result is more than 100
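> (The two cases side by side, assuming a plain SQLContext named {{sqlContext}}; 
> 43 * 2.3 = 98.9, just under the threshold observed above:)
> {code}
> sqlContext.sql("select (cast(99 as decimal(19,6)) + '3') * '2.3'").show()  // NULL before the fix
> sqlContext.sql("select (cast(40 as decimal(19,6)) + '3') * '2.3'").show()  // 98.9
> {code}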



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15557) expression ((cast(99 as decimal) + '3') * '2.3' ) return null

2016-05-31 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-15557.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13368
[https://github.com/apache/spark/pull/13368]

> expression ((cast(99 as decimal) + '3') * '2.3' ) return null
> -
>
> Key: SPARK-15557
> URL: https://issues.apache.org/jira/browse/SPARK-15557
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
> Environment: spark1.6.1
>Reporter: cen yuhai
> Fix For: 2.0.0
>
>
> expression "select  (cast(99 as decimal(19,6))+ '3')*'2.3' " will return null
> expression "select  (cast(40 as decimal(19,6))+ '3')*'2.3' "  is OK
> I find that maybe it will be null if the result is more than 100



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15327) Catalyst code generation fails with complex data structure

2016-05-31 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-15327.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13235
[https://github.com/apache/spark/pull/13235]

> Catalyst code generation fails with complex data structure
> --
>
> Key: SPARK-15327
> URL: https://issues.apache.org/jira/browse/SPARK-15327
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jurriaan Pruis
>Assignee: Davies Liu
> Fix For: 2.0.0
>
> Attachments: full_exception.txt
>
>
> Spark code generation fails with the following error when loading parquet 
> files with a complex structure:
> {code}
> : java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 158, Column 16: Expression "scan_isNull" is not an 
> rvalue
> {code}
> The generated code on line 158 looks like:
> {code}
> /* 153 */ this.scan_arrayWriter23 = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter();
> /* 154 */ this.scan_rowWriter40 = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(scan_holder,
>  1);
> /* 155 */   }
> /* 156 */   
> /* 157 */   private void scan_apply_0(InternalRow scan_row) {
> /* 158 */ if (scan_isNull) {
> /* 159 */   scan_rowWriter.setNullAt(0);
> /* 160 */ } else {
> /* 161 */   // Remember the current cursor so that we can calculate how 
> many bytes are
> /* 162 */   // written later.
> /* 163 */   final int scan_tmpCursor = scan_holder.cursor;
> /* 164 */   
> {code}
> How to reproduce (Pyspark): 
> {code}
> # Some complex structure
> json = '{"h": {"b": {"c": [{"e": "adfgd"}], "a": [{"e": "testing", "count": 
> 3}], "b": [{"e": "test", "count": 1}]}}, "d": {"b": {"c": [{"e": "adfgd"}], 
> "a": [{"e": "testing", "count": 3}], "b": [{"e": "test", "count": 1}]}}, "c": 
> {"b": {"c": [{"e": "adfgd"}], "a": [{"count": 3}], "b": [{"e": "test", 
> "count": 1}]}}, "a": {"b": {"c": [{"e": "adfgd"}], "a": [{"count": 3}], "b": 
> [{"e": "test", "count": 1}]}}, "e": {"b": {"c": [{"e": "adfgd"}], "a": [{"e": 
> "testing", "count": 3}], "b": [{"e": "test", "count": 1}]}}, "g": {"b": {"c": 
> [{"e": "adfgd"}], "a": [{"e": "testing", "count": 3}], "b": [{"e": "test", 
> "count": 1}]}}, "f": {"b": {"c": [{"e": "adfgd"}], "a": [{"e": "testing", 
> "count": 3}], "b": [{"e": "test", "count": 1}]}}, "b": {"b": {"c": [{"e": 
> "adfgd"}], "a": [{"count": 3}], "b": [{"e": "test", "count": 1}]}}}'
> # Write to parquet file
> sqlContext.read.json(sparkContext.parallelize([json])).write.mode('overwrite').parquet('test')
> # Try to read from parquet file (this generates an exception)
> sqlContext.read.parquet('test').collect()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11293) Spillable collections leak shuffle memory

2016-05-31 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15308673#comment-15308673
 ] 

Davies Liu commented on SPARK-11293:


I think your patch is not related to this bug, right?

> Spillable collections leak shuffle memory
> -
>
> Key: SPARK-11293
> URL: https://issues.apache.org/jira/browse/SPARK-11293
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.4.1, 1.5.1, 1.6.0, 1.6.1
>Reporter: Josh Rosen
>Assignee: Davies Liu
>Priority: Critical
>
> I discovered multiple leaks of shuffle memory while working on my memory 
> manager consolidation patch, which added the ability to do strict memory leak 
> detection for the bookkeeping that used to be performed by the 
> ShuffleMemoryManager. This uncovered a handful of places where tasks can 
> acquire execution/shuffle memory but never release it, starving themselves of 
> memory.
> Problems that I found:
> * {{ExternalSorter.stop()}} should release the sorter's shuffle/execution 
> memory.
> * BlockStoreShuffleReader should call {{ExternalSorter.stop()}} using a 
> {{CompletionIterator}}.
> * {{ExternalAppendOnlyMap}} exposes no equivalent of {{stop()}} for freeing 
> its resources.
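> (A self-contained analogue of the {{CompletionIterator}} pattern mentioned in 
> the second bullet, assuming nothing beyond plain Scala: wrap an iterator so a 
> cleanup callback fires exactly once when it is exhausted.)
> {code}
> class CleanupIterator[A](underlying: Iterator[A], cleanup: () => Unit) extends Iterator[A] {
>   private var cleaned = false
>   override def hasNext: Boolean = {
>     val more = underlying.hasNext
>     if (!more && !cleaned) { cleaned = true; cleanup() }  // free resources as soon as we finish
>     more
>   }
>   override def next(): A = underlying.next()
> }
>
> // e.g. new CleanupIterator(sorter.iterator, () => sorter.stop()) in the shuffle read path
> {code}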



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11293) Spillable collections leak shuffle memory

2016-05-31 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-11293:
--

Assignee: Davies Liu

> Spillable collections leak shuffle memory
> -
>
> Key: SPARK-11293
> URL: https://issues.apache.org/jira/browse/SPARK-11293
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.4.1, 1.5.1, 1.6.0, 1.6.1
>Reporter: Josh Rosen
>Assignee: Davies Liu
>Priority: Critical
>
> I discovered multiple leaks of shuffle memory while working on my memory 
> manager consolidation patch, which added the ability to do strict memory leak 
> detection for the bookkeeping that used to be performed by the 
> ShuffleMemoryManager. This uncovered a handful of places where tasks can 
> acquire execution/shuffle memory but never release it, starving themselves of 
> memory.
> Problems that I found:
> * {{ExternalSorter.stop()}} should release the sorter's shuffle/execution 
> memory.
> * BlockStoreShuffleReader should call {{ExternalSorter.stop()}} using a 
> {{CompletionIterator}}.
> * {{ExternalAppendOnlyMap}} exposes no equivalent of {{stop()}} for freeing 
> its resources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15568) TimSort and RadixSort can't support more than 1 billions elements

2016-05-26 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-15568:
---
Summary: TimSort and RadixSort can't support more than 1 billions elements  
(was: TimSort and RadixSort can't support more than 2 billions elements)

> TimSort and RadixSort can't support more than 1 billions elements
> -
>
> Key: SPARK-15568
> URL: https://issues.apache.org/jira/browse/SPARK-15568
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Both TimSort and RadixSort use int as the type for index and length, so they 
> will overflow when there are more than about 2 billion elements in the array.
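> (Back-of-the-envelope for the title change from 2 billion to roughly 1 billion, 
> under the assumption that the in-memory sorter keeps two long slots, pointer 
> plus key prefix, per record:)
> {code}
> val maxIndex = Int.MaxValue                  // 2147483647 array slots addressable with an Int
> val slotsPerRecord = 2                       // assumed: record pointer + key prefix
> val maxRecords = maxIndex / slotsPerRecord   // ~1.07 billion records before overflow
> {code}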



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15568) TimSort and RadixSort can't support more than 2 billions elements

2016-05-26 Thread Davies Liu (JIRA)
Davies Liu created SPARK-15568:
--

 Summary: TimSort and RadixSort can't support more than 2 billions 
elements
 Key: SPARK-15568
 URL: https://issues.apache.org/jira/browse/SPARK-15568
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.0
Reporter: Davies Liu
Assignee: Davies Liu


Both TimSort and RadixSort use int as the type for index and length, so they will 
overflow when there are more than about 2 billion elements in the array.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15554) Duplicated executors in Spark UI

2016-05-26 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15301627#comment-15301627
 ] 

Davies Liu commented on SPARK-15554:


cc [~zsxwing]

> Duplicated executors in Spark UI
> 
>
> Key: SPARK-15554
> URL: https://issues.apache.org/jira/browse/SPARK-15554
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Davies Liu
>
> In the Executors tab of the Spark UI, one executor is counted and listed 
> twice:
> {code}
> 6 10.0.201.187:60873  Dead2   58.7 KB / 15.3 GB   0.0 B   
> 4   0   5   61  66  13.61 h (15.8 m)0.0 B   6.6 
> GB  48.0 GB 
> stdout
> stderr
> 6 10.0.201.187:60873  Dead3   77.1 KB / 15.3 GB   0.0 B   
> 4   0   5   61  66  13.61 h (15.8 m)0.0 B   6.6 
> GB  48.0 GB 
> stdout
> stderr
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15554) Duplicated executors in Spark UI

2016-05-26 Thread Davies Liu (JIRA)
Davies Liu created SPARK-15554:
--

 Summary: Duplicated executors in Spark UI
 Key: SPARK-15554
 URL: https://issues.apache.org/jira/browse/SPARK-15554
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.0
Reporter: Davies Liu


In the Executors tab of the Spark UI, one executor is counted and listed twice:

{code}
6   10.0.201.187:60873  Dead2   58.7 KB / 15.3 GB   0.0 B   
4   0   5   61  66  13.61 h (15.8 m)0.0 B   6.6 GB  
48.0 GB 
stdout
stderr
6   10.0.201.187:60873  Dead3   77.1 KB / 15.3 GB   0.0 B   
4   0   5   61  66  13.61 h (15.8 m)0.0 B   6.6 GB  
48.0 GB 
stdout
stderr

{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15332) OutOfMemory in TimSort

2016-05-25 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-15332:
--

Assignee: Davies Liu

> OutOfMemory in TimSort 
> ---
>
> Key: SPARK-15332
> URL: https://issues.apache.org/jira/browse/SPARK-15332
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o154.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 
> in stage 230.0 failed 1 times, most recent failure: Lost task 1.0 in stage 
> 230.0 (TID 1881, localhost): java.lang.OutOfMemoryError: Java heap space
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSortDataFormat.allocate(UnsafeSortDataFormat.java:88)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSortDataFormat.allocate(UnsafeSortDataFormat.java:32)
>   at 
> org.apache.spark.util.collection.TimSort$SortState.ensureCapacity(TimSort.java:951)
>   at 
> org.apache.spark.util.collection.TimSort$SortState.mergeLo(TimSort.java:699)
>   at 
> org.apache.spark.util.collection.TimSort$SortState.mergeAt(TimSort.java:525)
>   at 
> org.apache.spark.util.collection.TimSort$SortState.mergeCollapse(TimSort.java:453)
>   at 
> org.apache.spark.util.collection.TimSort$SortState.access$200(TimSort.java:325)
>   at org.apache.spark.util.collection.TimSort.sort(TimSort.java:153)
>   at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:285)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:199)
>   at 
> org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:175)
>   at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:249)
>   at 
> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:112)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:363)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:378)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:92)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$7$$anon$1.hasNext(WholeStageCodegenExec.scala:357)
>   at 
> org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedBufferedToRowWithNullFreeJoinKey(SortMergeJoinExec.scala:736)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.<init>(SortMergeJoinExec.scala:611)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1$$anon$2.<init>(SortMergeJoinExec.scala:206)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1.apply(SortMergeJoinExec.scala:204)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1.apply(SortMergeJoinExec.scala:100)
>   at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15391) Spark executor OOM during TimSort

2016-05-25 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-15391:
--

Assignee: Davies Liu

> Spark executor OOM during TimSort
> -
>
> Key: SPARK-15391
> URL: https://issues.apache.org/jira/browse/SPARK-15391
> Project: Spark
>  Issue Type: Bug
>Reporter: Sital Kedia
>Assignee: Davies Liu
>
> While running a query, we are seeing a lot of executor OOM while doing 
> TimSort.
> Stack trace - 
> {code}
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSortDataFormat.allocate(UnsafeSortDataFormat.java:86)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSortDataFormat.allocate(UnsafeSortDataFormat.java:32)
>   at 
> org.apache.spark.util.collection.TimSort$SortState.ensureCapacity(TimSort.java:951)
>   at 
> org.apache.spark.util.collection.TimSort$SortState.mergeLo(TimSort.java:699)
>   at 
> org.apache.spark.util.collection.TimSort$SortState.mergeAt(TimSort.java:525)
>   at 
> org.apache.spark.util.collection.TimSort$SortState.mergeCollapse(TimSort.java:453)
>   at 
> org.apache.spark.util.collection.TimSort$SortState.access$200(TimSort.java:325)
>   at org.apache.spark.util.collection.TimSort.sort(TimSort.java:153)
>   at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:235)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:198)
>   at org.apache.spark.memory.MemoryConsumer.spill(MemoryConsumer.java:58)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:356)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:91)
> {code}
> Out of a total of 32g available to the executors, we are allocating 24g onheap and 
> 8g offheap memory. Looking at the code 
> (https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSortDataFormat.java#L87),
>  we see that during TimSort the memory buffer is allocated onheap 
> irrespective of the memory mode, which is the reason for the executor OOM. 
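
The pattern described above can be illustrated without Spark's internals. Below is a minimal, self-contained Scala sketch; the names (TaskMemoryBudget, allocateIgnoringMode, allocateRespectingBudget) are hypothetical and only stand in for the idea of accounting the sort scratch buffer against the task's memory budget instead of allocating an untracked on-heap array.

{code}
// Hypothetical sketch, not Spark's actual internals: contrast an untracked
// on-heap scratch buffer (the problematic pattern described above) with one
// that is first accounted against the memory budget granted to the task.
sealed trait MemoryMode
case object OnHeap extends MemoryMode
case object OffHeap extends MemoryMode

final class TaskMemoryBudget(val mode: MemoryMode, private var granted: Long) {
  // Reserve `bytes` from the task's budget; returns false when nothing is left.
  def acquire(bytes: Long): Boolean =
    if (bytes <= granted) { granted -= bytes; true } else false
}

object SortBufferSketch {
  // Problematic: always allocates a JVM-heap array, regardless of how little
  // on-heap memory the executor actually has. Large sorts can OOM the heap.
  def allocateIgnoringMode(records: Int): Array[Long] =
    new Array[Long](records * 2)

  // Accounted: reserve the bytes from the task's budget first, so the caller
  // can spill (or fail cleanly) instead of hitting java.lang.OutOfMemoryError.
  def allocateRespectingBudget(budget: TaskMemoryBudget, records: Int): Option[Array[Long]] = {
    val bytes = records.toLong * 2 * 8 // two longs per record in this toy layout
    if (budget.acquire(bytes)) Some(new Array[Long](records * 2)) else None
  }

  def main(args: Array[String]): Unit = {
    val budget = new TaskMemoryBudget(OnHeap, granted = 64L * 1024 * 1024)
    println(allocateRespectingBudget(budget, records = 1 << 20).isDefined) // true: fits the budget
  }
}
{code}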



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12795) Whole stage codegen

2016-05-24 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-12795.

   Resolution: Fixed
Fix Version/s: 2.0.0

> Whole stage codegen
> ---
>
> Key: SPARK-12795
> URL: https://issues.apache.org/jira/browse/SPARK-12795
> Project: Spark
>  Issue Type: Epic
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> Whole stage codegen is used by some modern MPP databases to achieve great 
> performance. See http://www.vldb.org/pvldb/vol4/p539-neumann.pdf
> For Spark SQL, we can compile multiple operators into a single Java function 
> to avoid the overhead of materializing rows and chaining Scala iterators.
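
As a rough illustration of the description above, the following plain-Scala sketch contrasts an operator pipeline built from chained iterators with the single fused loop that whole-stage codegen effectively emits. It is a toy example, not Spark's generated code; the operator names in the comments are only analogies.

{code}
// Toy sketch: the same filter -> project -> aggregate pipeline, once as
// chained iterators (one virtual call per operator per row) and once as a
// single fused loop with no intermediate iterators or materialized rows.
object WholeStageSketch {
  val rows: Array[Long] = (1L to 1000000L).toArray

  // Volcano/iterator style: each operator wraps the previous one.
  def iteratorStyle(): Long =
    rows.iterator
      .filter(_ % 2 == 0)   // "Filter" operator
      .map(_ * 3)           // "Project" operator
      .sum                  // "Aggregate" operator

  // Fused style: one tight loop, roughly the shape of the generated function.
  def fusedStyle(): Long = {
    var sum = 0L
    var i = 0
    while (i < rows.length) {
      val v = rows(i)
      if (v % 2 == 0) sum += v * 3
      i += 1
    }
    sum
  }

  def main(args: Array[String]): Unit = {
    assert(iteratorStyle() == fusedStyle())
    println(fusedStyle())
  }
}
{code}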



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-14748) BoundReference should not set ExprCode.code to empty string

2016-05-24 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu closed SPARK-14748.
--
Resolution: Won't Fix

Won't fix this for now.

> BoundReference should not set ExprCode.code to empty string
> ---
>
> Key: SPARK-14748
> URL: https://issues.apache.org/jira/browse/SPARK-14748
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Sameer Agarwal
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12949) Support common expression elimination

2016-05-24 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-12949:
---
Assignee: Liang-Chi Hsieh

> Support common expression elimination
> -
>
> Key: SPARK-12949
> URL: https://issues.apache.org/jira/browse/SPARK-12949
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12949) Support common expression elimination

2016-05-24 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-12949.

   Resolution: Fixed
Fix Version/s: 2.0.0

> Support common expression elimination
> -
>
> Key: SPARK-12949
> URL: https://issues.apache.org/jira/browse/SPARK-12949
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12949) Support common expression elimination

2016-05-24 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15298537#comment-15298537
 ] 

Davies Liu commented on SPARK-12949:


Common subexpression elimination in Aggregate is supported by SPARK-14951, so 
closing this.
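
For context, "common expression elimination" here means evaluating a subexpression that appears in several aggregate functions only once per input row. A minimal sketch against the Spark 2.0 DataFrame API (the data and column names are made up; exact physical plans vary by version):

{code}
// The subexpression $"x" * $"y" appears in both aggregates and can be
// evaluated once per input row instead of twice.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, sum}

object CseSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("cse-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1.0, 2.0), ("a", 3.0, 4.0), ("b", 5.0, 6.0)).toDF("k", "x", "y")

    // $"x" * $"y" is the shared subexpression across both aggregate functions.
    val agg = df.groupBy($"k").agg(sum($"x" * $"y"), avg($"x" * $"y"))

    agg.explain() // inspect whether the product is computed once per row
    agg.show()
    spark.stop()
  }
}
{code}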

> Support common expression elimination
> -
>
> Key: SPARK-12949
> URL: https://issues.apache.org/jira/browse/SPARK-12949
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13135) Don't print expressions recursively in generated code

2016-05-24 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-13135:
---
Assignee: Dongjoon Hyun

> Don't print expressions recursively in generated code
> -
>
> Key: SPARK-13135
> URL: https://issues.apache.org/jira/browse/SPARK-13135
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Dongjoon Hyun
>
> Our code generation currently prints expressions recursively. For example, 
> for expression "((1 + 1) + 1)", we would print the following:
> "((1 + 1) + 1)"
> "(1 + 1)"
> "1"
> "1"
> We should just print the project list once.
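
A toy Scala sketch of the two behaviors described above (this is not Spark's CodegenContext; the Expr classes below are made up purely to reproduce the kind of output shown):

{code}
// Recursive printing emits a line for every node of the expression tree;
// the proposed behavior prints only the top-level project expression once.
sealed trait Expr { def pretty: String }
case class Lit(v: Int) extends Expr { def pretty: String = v.toString }
case class Add(l: Expr, r: Expr) extends Expr { def pretty: String = s"(${l.pretty} + ${r.pretty})" }

object ExprPrintSketch {
  // Current behavior in the description: every sub-expression gets printed.
  def printRecursively(e: Expr): Unit = {
    println("\"" + e.pretty + "\"")
    e match {
      case Add(l, r) => printRecursively(l); printRecursively(r)
      case _ =>
    }
  }

  // Proposed behavior: print the project-list expression once.
  def printOnce(e: Expr): Unit = println("\"" + e.pretty + "\"")

  def main(args: Array[String]): Unit = {
    val expr = Add(Add(Lit(1), Lit(1)), Lit(1)) // ((1 + 1) + 1)
    printRecursively(expr) // "((1 + 1) + 1)", "(1 + 1)", "1", "1", "1"
    printOnce(expr)        // "((1 + 1) + 1)"
  }
}
{code}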



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13135) Don't print expressions recursively in generated code

2016-05-24 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13135.

  Resolution: Fixed
Target Version/s: 2.0.0

> Don't print expressions recursively in generated code
> -
>
> Key: SPARK-13135
> URL: https://issues.apache.org/jira/browse/SPARK-13135
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Dongjoon Hyun
>
> Our code generation currently prints expressions recursively. For example, 
> for expression "((1 + 1) + 1)", we would print the following:
> "((1 + 1) + 1)"
> "(1 + 1)"
> "1"
> "1"
> We should just print the project list once.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15433) PySpark core test should not use SerDe from PythonMLLibAPI

2016-05-24 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-15433.

   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 13214
[https://github.com/apache/spark/pull/13214]

> PySpark core test should not use SerDe from PythonMLLibAPI
> --
>
> Key: SPARK-15433
> URL: https://issues.apache.org/jira/browse/SPARK-15433
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Reporter: Liang-Chi Hsieh
>Priority: Minor
> Fix For: 2.1.0
>
>
> Currently PySpark core test uses the SerDe from PythonMLLibAPI which includes 
> many MLlib things. It should use SerDeUtil instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14343) Dataframe operations on a partitioned dataset (using partition discovery) return invalid results

2016-05-23 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-14343:
---
Priority: Blocker  (was: Critical)

> Dataframe operations on a partitioned dataset (using partition discovery) 
> return invalid results
> 
>
> Key: SPARK-14343
> URL: https://issues.apache.org/jira/browse/SPARK-14343
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
> Environment: Mac OS X 10.11.4 / Ubuntu 16.04 LTS
>Reporter: Jurriaan Pruis
>Priority: Blocker
>
> When reading a dataset using {{sqlContext.read.text()}} queries on the 
> partitioned column return invalid results.
> h2. How to reproduce:
> h3. Generate datasets
> {code:title=repro.sh}
> #!/bin/sh
> mkdir -p dataset/year=2014
> mkdir -p dataset/year=2015
> echo "data from 2014" > dataset/year=2014/part01.txt
> echo "data from 2015" > dataset/year=2015/part01.txt
> {code}
> {code:title=repro2.sh}
> #!/bin/sh
> mkdir -p dataset2/month=june
> mkdir -p dataset2/month=july
> echo "data from june" > dataset2/month=june/part01.txt
> echo "data from july" > dataset2/month=july/part01.txt
> {code}
> h3. using first dataset
> {code:none}
> >>> df = sqlContext.read.text('dataset')
> ...
> >>> df
> DataFrame[value: string, year: int]
> >>> df.show()
> +--------------+----+
> |         value|year|
> +--------------+----+
> |data from 2014|2014|
> |data from 2015|2015|
> +--------------+----+
> >>> df.select('year').show()
> +----+
> |year|
> +----+
> |  14|
> |  14|
> +----+
> {code}
> This is clearly wrong. Seems like it returns the length of the value column?
> h3. using second dataset
> With another dataset it looks like this:
> {code:none}
> >>> df = sqlContext.read.text('dataset2')
> >>> df
> DataFrame[value: string, month: string]
> >>> df.show()
> +--------------+-----+
> |         value|month|
> +--------------+-----+
> |data from june| june|
> |data from july| july|
> +--------------+-----+
> >>> df.select('month').show()
> +--------------+
> |         month|
> +--------------+
> |data from june|
> |data from july|
> +--------------+
> {code}
> Here it returns the value of the value column instead of the month partition.
> h3. Workaround
> When I convert the dataframe to an RDD and back to a DataFrame I get the 
> following result (which is the expected behaviour):
> {code:none}
> >>> df.rdd.toDF().select('month').show()
> +-----+
> |month|
> +-----+
> | june|
> | july|
> +-----+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14946) Spark 2.0 vs 1.6.1 Query Time(out)

2016-05-23 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15296697#comment-15296697
 ] 

Davies Liu commented on SPARK-14946:


[~raymond.honderd...@sizmek.com] Thanks for the feedback, I'm just curious why 
it picks a broadcast join but hangs here. What is the size in bytes of the two tables 
(the size of the files on disk)?
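
Background for the question: the planner picks a broadcast join when it estimates one side to be below spark.sql.autoBroadcastJoinThreshold (10 MB by default), so the on-disk sizes matter. A minimal sketch, assuming the Spark 2.0 SparkSession API and that the two tables from the report are already registered, of how one could check the threshold and the chosen strategy:

{code}
// Sketch only: check the broadcast threshold and the chosen join strategy,
// then disable auto-broadcast to see whether the hang goes away.
import org.apache.spark.sql.SparkSession

object JoinStrategyCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("join-strategy-check").getOrCreate()

    // Default is 10485760 bytes (10 MB); tables estimated below this get broadcast.
    println(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))

    def plan(): Unit = spark.sql(
      """SELECT * FROM pe_servingdata sd
        |INNER JOIN pe_campaigns_gzip c ON sd.campaignid = c.campaign_id""".stripMargin
    ).explain() // look for BroadcastHashJoin vs SortMergeJoin in the physical plan

    plan()

    // -1 disables auto-broadcast joins, ruling out the broadcast path as the cause.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1L)
    plan()
  }
}
{code}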

> Spark 2.0 vs 1.6.1 Query Time(out)
> --
>
> Key: SPARK-14946
> URL: https://issues.apache.org/jira/browse/SPARK-14946
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Raymond Honderdors
>Priority: Critical
> Attachments: Query Plan 1.6.1.png, resolved timeout  long 
> duration.png, screenshot-spark_2.0.png, spark-defaults.conf, spark-env.sh, 
> version 1.6.1 screen 1 - thrift collect = true.png, version 1.6.1 screen 1 
> thrift collect = false.png, version 1.6.1 screen 2 thrift collect =false.png, 
> version 2.0 -screen 1 thrift collect = false.png, version 2.0 screen 2 thrift 
> collect = true.png, versiuon 2.0 screen 1 thrift collect = true.png
>
>
> I run a query using the JDBC driver. On version 1.6.1 it returns after 5 
> – 6 min; the same query against version 2.0 fails after 2h (due to timeout). 
> For details on how to reproduce, also see the comments below.
> Here is what I tried:
> I run the following query: select * from pe_servingdata sd inner join 
> pe_campaigns_gzip c on sd.campaignid = c.campaign_id ;
> (with and without a counter and group by on campaign_id)
> I run Spark 1.6.1 and the Thriftserver,
> then run the sql from beeline or squirrel; after a few min I get an answer 
> (0 rows), which is correct because my data did not have matching campaign 
> ids in both tables.
> When I run Spark 2.0 and the Thriftserver, I once again run the sql statement and 
> after 2:30 it gives up, but already after 30/60 sec I stop seeing 
> activity on the Spark UI.
> (Sorry for the delay in completing the description of the bug, I was on and 
> off work due to national holidays.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15441) dataset outer join seems to return incorrect result

2016-05-21 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-15441:
---
Assignee: Wenchen Fan

> dataset outer join seems to return incorrect result
> ---
>
> Key: SPARK-15441
> URL: https://issues.apache.org/jira/browse/SPARK-15441
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Wenchen Fan
>Priority: Critical
>
> See notebook
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6122906529858466/2836020637783173/5382278320999420/latest.html
> {code}
> import org.apache.spark.sql.functions
> val left = List(("a", 1), ("a", 2), ("b", 3), ("c", 4)).toDS()
> val right = List(("a", "x"), ("b", "y"), ("d", "z")).toDS()
> // The last row _1 should be null, rather than (null, -1)
> left.toDF("k", "v").as[(String, Int)].alias("left")
>   .joinWith(right.toDF("k", "u").as[(String, String)].alias("right"), 
> functions.col("left.k") === functions.col("right.k"), "right_outer")
>   .show()
> {code}
> The returned result currently is
> {code}
> +---------+-----+
> |       _1|   _2|
> +---------+-----+
> |    (a,2)|(a,x)|
> |    (a,1)|(a,x)|
> |    (b,3)|(b,y)|
> |(null,-1)|(d,z)|
> +---------+-----+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15441) dataset outer join seems to return incorrect result

2016-05-21 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294807#comment-15294807
 ] 

Davies Liu commented on SPARK-15441:


How do we represent a null in a Dataset? If it's a row in which all columns are 
null, then we could transform a row with all columns null into null, 
right? In this case, the left side of the fourth row is all nulls, so it could be 
null.

> dataset outer join seems to return incorrect result
> ---
>
> Key: SPARK-15441
> URL: https://issues.apache.org/jira/browse/SPARK-15441
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> See notebook
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6122906529858466/2836020637783173/5382278320999420/latest.html
> {code}
> import org.apache.spark.sql.functions
> val left = List(("a", 1), ("a", 2), ("b", 3), ("c", 4)).toDS()
> val right = List(("a", "x"), ("b", "y"), ("d", "z")).toDS()
> // The last row _1 should be null, rather than (null, -1)
> left.toDF("k", "v").as[(String, Int)].alias("left")
>   .joinWith(right.toDF("k", "u").as[(String, String)].alias("right"), 
> functions.col("left.k") === functions.col("right.k"), "right_outer")
>   .show()
> {code}
> The returned result currently is
> {code}
> +---------+-----+
> |       _1|   _2|
> +---------+-----+
> |    (a,2)|(a,x)|
> |    (a,1)|(a,x)|
> |    (b,3)|(b,y)|
> |(null,-1)|(d,z)|
> +---------+-----+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15285) Generated SpecificSafeProjection.apply method grows beyond 64 KB

2016-05-21 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294785#comment-15294785
 ] 

Davies Liu commented on SPARK-15285:


[~kiszk] Go ahead, don't know why I can't assign this to you.

> Generated SpecificSafeProjection.apply method grows beyond 64 KB
> 
>
> Key: SPARK-15285
> URL: https://issues.apache.org/jira/browse/SPARK-15285
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Konstantin Shaposhnikov
>
> The following code snippet results in 
> {noformat}
>  org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Ljava/lang/Object;)Ljava/lang/Object;" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection"
>  grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
> {noformat}
> {code}
> case class S100(s1:String="1", s2:String="2", s3:String="3", s4:String="4", 
> s5:String="5", s6:String="6", s7:String="7", s8:String="8", s9:String="9", 
> s10:String="10", s11:String="11", s12:String="12", s13:String="13", 
> s14:String="14", s15:String="15", s16:String="16", s17:String="17", 
> s18:String="18", s19:String="19", s20:String="20", s21:String="21", 
> s22:String="22", s23:String="23", s24:String="24", s25:String="25", 
> s26:String="26", s27:String="27", s28:String="28", s29:String="29", 
> s30:String="30", s31:String="31", s32:String="32", s33:String="33", 
> s34:String="34", s35:String="35", s36:String="36", s37:String="37", 
> s38:String="38", s39:String="39", s40:String="40", s41:String="41", 
> s42:String="42", s43:String="43", s44:String="44", s45:String="45", 
> s46:String="46", s47:String="47", s48:String="48", s49:String="49", 
> s50:String="50", s51:String="51", s52:String="52", s53:String="53", 
> s54:String="54", s55:String="55", s56:String="56", s57:String="57", 
> s58:String="58", s59:String="59", s60:String="60", s61:String="61", 
> s62:String="62", s63:String="63", s64:String="64", s65:String="65", 
> s66:String="66", s67:String="67", s68:String="68", s69:String="69", 
> s70:String="70", s71:String="71", s72:String="72", s73:String="73", 
> s74:String="74", s75:String="75", s76:String="76", s77:String="77", 
> s78:String="78", s79:String="79", s80:String="80", s81:String="81", 
> s82:String="82", s83:String="83", s84:String="84", s85:String="85", 
> s86:String="86", s87:String="87", s88:String="88", s89:String="89", 
> s90:String="90", s91:String="91", s92:String="92", s93:String="93", 
> s94:String="94", s95:String="95", s96:String="96", s97:String="97", 
> s98:String="98", s99:String="99", s100:String="100")
> case class S(s1: S100=S100(), s2: S100=S100(), s3: S100=S100(), s4: 
> S100=S100(), s5: S100=S100(), s6: S100=S100(), s7: S100=S100(), s8: 
> S100=S100(), s9: S100=S100(), s10: S100=S100())
> val ds = Seq(S(),S(),S()).toDS
> ds.show()
> {code}
> I could reproduce this with Spark built from 1.6 branch and with 
> https://home.apache.org/~pwendell/spark-nightly/spark-master-bin/spark-2.0.0-SNAPSHOT-2016_05_11_01_03-8beae59-bin/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15285) Generated SpecificSafeProjection.apply method grows beyond 64 KB

2016-05-21 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-15285:
---
Assignee: (was: Wenchen Fan)

> Generated SpecificSafeProjection.apply method grows beyond 64 KB
> 
>
> Key: SPARK-15285
> URL: https://issues.apache.org/jira/browse/SPARK-15285
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Konstantin Shaposhnikov
>
> The following code snippet results in 
> {noformat}
>  org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Ljava/lang/Object;)Ljava/lang/Object;" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection"
>  grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
> {noformat}
> {code}
> case class S100(s1:String="1", s2:String="2", s3:String="3", s4:String="4", 
> s5:String="5", s6:String="6", s7:String="7", s8:String="8", s9:String="9", 
> s10:String="10", s11:String="11", s12:String="12", s13:String="13", 
> s14:String="14", s15:String="15", s16:String="16", s17:String="17", 
> s18:String="18", s19:String="19", s20:String="20", s21:String="21", 
> s22:String="22", s23:String="23", s24:String="24", s25:String="25", 
> s26:String="26", s27:String="27", s28:String="28", s29:String="29", 
> s30:String="30", s31:String="31", s32:String="32", s33:String="33", 
> s34:String="34", s35:String="35", s36:String="36", s37:String="37", 
> s38:String="38", s39:String="39", s40:String="40", s41:String="41", 
> s42:String="42", s43:String="43", s44:String="44", s45:String="45", 
> s46:String="46", s47:String="47", s48:String="48", s49:String="49", 
> s50:String="50", s51:String="51", s52:String="52", s53:String="53", 
> s54:String="54", s55:String="55", s56:String="56", s57:String="57", 
> s58:String="58", s59:String="59", s60:String="60", s61:String="61", 
> s62:String="62", s63:String="63", s64:String="64", s65:String="65", 
> s66:String="66", s67:String="67", s68:String="68", s69:String="69", 
> s70:String="70", s71:String="71", s72:String="72", s73:String="73", 
> s74:String="74", s75:String="75", s76:String="76", s77:String="77", 
> s78:String="78", s79:String="79", s80:String="80", s81:String="81", 
> s82:String="82", s83:String="83", s84:String="84", s85:String="85", 
> s86:String="86", s87:String="87", s88:String="88", s89:String="89", 
> s90:String="90", s91:String="91", s92:String="92", s93:String="93", 
> s94:String="94", s95:String="95", s96:String="96", s97:String="97", 
> s98:String="98", s99:String="99", s100:String="100")
> case class S(s1: S100=S100(), s2: S100=S100(), s3: S100=S100(), s4: 
> S100=S100(), s5: S100=S100(), s6: S100=S100(), s7: S100=S100(), s8: 
> S100=S100(), s9: S100=S100(), s10: S100=S100())
> val ds = Seq(S(),S(),S()).toDS
> ds.show()
> {code}
> I could reproduce this with Spark built from 1.6 branch and with 
> https://home.apache.org/~pwendell/spark-nightly/spark-master-bin/spark-2.0.0-SNAPSHOT-2016_05_11_01_03-8beae59-bin/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15078) Add all TPCDS 1.4 benchmark queries for SparkSQL

2016-05-20 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-15078.

   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 13188
[https://github.com/apache/spark/pull/13188]

> Add all TPCDS 1.4 benchmark queries for SparkSQL
> 
>
> Key: SPARK-15078
> URL: https://issues.apache.org/jira/browse/SPARK-15078
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sameer Agarwal
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15327) Catalyst code generation fails with complex data structure

2016-05-20 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-15327:
--

Assignee: Davies Liu

> Catalyst code generation fails with complex data structure
> --
>
> Key: SPARK-15327
> URL: https://issues.apache.org/jira/browse/SPARK-15327
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jurriaan Pruis
>Assignee: Davies Liu
> Attachments: full_exception.txt
>
>
> Spark code generation fails with the following error when loading parquet 
> files with a complex structure:
> {code}
> : java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 158, Column 16: Expression "scan_isNull" is not an 
> rvalue
> {code}
> The generated code on line 158 looks like:
> {code}
> /* 153 */ this.scan_arrayWriter23 = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter();
> /* 154 */ this.scan_rowWriter40 = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(scan_holder,
>  1);
> /* 155 */   }
> /* 156 */   
> /* 157 */   private void scan_apply_0(InternalRow scan_row) {
> /* 158 */ if (scan_isNull) {
> /* 159 */   scan_rowWriter.setNullAt(0);
> /* 160 */ } else {
> /* 161 */   // Remember the current cursor so that we can calculate how 
> many bytes are
> /* 162 */   // written later.
> /* 163 */   final int scan_tmpCursor = scan_holder.cursor;
> /* 164 */   
> {code}
> How to reproduce (Pyspark): 
> {code}
> # Some complex structure
> json = '{"h": {"b": {"c": [{"e": "adfgd"}], "a": [{"e": "testing", "count": 
> 3}], "b": [{"e": "test", "count": 1}]}}, "d": {"b": {"c": [{"e": "adfgd"}], 
> "a": [{"e": "testing", "count": 3}], "b": [{"e": "test", "count": 1}]}}, "c": 
> {"b": {"c": [{"e": "adfgd"}], "a": [{"count": 3}], "b": [{"e": "test", 
> "count": 1}]}}, "a": {"b": {"c": [{"e": "adfgd"}], "a": [{"count": 3}], "b": 
> [{"e": "test", "count": 1}]}}, "e": {"b": {"c": [{"e": "adfgd"}], "a": [{"e": 
> "testing", "count": 3}], "b": [{"e": "test", "count": 1}]}}, "g": {"b": {"c": 
> [{"e": "adfgd"}], "a": [{"e": "testing", "count": 3}], "b": [{"e": "test", 
> "count": 1}]}}, "f": {"b": {"c": [{"e": "adfgd"}], "a": [{"e": "testing", 
> "count": 3}], "b": [{"e": "test", "count": 1}]}}, "b": {"b": {"c": [{"e": 
> "adfgd"}], "a": [{"count": 3}], "b": [{"e": "test", "count": 1}]}}}'
> # Write to parquet file
> sqlContext.read.json(sparkContext.parallelize([json])).write.mode('overwrite').parquet('test')
> # Try to read from parquet file (this generates an exception)
> sqlContext.read.parquet('test').collect()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15285) Generated SpecificSafeProjection.apply method grows beyond 64 KB

2016-05-20 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294178#comment-15294178
 ] 

Davies Liu commented on SPARK-15285:


cc [~cloud_fan]

> Generated SpecificSafeProjection.apply method grows beyond 64 KB
> 
>
> Key: SPARK-15285
> URL: https://issues.apache.org/jira/browse/SPARK-15285
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Konstantin Shaposhnikov
>Assignee: Wenchen Fan
>
> The following code snippet results in 
> {noformat}
>  org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Ljava/lang/Object;)Ljava/lang/Object;" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection"
>  grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
> {noformat}
> {code}
> case class S100(s1:String="1", s2:String="2", s3:String="3", s4:String="4", 
> s5:String="5", s6:String="6", s7:String="7", s8:String="8", s9:String="9", 
> s10:String="10", s11:String="11", s12:String="12", s13:String="13", 
> s14:String="14", s15:String="15", s16:String="16", s17:String="17", 
> s18:String="18", s19:String="19", s20:String="20", s21:String="21", 
> s22:String="22", s23:String="23", s24:String="24", s25:String="25", 
> s26:String="26", s27:String="27", s28:String="28", s29:String="29", 
> s30:String="30", s31:String="31", s32:String="32", s33:String="33", 
> s34:String="34", s35:String="35", s36:String="36", s37:String="37", 
> s38:String="38", s39:String="39", s40:String="40", s41:String="41", 
> s42:String="42", s43:String="43", s44:String="44", s45:String="45", 
> s46:String="46", s47:String="47", s48:String="48", s49:String="49", 
> s50:String="50", s51:String="51", s52:String="52", s53:String="53", 
> s54:String="54", s55:String="55", s56:String="56", s57:String="57", 
> s58:String="58", s59:String="59", s60:String="60", s61:String="61", 
> s62:String="62", s63:String="63", s64:String="64", s65:String="65", 
> s66:String="66", s67:String="67", s68:String="68", s69:String="69", 
> s70:String="70", s71:String="71", s72:String="72", s73:String="73", 
> s74:String="74", s75:String="75", s76:String="76", s77:String="77", 
> s78:String="78", s79:String="79", s80:String="80", s81:String="81", 
> s82:String="82", s83:String="83", s84:String="84", s85:String="85", 
> s86:String="86", s87:String="87", s88:String="88", s89:String="89", 
> s90:String="90", s91:String="91", s92:String="92", s93:String="93", 
> s94:String="94", s95:String="95", s96:String="96", s97:String="97", 
> s98:String="98", s99:String="99", s100:String="100")
> case class S(s1: S100=S100(), s2: S100=S100(), s3: S100=S100(), s4: 
> S100=S100(), s5: S100=S100(), s6: S100=S100(), s7: S100=S100(), s8: 
> S100=S100(), s9: S100=S100(), s10: S100=S100())
> val ds = Seq(S(),S(),S()).toDS
> ds.show()
> {code}
> I could reproduce this with Spark built from 1.6 branch and with 
> https://home.apache.org/~pwendell/spark-nightly/spark-master-bin/spark-2.0.0-SNAPSHOT-2016_05_11_01_03-8beae59-bin/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15285) Generated SpecificSafeProjection.apply method grows beyond 64 KB

2016-05-20 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-15285:
---
Assignee: Wenchen Fan

> Generated SpecificSafeProjection.apply method grows beyond 64 KB
> 
>
> Key: SPARK-15285
> URL: https://issues.apache.org/jira/browse/SPARK-15285
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Konstantin Shaposhnikov
>Assignee: Wenchen Fan
>
> The following code snippet results in 
> {noformat}
>  org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Ljava/lang/Object;)Ljava/lang/Object;" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection"
>  grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
> {noformat}
> {code}
> case class S100(s1:String="1", s2:String="2", s3:String="3", s4:String="4", 
> s5:String="5", s6:String="6", s7:String="7", s8:String="8", s9:String="9", 
> s10:String="10", s11:String="11", s12:String="12", s13:String="13", 
> s14:String="14", s15:String="15", s16:String="16", s17:String="17", 
> s18:String="18", s19:String="19", s20:String="20", s21:String="21", 
> s22:String="22", s23:String="23", s24:String="24", s25:String="25", 
> s26:String="26", s27:String="27", s28:String="28", s29:String="29", 
> s30:String="30", s31:String="31", s32:String="32", s33:String="33", 
> s34:String="34", s35:String="35", s36:String="36", s37:String="37", 
> s38:String="38", s39:String="39", s40:String="40", s41:String="41", 
> s42:String="42", s43:String="43", s44:String="44", s45:String="45", 
> s46:String="46", s47:String="47", s48:String="48", s49:String="49", 
> s50:String="50", s51:String="51", s52:String="52", s53:String="53", 
> s54:String="54", s55:String="55", s56:String="56", s57:String="57", 
> s58:String="58", s59:String="59", s60:String="60", s61:String="61", 
> s62:String="62", s63:String="63", s64:String="64", s65:String="65", 
> s66:String="66", s67:String="67", s68:String="68", s69:String="69", 
> s70:String="70", s71:String="71", s72:String="72", s73:String="73", 
> s74:String="74", s75:String="75", s76:String="76", s77:String="77", 
> s78:String="78", s79:String="79", s80:String="80", s81:String="81", 
> s82:String="82", s83:String="83", s84:String="84", s85:String="85", 
> s86:String="86", s87:String="87", s88:String="88", s89:String="89", 
> s90:String="90", s91:String="91", s92:String="92", s93:String="93", 
> s94:String="94", s95:String="95", s96:String="96", s97:String="97", 
> s98:String="98", s99:String="99", s100:String="100")
> case class S(s1: S100=S100(), s2: S100=S100(), s3: S100=S100(), s4: 
> S100=S100(), s5: S100=S100(), s6: S100=S100(), s7: S100=S100(), s8: 
> S100=S100(), s9: S100=S100(), s10: S100=S100())
> val ds = Seq(S(),S(),S()).toDS
> ds.show()
> {code}
> I could reproduce this with Spark built from 1.6 branch and with 
> https://home.apache.org/~pwendell/spark-nightly/spark-master-bin/spark-2.0.0-SNAPSHOT-2016_05_11_01_03-8beae59-bin/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14331) Exceptions saving to parquetFile after join from dataframes in master

2016-05-20 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294168#comment-15294168
 ] 

Davies Liu commented on SPARK-14331:


Could you post the full stacktrace? This exception should be caused by another 
one.

> Exceptions saving to parquetFile after join from dataframes in master
> -
>
> Key: SPARK-14331
> URL: https://issues.apache.org/jira/browse/SPARK-14331
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Thomas Graves
>Priority: Critical
>
> I'm trying to use master and write to a parquet file using a dataframe, 
> but am seeing the exception below. Not sure of the exact state of dataframes right 
> now, so if this is a known issue let me know.
> I read 2 sources of parquet files, joined them, then saved them back.
>  val df_pixels = sqlContext.read.parquet("data1")
> val df_pixels_renamed = df_pixels.withColumnRenamed("photo_id", 
> "pixels_photo_id")
> val df_meta = sqlContext.read.parquet("data2")
> val df = df_meta.as("meta").join(df_pixels_renamed, $"meta.photo_id" === 
> $"pixels_photo_id", "inner").drop("pixels_photo_id")
> df.write.parquet(args(0))
> 16/04/01 17:21:34 ERROR InsertIntoHadoopFsRelation: Aborting job.
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
> Exchange hashpartitioning(pixels_photo_id#3, 2), None
> +- WholeStageCodegen
>:  +- Filter isnotnull(pixels_photo_id#3)
>: +- INPUT
>+- Coalesce 0
>   +- WholeStageCodegen
>  :  +- Project [img_data#0,photo_id#1 AS pixels_photo_id#3]
>  : +- Scan HadoopFiles[img_data#0,photo_id#1] Format: 
> ParquetFormat, PushedFilters: [], ReadSchema: 
> struct
> at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
> at 
> org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:109)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:118)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:118)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:137)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:134)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:117)
> at 
> org.apache.spark.sql.execution.InputAdapter.upstreams(WholeStageCodegen.scala:236)
> at org.apache.spark.sql.execution.Sort.upstreams(Sort.scala:104)
> at 
> org.apache.spark.sql.execution.WholeStageCodegen.doExecute(WholeStageCodegen.scala:351)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:118)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:118)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:137)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:134)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:117)
> at 
> org.apache.spark.sql.execution.InputAdapter.doExecute(WholeStageCodegen.scala:228)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:118)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:118)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:137)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:134)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:117)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-15448) Flaky test:pyspark.ml.tests.DefaultValuesTests.test_java_params

2016-05-20 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu closed SPARK-15448.
--
   Resolution: Duplicate
Fix Version/s: 2.0.0

> Flaky test:pyspark.ml.tests.DefaultValuesTests.test_java_params
> ---
>
> Key: SPARK-15448
> URL: https://issues.apache.org/jira/browse/SPARK-15448
> Project: Spark
>  Issue Type: Test
>Affects Versions: 2.0.0
>Reporter: Davies Liu
> Fix For: 2.0.0
>
>
> {code}
> ==
> FAIL [1.284s]: test_java_params (pyspark.ml.tests.DefaultValuesTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder/python/pyspark/ml/tests.py",
>  line 1161, in test_java_params
> self.check_params(cls())
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder/python/pyspark/ml/tests.py",
>  line 1136, in check_params
> % (p.name, str(py_stage)))
> AssertionError: True != False : Default value mismatch of param 
> linkPredictionCol for Params GeneralizedLinearRegression_4a78b84aab05b0ed2192
> --
> {code}
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3003/consoleFull



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14031) Dataframe to csv IO, system performance enters high CPU state and write operation takes 1 hour to complete

2016-05-20 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-14031:
--

Assignee: Davies Liu

> Dataframe to csv IO, system performance enters high CPU state and write 
> operation takes 1 hour to complete
> --
>
> Key: SPARK-14031
> URL: https://issues.apache.org/jira/browse/SPARK-14031
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.0
> Environment: MACOSX 10.11.2 Macbook Pro 16g - 2.2 GHz Intel Core i7 
> -1TB and Ubuntu14.04 Vagrant 4 Cores 8g
>Reporter: Vincent Ohprecio
>Assignee: Davies Liu
>Priority: Critical
> Attachments: visualVMscreenshot.png
>
>
> Summary
> When using spark-assembly-2.0.0/spark-shell to write out the results of a 
> dataframe to csv, system performance enters a high CPU state and the write 
> operation takes 1 hour to complete. 
> * Affecting: [Stage 5:>  (0 + 2) / 21]
> * Stage 5 elapsed time 348827227ns
> In comparison, tests were conducted using 1.4, 1.5, 1.6 with the same code/data, 
> and Stage 5 csv write times were between 2 - 22 seconds. 
> In addition, Parquet (Stage 3) write tests on 1.4, 1.5, 1.6 and 2.0 were 
> similar, between 2 - 22 seconds.
> Files 
> 1. Data File is "2008.csv"
> 2. Data file download http://stat-computing.org/dataexpo/2009/the-data.html
> 3. Code https://gist.github.com/bigsnarfdude/581b780ce85d7aaecbcb
> Observation 1 - Setup
> High CPU and 58 minute average completion time 
> * MACOSX 10.11.2
> * Macbook Pro 16g - 2.2 GHz Intel Core i7 -1TB 
> * spark-assembly-2.0.0
> * spark-csv_2.11-1.4
> * Code: https://gist.github.com/bigsnarfdude/581b780ce85d7aaecbcb
> Observation 2 - Setup
> High CPU; waited over an hour for the csv write but didn't wait for it to complete 
> * Ubuntu14.04
> * 4cores 8gb
> * spark-assembly-2.0.0
> * spark-csv_2.11-1.4
> Code Output: https://gist.github.com/bigsnarfdude/930f5832c231c3d39651
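
For reference, a hedged approximation of the kind of repro in the linked gists (the actual gist code is not copied here; the paths and option values are assumptions): load 2008.csv through the spark-csv data source, then compare a Parquet write with a CSV write.

{code}
// Approximate repro sketch (paths and options are assumptions, not the gist):
// load the 2008.csv airline data with spark-csv, then time Parquet vs CSV writes.
import org.apache.spark.sql.SQLContext

object CsvWriteRepro {
  def timed[T](label: String)(body: => T): T = {
    val start = System.nanoTime()
    val result = body
    println(s"$label took ${(System.nanoTime() - start) / 1e9} s")
    result
  }

  def run(sqlContext: SQLContext): Unit = {
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("2008.csv")

    timed("parquet write") { df.write.mode("overwrite").parquet("out_parquet") } // fast in the report
    timed("csv write") {
      df.write.format("com.databricks.spark.csv").mode("overwrite").save("out_csv") // slow in the report
    }
  }
}
{code}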



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


