[jira] [Created] (SPARK-16704) Union does not work for column with array byte
Ng Jiunn Jye created SPARK-16704: Summary: Union does not work for column with array byte Key: SPARK-16704 URL: https://issues.apache.org/jira/browse/SPARK-16704 Project: Spark Issue Type: Bug Reporter: Ng Jiunn Jye When unioning two queries whose columns have an array-of-bytes (binary) datatype, the Spark query fails with an exception. Example: select binaryColumn from tableA union select binaryColumn from tableB Note that the Spark property "spark.sql.parquet.binaryAsString" is set to true. org.apache.spark.sql.AnalysisException: unresolved operator 'Union; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) ~[iop-spark-client.spark-catalyst_2.11-1.6.0.jar:1.6.0] at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) ~[iop-spark-client.spark-catalyst_2.11-1.6.0.jar:1.6.0] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:203) ~[iop-spark-client.spark-catalyst_2.11-1.6.0.jar:1.6.0] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50) ~[iop-spark-client.spark-catalyst_2.11-1.6.0.jar:1.6.0] at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:105) ~[iop-spark-client.spark-catalyst_2.11-1.6.0.jar:1.6.0] at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:104) ~[iop-spark-client.spark-catalyst_2.11-1.6.0.jar:1.6.0] at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:104) ~[iop-spark-client.spark-catalyst_2.11-1.6.0.jar:1.6.0] at scala.collection.immutable.List.foreach(List.scala:381) ~[org.scala-lang.scala-library-2.11.8.jar:na] at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:104) ~[iop-spark-client.spark-catalyst_2.11-1.6.0.jar:1.6.0] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50) ~[iop-spark-client.spark-catalyst_2.11-1.6.0.jar:1.6.0] at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) ~[iop-spark-client.spark-catalyst_2.11-1.6.0.jar:1.6.0] at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34) ~[iop-spark-client.spark-sql_2.11-1.6.0.jar:1.6.0] at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133) ~[iop-spark-client.spark-sql_2.11-1.6.0.jar:1.6.0] at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52) ~[iop-spark-client.spark-sql_2.11-1.6.0.jar:1.6.0] at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817) ~[iop-spark-client.spark-sql_2.11-1.6.0.jar:1.6.0] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
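For readers who want to reproduce the report above, here is a minimal sketch. It is an approximation only: it assumes Spark 1.6.x with {{sqlContext}} in scope and "spark.sql.parquet.binaryAsString" set to true as in the report, and it writes two small Parquet tables under /tmp as hypothetical stand-ins for the reporter's {{tableA}} and {{tableB}}.

{code}
// Approximate reproduction sketch for SPARK-16704 (assumes Spark 1.6.x, a SQLContext named
// sqlContext, and spark.sql.parquet.binaryAsString=true as in the report). The table and
// column names follow the reporter's example; the generated data is hypothetical.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{BinaryType, StructField, StructType}

val schema = StructType(Seq(StructField("binaryColumn", BinaryType)))
val rows = sqlContext.sparkContext.parallelize(Seq(
  Row("a".getBytes("UTF-8")),
  Row("b".getBytes("UTF-8"))))

// Write two Parquet tables that each expose a single binary column.
sqlContext.createDataFrame(rows, schema).write.parquet("/tmp/spark16704/tableA")
sqlContext.createDataFrame(rows, schema).write.parquet("/tmp/spark16704/tableB")
sqlContext.read.parquet("/tmp/spark16704/tableA").registerTempTable("tableA")
sqlContext.read.parquet("/tmp/spark16704/tableB").registerTempTable("tableB")

// The reporter's equivalent UNION query fails with:
//   org.apache.spark.sql.AnalysisException: unresolved operator 'Union
sqlContext.sql("select binaryColumn from tableA union select binaryColumn from tableB").show()
{code}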
[jira] [Commented] (SPARK-16702) Driver hangs after executors are lost
[ https://issues.apache.org/jira/browse/SPARK-16702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391409#comment-15391409 ] Sean Owen commented on SPARK-16702: --- It all sounds plausible. Have a look at https://issues.apache.org/jira/browse/SPARK-12419 and https://issues.apache.org/jira/browse/SPARK-16533 which may not be exactly the same thing but worth comparing. Anything that simplifies or untangles blocking callbacks would probably be a win, though it's always delicate surgery. Go ahead. > Driver hangs after executors are lost > - > > Key: SPARK-16702 > URL: https://issues.apache.org/jira/browse/SPARK-16702 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Angus Gerry > Attachments: SparkThreadsBlocked.txt > > > It's my first time, please be kind. > I'm still trying to debug this error locally - at this stage I'm pretty > convinced that it's a weird deadlock/livelock problem due to the use of > {{scheduleAtFixedRate}} within {{ExecutorAllocationManager}}. This problem is > possibly tangentially related to the issues discussed in SPARK-1560 around > the use of blocking calls within locks. > h4. Observed Behavior > When running a spark job, and executors are lost, the job occassionally goes > into a state where it makes no progress with tasks. Most commonly it seems > that the issue occurs when executors are preempted by yarn, but I'm not > confident enough to state that it's restricted to just this scenario. > Upon inspecting a thread dump from the driver, the following stack traces > seem noteworthy (a full thread dump is attached): > {noformat:title=Thread 178: spark-dynamic-executor-allocation (TIMED_WAITING)} > sun.misc.Unsafe.park(Native Method) > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226) > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1033) > java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326) > scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) > scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > scala.concurrent.Await$.result(package.scala:190) > org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101) > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77) > org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59) > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:447) > org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1423) > org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359) > org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310) > org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264) > org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223) > 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > java.lang.Thread.run(Thread.java:745) > {noformat} > {noformat:title=Thread 22: dispatcher-event-loop-10 (BLOCKED)} > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.disableExecutor(CoarseGrainedSchedulerBackend.scala:289) > org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:121) > org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:120) > scala.Option.foreach(Option.scala:257) > org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint.onDisconnected(YarnSchedulerBackend.scala:120) > org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:142) >
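To make the suspected failure mode easier to follow, below is a small self-contained Scala sketch of the general pattern discussed above: a task scheduled with {{scheduleAtFixedRate}} blocks waiting for a reply while holding a lock, and the thread that would deliver the reply needs that same lock. It illustrates the deadlock shape only; it is not Spark's ExecutorAllocationManager or scheduler backend code.

{code}
// Illustration only of the "blocking call while holding a lock" shape discussed above;
// this is not Spark's actual ExecutorAllocationManager / CoarseGrainedSchedulerBackend code.
import java.util.concurrent.{Executors, TimeUnit}
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, Promise}

object BlockingInsideLockDemo {
  private val lock = new Object
  private val reply = Promise[Int]()

  def main(args: Array[String]): Unit = {
    val scheduler = Executors.newScheduledThreadPool(1)

    // Periodic task (like the dynamic-allocation thread): takes the lock, then blocks
    // waiting for a reply, roughly analogous to a synchronous RPC with no usable timeout.
    scheduler.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = lock.synchronized {
        Await.result(reply.future, Duration.Inf) // blocks while still holding `lock`
      }
    }, 0, 100, TimeUnit.MILLISECONDS)

    Thread.sleep(200) // let the periodic task grab the lock first

    // Event-loop-style thread: must take the same lock before it can complete the reply.
    val responder = new Thread(new Runnable {
      override def run(): Unit = lock.synchronized { reply.success(42) }
    })
    responder.start()
    responder.join(2000)
    println(s"responder still blocked after 2s: ${responder.isAlive}") // prints true
    scheduler.shutdownNow()
    sys.exit(0)
  }
}
{code}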
[jira] [Assigned] (SPARK-16703) Extra space in WindowSpecDefinition SQL representation
[ https://issues.apache.org/jira/browse/SPARK-16703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16703: Assignee: Apache Spark (was: Cheng Lian) > Extra space in WindowSpecDefinition SQL representation > -- > > Key: SPARK-16703 > URL: https://issues.apache.org/jira/browse/SPARK-16703 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Apache Spark >Priority: Minor > > For a {{WindowSpecDefinition}} whose {{partitionSpec}} is empty, there's an > extra space in its SQL representation: > {code:sql} > ( ORDER BY `a` ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
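For context, a short sketch of how a window specification with an empty {{partitionSpec}} arises in user code: an ORDER BY window with no PARTITION BY. It assumes a Spark 2.0 SparkSession and a DataFrame {{df}} with a column {{a}}; it only illustrates the kind of expression whose generated SQL text is quoted above.

{code}
// Sketch (assumes a DataFrame `df` with a column `a`): a window spec that orders but does
// not partition, i.e. the empty-partitionSpec case whose SQL representation is quoted above.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Long.MinValue / 0L correspond to UNBOUNDED PRECEDING / CURRENT ROW in the frame clause.
val w = Window.orderBy("a").rowsBetween(Long.MinValue, 0L)
val withRunningCount = df.select(col("a"), count(lit(1)).over(w).as("running_count"))
withRunningCount.explain() // the window spec here carries no PARTITION BY clause
{code}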
[jira] [Commented] (SPARK-16703) Extra space in WindowSpecDefinition SQL representation
[ https://issues.apache.org/jira/browse/SPARK-16703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391370#comment-15391370 ] Apache Spark commented on SPARK-16703: -- User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/14334 > Extra space in WindowSpecDefinition SQL representation > -- > > Key: SPARK-16703 > URL: https://issues.apache.org/jira/browse/SPARK-16703 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Minor > > For a {{WindowSpecDefinition}} whose {{partitionSpec}} is empty, there's an > extra space in its SQL representation: > {code:sql} > ( ORDER BY `a` ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16703) Extra space in WindowSpecDefinition SQL representation
[ https://issues.apache.org/jira/browse/SPARK-16703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16703: Assignee: Cheng Lian (was: Apache Spark) > Extra space in WindowSpecDefinition SQL representation > -- > > Key: SPARK-16703 > URL: https://issues.apache.org/jira/browse/SPARK-16703 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Minor > > For a {{WindowSpecDefinition}} whose {{partitionSpec}} is empty, there's an > extra space in its SQL representation: > {code:sql} > ( ORDER BY `a` ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16703) Extra space in WindowSpecDefinition SQL representation
[ https://issues.apache.org/jira/browse/SPARK-16703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-16703: --- Description: For a {{WindowSpecDefinition}} whose {{partitionSpec}} is empty, there's an extra space in its SQL representation: {code:sql} {code} was: For a {{WindowSpecDefinition}} whose {{partitionSpec}} is empty, there's an extra space in its SQL representation: {code:sql} {code} > Extra space in WindowSpecDefinition SQL representation > -- > > Key: SPARK-16703 > URL: https://issues.apache.org/jira/browse/SPARK-16703 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Minor > > For a {{WindowSpecDefinition}} whose {{partitionSpec}} is empty, there's an > extra space in its SQL representation: > {code:sql} > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16703) Extra space in WindowSpecDefinition SQL representation
[ https://issues.apache.org/jira/browse/SPARK-16703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-16703: --- Description: For a {{WindowSpecDefinition}} whose {{partitionSpec}} is empty, there's an extra space in its SQL representation: {code:sql} ( ORDER BY `a` ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) {code} was: For a {{WindowSpecDefinition}} whose {{partitionSpec}} is empty, there's an extra space in its SQL representation: {code:sql} {code} > Extra space in WindowSpecDefinition SQL representation > -- > > Key: SPARK-16703 > URL: https://issues.apache.org/jira/browse/SPARK-16703 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Minor > > For a {{WindowSpecDefinition}} whose {{partitionSpec}} is empty, there's an > extra space in its SQL representation: > {code:sql} > ( ORDER BY `a` ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16703) Extra space in WindowSpecDefinition SQL representation
Cheng Lian created SPARK-16703: -- Summary: Extra space in WindowSpecDefinition SQL representation Key: SPARK-16703 URL: https://issues.apache.org/jira/browse/SPARK-16703 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Minor For a {{WindowSpecDefinition}} whose {{partitionSpec}} is empty, there's an extra space in its SQL representation: {code:sql} {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16685) audit release docs are ambiguous
[ https://issues.apache.org/jira/browse/SPARK-16685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16685: Assignee: Apache Spark > audit release docs are ambiguous > > > Key: SPARK-16685 > URL: https://issues.apache.org/jira/browse/SPARK-16685 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 1.6.2 >Reporter: jay vyas >Assignee: Apache Spark >Priority: Minor > > The dev/audit-release tooling is ambiguous. > - should it run against a real cluster? if so when? > - what should be in the release repo? Just jars? tarballs? ( i assume jars > because its a .ivy, but not sure). > - > https://github.com/apache/spark/tree/master/dev/audit-release -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16685) audit release docs are ambiguous
[ https://issues.apache.org/jira/browse/SPARK-16685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391361#comment-15391361 ] Apache Spark commented on SPARK-16685: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/14342 > audit release docs are ambiguous > > > Key: SPARK-16685 > URL: https://issues.apache.org/jira/browse/SPARK-16685 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 1.6.2 >Reporter: jay vyas >Priority: Minor > > The dev/audit-release tooling is ambiguous. > - should it run against a real cluster? if so when? > - what should be in the release repo? Just jars? tarballs? ( i assume jars > because its a .ivy, but not sure). > - > https://github.com/apache/spark/tree/master/dev/audit-release -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16685) audit release docs are ambiguous
[ https://issues.apache.org/jira/browse/SPARK-16685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16685: Assignee: (was: Apache Spark) > audit release docs are ambiguous > > > Key: SPARK-16685 > URL: https://issues.apache.org/jira/browse/SPARK-16685 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 1.6.2 >Reporter: jay vyas >Priority: Minor > > The dev/audit-release tooling is ambiguous. > - should it run against a real cluster? if so when? > - what should be in the release repo? Just jars? tarballs? ( i assume jars > because its a .ivy, but not sure). > - > https://github.com/apache/spark/tree/master/dev/audit-release -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16699) Fix performance bug in hash aggregate on long string keys
[ https://issues.apache.org/jira/browse/SPARK-16699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-16699: Description: In the following code in `VectorizedHashMapGenerator.scala`: {code} def hashBytes(b: String): String = { val hash = ctx.freshName("hash") s""" |int $result = 0; |for (int i = 0; i < $b.length; i++) { | ${genComputeHash(ctx, s"$b[i]", ByteType, hash)} | $result = ($result ^ (0x9e3779b9)) + $hash + ($result << 6) + ($result >>> 2); |} """.stripMargin } {code} when b=input.getBytes(), the current 2.0 code results in getBytes() being called n times, n being length of input. getBytes() involves memory copy is thus expensive and causes a performance degradation. Fix is to evaluate getBytes() before the for loop. was: In the following code in `VectorizedHashMapGenerator.scala`: ``` def hashBytes(b: String): String = { val hash = ctx.freshName("hash") s""" |int $result = 0; |for (int i = 0; i < $b.length; i++) { | ${genComputeHash(ctx, s"$b[i]", ByteType, hash)} | $result = ($result ^ (0x9e3779b9)) + $hash + ($result << 6) + ($result >>> 2); |} """.stripMargin } ``` when b=input.getBytes(), the current 2.0 code results in getBytes() being called n times, n being length of input. getBytes() involves memory copy is thus expensive and causes a performance degradation. Fix is to evaluate getBytes() before the for loop. > Fix performance bug in hash aggregate on long string keys > - > > Key: SPARK-16699 > URL: https://issues.apache.org/jira/browse/SPARK-16699 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Qifan Pu >Assignee: Qifan Pu > Fix For: 2.0.1, 2.1.0 > > > In the following code in `VectorizedHashMapGenerator.scala`: > {code} > def hashBytes(b: String): String = { > val hash = ctx.freshName("hash") > s""" > |int $result = 0; > |for (int i = 0; i < $b.length; i++) { > | ${genComputeHash(ctx, s"$b[i]", ByteType, hash)} > | $result = ($result ^ (0x9e3779b9)) + $hash + ($result << 6) + > ($result >>> 2); > |} >""".stripMargin > } > {code} > when b=input.getBytes(), the current 2.0 code results in getBytes() being > called n times, n being length of input. getBytes() involves memory copy is > thus expensive and causes a performance degradation. > Fix is to evaluate getBytes() before the for loop. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
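A hedged sketch of the fix described above: bind the byte array to a local variable in the generated code so that the expression behind {{b}} (for example {{input.getBytes()}}) is evaluated once rather than on every loop iteration. The variable handling here is illustrative and may differ from the merged patch.

{code}
// Illustrative only (may differ from the merged patch): hoist the byte-array expression out
// of the generated loop so it is evaluated once instead of once per iteration.
def hashBytes(b: String): String = {
  val hash = ctx.freshName("hash")
  val bytes = ctx.freshName("bytes")
  s"""
     |int $result = 0;
     |byte[] $bytes = $b;
     |for (int i = 0; i < $bytes.length; i++) {
     |  ${genComputeHash(ctx, s"$bytes[i]", ByteType, hash)}
     |  $result = ($result ^ (0x9e3779b9)) + $hash + ($result << 6) + ($result >>> 2);
     |}
   """.stripMargin
}
{code}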
[jira] [Resolved] (SPARK-16699) Fix performance bug in hash aggregate on long string keys
[ https://issues.apache.org/jira/browse/SPARK-16699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-16699. - Resolution: Fixed Assignee: Qifan Pu Fix Version/s: (was: 2.0.0) 2.1.0 2.0.1 > Fix performance bug in hash aggregate on long string keys > - > > Key: SPARK-16699 > URL: https://issues.apache.org/jira/browse/SPARK-16699 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Qifan Pu >Assignee: Qifan Pu > Fix For: 2.0.1, 2.1.0 > > > In the following code in `VectorizedHashMapGenerator.scala`: > ``` > def hashBytes(b: String): String = { > val hash = ctx.freshName("hash") > s""" > |int $result = 0; > |for (int i = 0; i < $b.length; i++) { > | ${genComputeHash(ctx, s"$b[i]", ByteType, hash)} > | $result = ($result ^ (0x9e3779b9)) + $hash + ($result << 6) + > ($result >>> 2); > |} >""".stripMargin > } > ``` > when b=input.getBytes(), the current 2.0 code results in getBytes() being > called n times, n being length of input. getBytes() involves memory copy is > thus expensive and causes a performance degradation. > Fix is to evaluate getBytes() before the for loop. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16702) Driver hangs after executors are lost
[ https://issues.apache.org/jira/browse/SPARK-16702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Angus Gerry updated SPARK-16702: Attachment: SparkThreadsBlocked.txt > Driver hangs after executors are lost > - > > Key: SPARK-16702 > URL: https://issues.apache.org/jira/browse/SPARK-16702 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Angus Gerry > Attachments: SparkThreadsBlocked.txt > > > It's my first time, please be kind. > I'm still trying to debug this error locally - at this stage I'm pretty > convinced that it's a weird deadlock/livelock problem due to the use of > {{scheduleAtFixedRate}} within {{ExecutorAllocationManager}}. This problem is > possibly tangentially related to the issues discussed in SPARK-1560 around > the use of blocking calls within locks. > h4. Observed Behavior > When running a spark job, and executors are lost, the job occassionally goes > into a state where it makes no progress with tasks. Most commonly it seems > that the issue occurs when executors are preempted by yarn, but I'm not > confident enough to state that it's restricted to just this scenario. > Upon inspecting a thread dump from the driver, the following stack traces > seem noteworthy (a full thread dump is attached): > {noformat:title=Thread 178: spark-dynamic-executor-allocation (TIMED_WAITING)} > sun.misc.Unsafe.park(Native Method) > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226) > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1033) > java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326) > scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) > scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > scala.concurrent.Await$.result(package.scala:190) > org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101) > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77) > org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59) > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:447) > org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1423) > org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359) > org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310) > org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264) > org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223) > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > java.lang.Thread.run(Thread.java:745) > {noformat} > {noformat:title=Thread 22: dispatcher-event-loop-10 (BLOCKED)} > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.disableExecutor(CoarseGrainedSchedulerBackend.scala:289) > org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:121) > org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:120) > scala.Option.foreach(Option.scala:257) > org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint.onDisconnected(YarnSchedulerBackend.scala:120) > org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:142) > org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204) > org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) > org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215) > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
[jira] [Created] (SPARK-16702) Driver hangs after executors are lost
Angus Gerry created SPARK-16702: --- Summary: Driver hangs after executors are lost Key: SPARK-16702 URL: https://issues.apache.org/jira/browse/SPARK-16702 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.0, 1.6.2, 1.6.1 Reporter: Angus Gerry It's my first time, please be kind. I'm still trying to debug this error locally - at this stage I'm pretty convinced that it's a weird deadlock/livelock problem due to the use of {{scheduleAtFixedRate}} within {{ExecutorAllocationManager}}. This problem is possibly tangentially related to the issues discussed in SPARK-1560 around the use of blocking calls within locks. h4. Observed Behavior When running a spark job, and executors are lost, the job occassionally goes into a state where it makes no progress with tasks. Most commonly it seems that the issue occurs when executors are preempted by yarn, but I'm not confident enough to state that it's restricted to just this scenario. Upon inspecting a thread dump from the driver, the following stack traces seem noteworthy (a full thread dump is attached): {noformat:title=Thread 178: spark-dynamic-executor-allocation (TIMED_WAITING)} sun.misc.Unsafe.park(Native Method) java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226) java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1033) java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326) scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) scala.concurrent.Await$.result(package.scala:190) org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101) org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77) org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59) org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:447) org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1423) org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359) org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310) org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264) org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223) java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) {noformat} {noformat:title=Thread 22: dispatcher-event-loop-10 (BLOCKED)} 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.disableExecutor(CoarseGrainedSchedulerBackend.scala:289) org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:121) org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:120) scala.Option.foreach(Option.scala:257) org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint.onDisconnected(YarnSchedulerBackend.scala:120) org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:142) org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204) org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) {noformat} {noformat:title=Thread 640: kill-executor-thread (BLOCKED)} org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.killExecutors(CoarseGrainedSchedulerBackend.scala:488) org.apache.spark.SparkContext.killAndReplaceExecutor(SparkContext.scala:1499) org.apache.spark.HeartbeatRec
[jira] [Commented] (SPARK-16534) Kafka 0.10 Python support
[ https://issues.apache.org/jira/browse/SPARK-16534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391274#comment-15391274 ] Apache Spark commented on SPARK-16534: -- User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/14340 > Kafka 0.10 Python support > - > > Key: SPARK-16534 > URL: https://issues.apache.org/jira/browse/SPARK-16534 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Tathagata Das > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16534) Kafka 0.10 Python support
[ https://issues.apache.org/jira/browse/SPARK-16534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16534: Assignee: (was: Apache Spark) > Kafka 0.10 Python support > - > > Key: SPARK-16534 > URL: https://issues.apache.org/jira/browse/SPARK-16534 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Tathagata Das > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16534) Kafka 0.10 Python support
[ https://issues.apache.org/jira/browse/SPARK-16534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16534: Assignee: Apache Spark > Kafka 0.10 Python support > - > > Key: SPARK-16534 > URL: https://issues.apache.org/jira/browse/SPARK-16534 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Tathagata Das >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5581) When writing sorted map output file, avoid open / close between each partition
[ https://issues.apache.org/jira/browse/SPARK-5581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5581: -- Assignee: Josh Rosen > When writing sorted map output file, avoid open / close between each partition > -- > > Key: SPARK-5581 > URL: https://issues.apache.org/jira/browse/SPARK-5581 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 1.3.0 >Reporter: Sandy Ryza >Assignee: Josh Rosen > Fix For: 2.1.0 > > > {code} > // Bypassing merge-sort; get an iterator by partition and just write > everything directly. > for ((id, elements) <- this.partitionedIterator) { > if (elements.hasNext) { > val writer = blockManager.getDiskWriter( > blockId, outputFile, ser, fileBufferSize, > context.taskMetrics.shuffleWriteMetrics.get) > for (elem <- elements) { > writer.write(elem) > } > writer.commitAndClose() > val segment = writer.fileSegment() > lengths(id) = segment.length > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5581) When writing sorted map output file, avoid open / close between each partition
[ https://issues.apache.org/jira/browse/SPARK-5581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5581: -- Assignee: Brian Cho (was: Josh Rosen) > When writing sorted map output file, avoid open / close between each partition > -- > > Key: SPARK-5581 > URL: https://issues.apache.org/jira/browse/SPARK-5581 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 1.3.0 >Reporter: Sandy Ryza >Assignee: Brian Cho > Fix For: 2.1.0 > > > {code} > // Bypassing merge-sort; get an iterator by partition and just write > everything directly. > for ((id, elements) <- this.partitionedIterator) { > if (elements.hasNext) { > val writer = blockManager.getDiskWriter( > blockId, outputFile, ser, fileBufferSize, > context.taskMetrics.shuffleWriteMetrics.get) > for (elem <- elements) { > writer.write(elem) > } > writer.commitAndClose() > val segment = writer.fileSegment() > lengths(id) = segment.length > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16603) Spark2.0 fail in executing the sql statement which field name begins with number,like "d.30_day_loss_user" while spark1.6 supports
[ https://issues.apache.org/jira/browse/SPARK-16603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391268#comment-15391268 ] keliang edited comment on SPARK-16603 at 7/25/16 2:46 AM: -- Hi, I test this feature with spark-2.0.1-snapshot: first. create testing table tsp with column name [10_user_age: Int, 20_user_addr: String], and success == CREATE TABLE `tsp`(`10_user_age` int, `20_user_addr` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ( 'serialization.format' = '1' ) STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' TBLPROPERTIES ( 'transient_lastDdlTime' = '1469418111' ) second, query tsp with "select * from tsp where tsp.20_user_addr <10". get error: Error in query: mismatched input '.20' expecting {, '.', '[', 'GROUP', 'ORDER', 'HAVING', 'LIMIT', 'OR', 'AND', 'IN', NOT, 'BETWEEN', 'LIKE', RLIKE, 'IS', 'WINDOW', 'UNION', 'EXCEPT', 'INTERSECT', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '^', 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 1, pos 27) == SQL == select * from tsp where tsp.20_user_addr <10 ---^^^ Question is Why using Digit prefix in the column name to create table passed the validation rules, but failed when query table ? self-contradiction ? was (Author: biglobster): Hi, I test this feature with spark-2.0.1-snapshot: first. create testing table tsp with column name [10_user_age: Int, 20_user_addr: String], and success == CREATE TABLE `tsp`(`10_user_age` int, `20_user_addr` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ( 'serialization.format' = '1' ) STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' TBLPROPERTIES ( 'transient_lastDdlTime' = '1469418111' ) second, query tsp with "select * from tsp where tsp.20_user_addr <10". get error: Error in query: mismatched input '.20' expecting {, '.', '[', 'GROUP', 'ORDER', 'HAVING', 'LIMIT', 'OR', 'AND', 'IN', NOT, 'BETWEEN', 'LIKE', RLIKE, 'IS', 'WINDOW', 'UNION', 'EXCEPT', 'INTERSECT', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '^', 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 1, pos 27) == SQL == select * from tsp where tsp.20_user_addr <10 ---^^^ Question is Why using Digit prefix in the column name to create table passed the validation rules, but failed when query table ? > Spark2.0 fail in executing the sql statement which field name begins with > number,like "d.30_day_loss_user" while spark1.6 supports > -- > > Key: SPARK-16603 > URL: https://issues.apache.org/jira/browse/SPARK-16603 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: marymwu >Priority: Minor > > Spark2.0 fail in executing the sql statement which field name begins with > number,like "d.30_day_loss_user" while spark1.6 supports > Error: org.apache.spark.sql.catalyst.parser.ParseException: mismatched input > '.30' expecting > {')', ','} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16603) Spark2.0 fail in executing the sql statement which field name begins with number,like "d.30_day_loss_user" while spark1.6 supports
[ https://issues.apache.org/jira/browse/SPARK-16603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391268#comment-15391268 ] keliang commented on SPARK-16603: - Hi, I test this feature with spark-2.0.1-snapshot: first. create testing table tsp with column name [10_user_age: Int, 20_user_addr: String], and success == CREATE TABLE `tsp`(`10_user_age` int, `20_user_addr` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ( 'serialization.format' = '1' ) STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' TBLPROPERTIES ( 'transient_lastDdlTime' = '1469418111' ) second, query tsp with "select * from tsp where tsp.20_user_addr <10". get error: Error in query: mismatched input '.20' expecting {, '.', '[', 'GROUP', 'ORDER', 'HAVING', 'LIMIT', 'OR', 'AND', 'IN', NOT, 'BETWEEN', 'LIKE', RLIKE, 'IS', 'WINDOW', 'UNION', 'EXCEPT', 'INTERSECT', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '^', 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 1, pos 27) == SQL == select * from tsp where tsp.20_user_addr <10 ---^^^ Question is Why using Digit prefix in the column name to create table passed the validation rules, but failed when query table ? > Spark2.0 fail in executing the sql statement which field name begins with > number,like "d.30_day_loss_user" while spark1.6 supports > -- > > Key: SPARK-16603 > URL: https://issues.apache.org/jira/browse/SPARK-16603 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: marymwu >Priority: Minor > > Spark2.0 fail in executing the sql statement which field name begins with > number,like "d.30_day_loss_user" while spark1.6 supports > Error: org.apache.spark.sql.catalyst.parser.ParseException: mismatched input > '.30' expecting > {')', ','} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
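One hedged note on the comment above: Spark SQL generally requires backtick quoting for identifiers that do not start with a letter or underscore, which is why the unquoted reference fails to parse while the CREATE TABLE statement (whose column names were already backtick-quoted) succeeds. Whether quoting fully restores the 1.6 behaviour for this report is not confirmed here. A sketch, assuming a SparkSession {{spark}} and the {{tsp}} table from the comment:

{code}
// Hedged workaround sketch (assumes a SparkSession `spark` and the `tsp` table from the
// comment above). Backtick-quoting the digit-prefixed column lets the parser accept the
// reference; whether this fully matches the 1.6 behaviour is not confirmed in this issue.
val result = spark.sql("SELECT * FROM tsp WHERE tsp.`20_user_addr` < 10")
result.show()
{code}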
[jira] [Resolved] (SPARK-5581) When writing sorted map output file, avoid open / close between each partition
[ https://issues.apache.org/jira/browse/SPARK-5581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-5581. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 13382 [https://github.com/apache/spark/pull/13382] > When writing sorted map output file, avoid open / close between each partition > -- > > Key: SPARK-5581 > URL: https://issues.apache.org/jira/browse/SPARK-5581 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 1.3.0 >Reporter: Sandy Ryza > Fix For: 2.1.0 > > > {code} > // Bypassing merge-sort; get an iterator by partition and just write > everything directly. > for ((id, elements) <- this.partitionedIterator) { > if (elements.hasNext) { > val writer = blockManager.getDiskWriter( > blockId, outputFile, ser, fileBufferSize, > context.taskMetrics.shuffleWriteMetrics.get) > for (elem <- elements) { > writer.write(elem) > } > writer.commitAndClose() > val segment = writer.fileSegment() > lengths(id) = segment.length > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
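For context on what "avoid open / close between each partition" means in practice, here is an illustrative sketch of the single-open idea using plain Java streams: one output file handle serves every partition, and per-partition segment lengths are recorded from the bytes written. The real change goes through the {{blockManager.getDiskWriter}} writer shown in the quoted snippet; this sketch is not Spark's internal API.

{code}
// Illustrative only (not Spark's internal API): stream all partitions through one writer and
// record each partition's segment length, instead of opening/closing a writer per partition.
import java.io.{BufferedOutputStream, File, FileOutputStream}

def writeAllPartitions(outputFile: File,
                       partitions: Iterator[(Int, Iterator[Array[Byte]])],
                       lengths: Array[Long]): Unit = {
  val out = new BufferedOutputStream(new FileOutputStream(outputFile))
  try {
    for ((id, elements) <- partitions) {
      var written = 0L
      for (elem <- elements) {
        out.write(elem)
        written += elem.length
      }
      lengths(id) = written // segment length for this partition; no close/reopen in between
    }
  } finally {
    out.close()
  }
}
{code}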
[jira] [Assigned] (SPARK-16698) json parsing regression - "." in keys
[ https://issues.apache.org/jira/browse/SPARK-16698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16698: Assignee: Apache Spark > json parsing regression - "." in keys > - > > Key: SPARK-16698 > URL: https://issues.apache.org/jira/browse/SPARK-16698 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: TobiasP >Assignee: Apache Spark > > The commit 83775bc78e183791f75a99cdfbcd68a67ca0d472 "\[SPARK-14158]\[SQL] > implement buildReader for json data source" breaks parsing of json files with > "." in keys. > E.g. the test input for spark-solr > https://github.com/lucidworks/spark-solr/blob/master/src/test/resources/test-data/events.json > {noformat} > scala> > sqlContext.read.json("src/test/resources/test-data/events.json").collectAsList > org.apache.spark.sql.AnalysisException: Unable to resolve params.title_s > given [_version_, count_l, doc_id_s, flag_s, id, params.title_s, > params.url_s, session_id_s, timestamp_tdt, type_s, tz_timestamp_txt, > user_id_s]; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:131) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:131) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:130) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:126) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) > at org.apache.spark.sql.types.StructType.map(StructType.scala:94) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:126) > at > org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:80) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54) > at > org.apache.spark.sql.execution.SparkStrategies$SpecialLimits$.apply(SparkStrategies.scala:53) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:52) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:50) > at > 
org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:57) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:57) > at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2321) > at org.apache.spark.sql.Dataset.collectAsList(Dataset.scala:2040) > ... 49 elided > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16698) json parsing regression - "." in keys
[ https://issues.apache.org/jira/browse/SPARK-16698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16698: Assignee: (was: Apache Spark) > json parsing regression - "." in keys > - > > Key: SPARK-16698 > URL: https://issues.apache.org/jira/browse/SPARK-16698 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: TobiasP > > The commit 83775bc78e183791f75a99cdfbcd68a67ca0d472 "\[SPARK-14158]\[SQL] > implement buildReader for json data source" breaks parsing of json files with > "." in keys. > E.g. the test input for spark-solr > https://github.com/lucidworks/spark-solr/blob/master/src/test/resources/test-data/events.json > {noformat} > scala> > sqlContext.read.json("src/test/resources/test-data/events.json").collectAsList > org.apache.spark.sql.AnalysisException: Unable to resolve params.title_s > given [_version_, count_l, doc_id_s, flag_s, id, params.title_s, > params.url_s, session_id_s, timestamp_tdt, type_s, tz_timestamp_txt, > user_id_s]; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:131) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:131) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:130) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:126) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) > at org.apache.spark.sql.types.StructType.map(StructType.scala:94) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:126) > at > org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:80) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54) > at > org.apache.spark.sql.execution.SparkStrategies$SpecialLimits$.apply(SparkStrategies.scala:53) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:52) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:50) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:57) > at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:57) > at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2321) > at org.apache.spark.sql.Dataset.collectAsList(Dataset.scala:2040) > ... 49 elided > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16698) json parsing regression - "." in keys
[ https://issues.apache.org/jira/browse/SPARK-16698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391264#comment-15391264 ] Apache Spark commented on SPARK-16698: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/14339 > json parsing regression - "." in keys > - > > Key: SPARK-16698 > URL: https://issues.apache.org/jira/browse/SPARK-16698 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: TobiasP > > The commit 83775bc78e183791f75a99cdfbcd68a67ca0d472 "\[SPARK-14158]\[SQL] > implement buildReader for json data source" breaks parsing of json files with > "." in keys. > E.g. the test input for spark-solr > https://github.com/lucidworks/spark-solr/blob/master/src/test/resources/test-data/events.json > {noformat} > scala> > sqlContext.read.json("src/test/resources/test-data/events.json").collectAsList > org.apache.spark.sql.AnalysisException: Unable to resolve params.title_s > given [_version_, count_l, doc_id_s, flag_s, id, params.title_s, > params.url_s, session_id_s, timestamp_tdt, type_s, tz_timestamp_txt, > user_id_s]; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:131) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:131) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:130) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:126) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) > at org.apache.spark.sql.types.StructType.map(StructType.scala:94) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:126) > at > org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:80) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54) > at > org.apache.spark.sql.execution.SparkStrategies$SpecialLimits$.apply(SparkStrategies.scala:53) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:52) > at > 
org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:50) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:57) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:57) > at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2321) > at org.apache.spark.sql.Dataset.collectAsList(Dataset.scala:2040) > ... 49 elided > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
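A hedged, self-contained sketch for reproducing the shape of this regression without the spark-solr test data. It assumes Spark 2.0.0 with a SparkSession named {{spark}} and write access to /tmp; the JSON content is made up, the key point being a top-level field whose name contains a dot. As a general note, such columns are normally referenced with backtick quoting, e.g. {{df.select("`params.title_s`")}}; whether that sidesteps this particular failure is not stated in the report.

{code}
// Minimal reproduction sketch (assumes Spark 2.0.0, a SparkSession `spark`, and write access
// to /tmp). The JSON line is a made-up stand-in for the spark-solr events.json test data.
import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._

val path = "/tmp/spark16698/dotted-keys.json"
Files.createDirectories(Paths.get("/tmp/spark16698"))
Files.write(Paths.get(path), Seq("""{"id": "1", "params.title_s": "home"}""").asJava)

val df = spark.read.json(path)
df.printSchema()
// Reported to fail on 2.0.0 with:
//   org.apache.spark.sql.AnalysisException: Unable to resolve params.title_s given [...]
df.collectAsList()
{code}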
[jira] [Assigned] (SPARK-16701) Make parameters configurable in BlockManager
[ https://issues.apache.org/jira/browse/SPARK-16701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16701: Assignee: (was: Apache Spark) > Make parameters configurable in BlockManager > > > Key: SPARK-16701 > URL: https://issues.apache.org/jira/browse/SPARK-16701 > Project: Spark > Issue Type: Improvement >Reporter: YangyangLiu >Priority: Minor > > Make parameters configurable in BlockManager class, such as max_attempts and > sleep_time -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16701) Make parameters configurable in BlockManager
[ https://issues.apache.org/jira/browse/SPARK-16701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391247#comment-15391247 ] Apache Spark commented on SPARK-16701: -- User 'lovexi' has created a pull request for this issue: https://github.com/apache/spark/pull/14338 > Make parameters configurable in BlockManager > > > Key: SPARK-16701 > URL: https://issues.apache.org/jira/browse/SPARK-16701 > Project: Spark > Issue Type: Improvement >Reporter: YangyangLiu >Priority: Minor > > Make parameters configurable in BlockManager class, such as max_attempts and > sleep_time -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16701) Make parameters configurable in BlockManager
[ https://issues.apache.org/jira/browse/SPARK-16701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16701: Assignee: Apache Spark > Make parameters configurable in BlockManager > > > Key: SPARK-16701 > URL: https://issues.apache.org/jira/browse/SPARK-16701 > Project: Spark > Issue Type: Improvement >Reporter: YangyangLiu >Assignee: Apache Spark >Priority: Minor > > Make parameters configurable in BlockManager class, such as max_attempts and > sleep_time -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16701) Make parameters configurable in BlockManager
YangyangLiu created SPARK-16701: --- Summary: Make parameters configurable in BlockManager Key: SPARK-16701 URL: https://issues.apache.org/jira/browse/SPARK-16701 Project: Spark Issue Type: Improvement Reporter: YangyangLiu Priority: Minor Make parameters configurable in BlockManager class, such as max_attempts and sleep_time -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
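As a rough illustration of what the requested change could look like, the hard-coded constants would be read from the SparkConf that BlockManager already holds. The configuration key names and default values below are hypothetical, made up for illustration, and are not actual Spark properties.
{code}
// Hypothetical sketch only -- key names and defaults are invented; `conf` is assumed to be
// the SparkConf instance available inside BlockManager.
private val maxAttempts: Int =
  conf.getInt("spark.blockManager.retry.maxAttempts", 10)          // previously a hard-coded constant
private val sleepTimeMs: Long =
  conf.getTimeAsMs("spark.blockManager.retry.sleepInterval", "5s") // previously a hard-coded constant
{code}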
[jira] [Updated] (SPARK-16700) StructType doesn't accept Python dicts anymore
[ https://issues.apache.org/jira/browse/SPARK-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sylvain Zimmer updated SPARK-16700: --- Component/s: (was: Spark Core) > StructType doesn't accept Python dicts anymore > -- > > Key: SPARK-16700 > URL: https://issues.apache.org/jira/browse/SPARK-16700 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Sylvain Zimmer > > Hello, > I found this issue while testing my codebase with 2.0.0-rc5 > StructType in Spark 1.6.2 accepts the Python type, which is very > handy. 2.0.0-rc5 does not and throws an error. > I don't know if this was intended but I'd advocate for this behaviour to > remain the same. MapType is probably wasteful when your key names never > change and switching to Python tuples would be cumbersome. > Here is a minimal script to reproduce the issue: > {code} > from pyspark import SparkContext > from pyspark.sql import types as SparkTypes > from pyspark.sql import SQLContext > sc = SparkContext() > sqlc = SQLContext(sc) > struct_schema = SparkTypes.StructType([ > SparkTypes.StructField("id", SparkTypes.LongType()) > ]) > rdd = sc.parallelize([{"id": 0}, {"id": 1}]) > df = sqlc.createDataFrame(rdd, struct_schema) > print df.collect() > # 1.6.2 prints [Row(id=0), Row(id=1)] > # 2.0.0-rc5 raises TypeError: StructType can not accept object {'id': 0} in > type > {code} > Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16700) StructType doesn't accept Python dicts anymore
[ https://issues.apache.org/jira/browse/SPARK-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sylvain Zimmer updated SPARK-16700: --- Description: Hello, I found this issue while testing my codebase with 2.0.0-rc5 StructType in Spark 1.6.2 accepts the Python type, which is very handy. 2.0.0-rc5 does not and throws an error. I don't know if this was intended but I'd advocate for this behaviour to remain the same. MapType is probably wasteful when your key names never change and switching to Python tuples would be cumbersome. Here is a minimal script to reproduce the issue: {code} from pyspark import SparkContext from pyspark.sql import types as SparkTypes from pyspark.sql import SQLContext sc = SparkContext() sqlc = SQLContext(sc) struct_schema = SparkTypes.StructType([ SparkTypes.StructField("id", SparkTypes.LongType()) ]) rdd = sc.parallelize([{"id": 0}, {"id": 1}]) df = sqlc.createDataFrame(rdd, struct_schema) print df.collect() # 1.6.2 prints [Row(id=0), Row(id=1)] # 2.0.0-rc5 raises TypeError: StructType can not accept object {'id': 0} in type {code} Thanks! was: Hello, I found this issue while testing my codebase with 2.0.0-rc5 StructType in Spark 1.6.2 accepts the Python type, which is very handy. 2.0.0-rc5 does not and throws an error. I don't know if this was intended but I'd advocate for this behaviour to remain the same. MapType is probably wasteful when your key names never change and switching to Python tuples would be cumbersome. Here is a minimal script to reproduce the issue: {code:python} from pyspark import SparkContext from pyspark.sql import types as SparkTypes from pyspark.sql import SQLContext sc = SparkContext() sqlc = SQLContext(sc) struct_schema = SparkTypes.StructType([ SparkTypes.StructField("id", SparkTypes.LongType()) ]) rdd = sc.parallelize([{"id": 0}, {"id": 1}]) df = sqlc.createDataFrame(rdd, struct_schema) print df.collect() # 1.6.2 prints [Row(id=0), Row(id=1)] # 2.0.0-rc5 raises TypeError: StructType can not accept object {'id': 0} in type {code} Thanks! > StructType doesn't accept Python dicts anymore > -- > > Key: SPARK-16700 > URL: https://issues.apache.org/jira/browse/SPARK-16700 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.0.0 >Reporter: Sylvain Zimmer > > Hello, > I found this issue while testing my codebase with 2.0.0-rc5 > StructType in Spark 1.6.2 accepts the Python type, which is very > handy. 2.0.0-rc5 does not and throws an error. > I don't know if this was intended but I'd advocate for this behaviour to > remain the same. MapType is probably wasteful when your key names never > change and switching to Python tuples would be cumbersome. > Here is a minimal script to reproduce the issue: > {code} > from pyspark import SparkContext > from pyspark.sql import types as SparkTypes > from pyspark.sql import SQLContext > sc = SparkContext() > sqlc = SQLContext(sc) > struct_schema = SparkTypes.StructType([ > SparkTypes.StructField("id", SparkTypes.LongType()) > ]) > rdd = sc.parallelize([{"id": 0}, {"id": 1}]) > df = sqlc.createDataFrame(rdd, struct_schema) > print df.collect() > # 1.6.2 prints [Row(id=0), Row(id=1)] > # 2.0.0-rc5 raises TypeError: StructType can not accept object {'id': 0} in > type > {code} > Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16700) StructType doesn't accept Python dicts anymore
[ https://issues.apache.org/jira/browse/SPARK-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sylvain Zimmer updated SPARK-16700: --- Component/s: PySpark > StructType doesn't accept Python dicts anymore > -- > > Key: SPARK-16700 > URL: https://issues.apache.org/jira/browse/SPARK-16700 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.0.0 >Reporter: Sylvain Zimmer > > Hello, > I found this issue while testing my codebase with 2.0.0-rc5 > StructType in Spark 1.6.2 accepts the Python type, which is very > handy. 2.0.0-rc5 does not and throws an error. > I don't know if this was intended but I'd advocate for this behaviour to > remain the same. MapType is probably wasteful when your key names never > change and switching to Python tuples would be cumbersome. > Here is a minimal script to reproduce the issue: > {code} > from pyspark import SparkContext > from pyspark.sql import types as SparkTypes > from pyspark.sql import SQLContext > sc = SparkContext() > sqlc = SQLContext(sc) > struct_schema = SparkTypes.StructType([ > SparkTypes.StructField("id", SparkTypes.LongType()) > ]) > rdd = sc.parallelize([{"id": 0}, {"id": 1}]) > df = sqlc.createDataFrame(rdd, struct_schema) > print df.collect() > # 1.6.2 prints [Row(id=0), Row(id=1)] > # 2.0.0-rc5 raises TypeError: StructType can not accept object {'id': 0} in > type > {code} > Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16700) StructType doesn't accept Python dicts anymore
Sylvain Zimmer created SPARK-16700: -- Summary: StructType doesn't accept Python dicts anymore Key: SPARK-16700 URL: https://issues.apache.org/jira/browse/SPARK-16700 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.0 Reporter: Sylvain Zimmer Hello, I found this issue while testing my codebase with 2.0.0-rc5 StructType in Spark 1.6.2 accepts the Python type, which is very handy. 2.0.0-rc5 does not and throws an error. I don't know if this was intended but I'd advocate for this behaviour to remain the same. MapType is probably wasteful when your key names never change and switching to Python tuples would be cumbersome. Here is a minimal script to reproduce the issue: {code:python} from pyspark import SparkContext from pyspark.sql import types as SparkTypes from pyspark.sql import SQLContext sc = SparkContext() sqlc = SQLContext(sc) struct_schema = SparkTypes.StructType([ SparkTypes.StructField("id", SparkTypes.LongType()) ]) rdd = sc.parallelize([{"id": 0}, {"id": 1}]) df = sqlc.createDataFrame(rdd, struct_schema) print df.collect() # 1.6.2 prints [Row(id=0), Row(id=1)] # 2.0.0-rc5 raises TypeError: StructType can not accept object {'id': 0} in type {code} Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16699) Fix performance bug in hash aggregate on long string keys
[ https://issues.apache.org/jira/browse/SPARK-16699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391239#comment-15391239 ] Dongjoon Hyun commented on SPARK-16699: --- Hi, [~qifan]. Nice catch! By the way, usually, only committers set `FIX VERSION` field. You had better leave it blank next time. :) > Fix performance bug in hash aggregate on long string keys > - > > Key: SPARK-16699 > URL: https://issues.apache.org/jira/browse/SPARK-16699 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Qifan Pu > Fix For: 2.0.0 > > > In the following code in `VectorizedHashMapGenerator.scala`: > ``` > def hashBytes(b: String): String = { > val hash = ctx.freshName("hash") > s""" > |int $result = 0; > |for (int i = 0; i < $b.length; i++) { > | ${genComputeHash(ctx, s"$b[i]", ByteType, hash)} > | $result = ($result ^ (0x9e3779b9)) + $hash + ($result << 6) + > ($result >>> 2); > |} >""".stripMargin > } > ``` > when b=input.getBytes(), the current 2.0 code results in getBytes() being > called n times, n being length of input. getBytes() involves memory copy is > thus expensive and causes a performance degradation. > Fix is to evaluate getBytes() before the for loop. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
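The description above ends with "evaluate getBytes() before the for loop"; the sketch below shows what that hoisting looks like. It mirrors the generated-code template quoted in the issue, with `ctx`, `result`, `ByteType` and `genComputeHash` coming from the surrounding VectorizedHashMapGenerator class, so it is a fragment rather than a standalone program.
{code}
def hashBytes(b: String): String = {
  val hash = ctx.freshName("hash")
  val bytes = ctx.freshName("bytes")
  s"""
     |int $result = 0;
     |// bind the byte array once, so that $b (e.g. input.getBytes()) is evaluated a single
     |// time instead of once per iteration of the generated Java loop
     |byte[] $bytes = $b;
     |for (int i = 0; i < $bytes.length; i++) {
     |  ${genComputeHash(ctx, s"$bytes[i]", ByteType, hash)}
     |  $result = ($result ^ (0x9e3779b9)) + $hash + ($result << 6) + ($result >>> 2);
     |}
   """.stripMargin
}
{code}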
[jira] [Resolved] (SPARK-16645) rename CatalogStorageFormat.serdeProperties to properties
[ https://issues.apache.org/jira/browse/SPARK-16645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-16645. - Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14283 [https://github.com/apache/spark/pull/14283] > rename CatalogStorageFormat.serdeProperties to properties > - > > Key: SPARK-16645 > URL: https://issues.apache.org/jira/browse/SPARK-16645 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Minor > Fix For: 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16698) json parsing regression - "." in keys
[ https://issues.apache.org/jira/browse/SPARK-16698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391231#comment-15391231 ] Hyukjin Kwon commented on SPARK-16698: -- It seems it does not work for all `FileFormat` data sources. The code below also does not work. {code} test("SPARK-16698 - csv parsing regression - "." in keys") { withTempPath { path => val csv =""" |a.b |123 """.stripMargin spark.sparkContext .parallelize(csv :: Nil) .saveAsTextFile(path.getAbsolutePath) spark.read .option("inferSchema", "true") .option("header", "true") .csv(path.getAbsolutePath).select("`a.b`").show() } } {code} This is related with https://github.com/apache/spark/blob/37f3be5d29192db0a54f6c4699237b149bd0ecae/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L88-L89 > json parsing regression - "." in keys > - > > Key: SPARK-16698 > URL: https://issues.apache.org/jira/browse/SPARK-16698 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: TobiasP > > The commit 83775bc78e183791f75a99cdfbcd68a67ca0d472 "\[SPARK-14158]\[SQL] > implement buildReader for json data source" breaks parsing of json files with > "." in keys. > E.g. the test input for spark-solr > https://github.com/lucidworks/spark-solr/blob/master/src/test/resources/test-data/events.json > {noformat} > scala> > sqlContext.read.json("src/test/resources/test-data/events.json").collectAsList > org.apache.spark.sql.AnalysisException: Unable to resolve params.title_s > given [_version_, count_l, doc_id_s, flag_s, id, params.title_s, > params.url_s, session_id_s, timestamp_tdt, type_s, tz_timestamp_txt, > user_id_s]; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:131) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:131) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:130) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:126) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) > at org.apache.spark.sql.types.StructType.map(StructType.scala:94) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:126) > at > org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:80) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54) > at > 
org.apache.spark.sql.execution.SparkStrategies$SpecialLimits$.apply(SparkStrategies.scala:53) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:52) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:50) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:57) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:57) > at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2321) > at org.apache.spark.sql.Dataset.collectAsList(Dataset.scala:2040) > ... 49 elided > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional comma
[jira] [Commented] (SPARK-16698) json parsing regression - "." in keys
[ https://issues.apache.org/jira/browse/SPARK-16698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391217#comment-15391217 ] Hyukjin Kwon commented on SPARK-16698: -- FYI, this does not happen when it is read from json RDD. Let me please leave a code to reproduce that self-contains the issue. This does not work (in `JsonSuite.scala`) {code} test("SPARK-16698 - json parsing regression - "." in keys") { withTempPath { path => val json =""" {"a.b":"data"}""" spark.sparkContext .parallelize(json :: Nil) .saveAsTextFile(path.getAbsolutePath) spark.read.json(path.getAbsolutePath).collect() } } {code} This works {code} test("SPARK-16698 - json parsing regression - "." in keys") { withTempPath { path => val json =""" {"a.b":"data"}""" val rdd = spark.sparkContext .parallelize(json :: Nil) spark.read.json(rdd).collect() } } {code} > json parsing regression - "." in keys > - > > Key: SPARK-16698 > URL: https://issues.apache.org/jira/browse/SPARK-16698 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: TobiasP > > The commit 83775bc78e183791f75a99cdfbcd68a67ca0d472 "\[SPARK-14158]\[SQL] > implement buildReader for json data source" breaks parsing of json files with > "." in keys. > E.g. the test input for spark-solr > https://github.com/lucidworks/spark-solr/blob/master/src/test/resources/test-data/events.json > {noformat} > scala> > sqlContext.read.json("src/test/resources/test-data/events.json").collectAsList > org.apache.spark.sql.AnalysisException: Unable to resolve params.title_s > given [_version_, count_l, doc_id_s, flag_s, id, params.title_s, > params.url_s, session_id_s, timestamp_tdt, type_s, tz_timestamp_txt, > user_id_s]; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:131) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:131) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:130) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:126) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) > at org.apache.spark.sql.types.StructType.map(StructType.scala:94) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:126) > at > org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:80) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54) > at > 
org.apache.spark.sql.execution.SparkStrategies$SpecialLimits$.apply(SparkStrategies.scala:53) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:52) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:50) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:57) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:57) > at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2321) > at org.apache.spark.sql.Dataset.collectAsList(Dataset.scala:2040) > ... 49 elided > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For add
[jira] [Assigned] (SPARK-16699) Fix performance bug in hash aggregate on long string keys
[ https://issues.apache.org/jira/browse/SPARK-16699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16699: Assignee: Apache Spark > Fix performance bug in hash aggregate on long string keys > - > > Key: SPARK-16699 > URL: https://issues.apache.org/jira/browse/SPARK-16699 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Qifan Pu >Assignee: Apache Spark > Fix For: 2.0.0 > > > In the following code in `VectorizedHashMapGenerator.scala`: > ``` > def hashBytes(b: String): String = { > val hash = ctx.freshName("hash") > s""" > |int $result = 0; > |for (int i = 0; i < $b.length; i++) { > | ${genComputeHash(ctx, s"$b[i]", ByteType, hash)} > | $result = ($result ^ (0x9e3779b9)) + $hash + ($result << 6) + > ($result >>> 2); > |} >""".stripMargin > } > ``` > when b=input.getBytes(), the current 2.0 code results in getBytes() being > called n times, n being length of input. getBytes() involves memory copy is > thus expensive and causes a performance degradation. > Fix is to evaluate getBytes() before the for loop. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16699) Fix performance bug in hash aggregate on long string keys
[ https://issues.apache.org/jira/browse/SPARK-16699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391191#comment-15391191 ] Apache Spark commented on SPARK-16699: -- User 'ooq' has created a pull request for this issue: https://github.com/apache/spark/pull/14337 > Fix performance bug in hash aggregate on long string keys > - > > Key: SPARK-16699 > URL: https://issues.apache.org/jira/browse/SPARK-16699 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Qifan Pu > Fix For: 2.0.0 > > > In the following code in `VectorizedHashMapGenerator.scala`: > ``` > def hashBytes(b: String): String = { > val hash = ctx.freshName("hash") > s""" > |int $result = 0; > |for (int i = 0; i < $b.length; i++) { > | ${genComputeHash(ctx, s"$b[i]", ByteType, hash)} > | $result = ($result ^ (0x9e3779b9)) + $hash + ($result << 6) + > ($result >>> 2); > |} >""".stripMargin > } > ``` > when b=input.getBytes(), the current 2.0 code results in getBytes() being > called n times, n being length of input. getBytes() involves memory copy is > thus expensive and causes a performance degradation. > Fix is to evaluate getBytes() before the for loop. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16699) Fix performance bug in hash aggregate on long string keys
[ https://issues.apache.org/jira/browse/SPARK-16699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16699: Assignee: (was: Apache Spark) > Fix performance bug in hash aggregate on long string keys > - > > Key: SPARK-16699 > URL: https://issues.apache.org/jira/browse/SPARK-16699 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Qifan Pu > Fix For: 2.0.0 > > > In the following code in `VectorizedHashMapGenerator.scala`: > ``` > def hashBytes(b: String): String = { > val hash = ctx.freshName("hash") > s""" > |int $result = 0; > |for (int i = 0; i < $b.length; i++) { > | ${genComputeHash(ctx, s"$b[i]", ByteType, hash)} > | $result = ($result ^ (0x9e3779b9)) + $hash + ($result << 6) + > ($result >>> 2); > |} >""".stripMargin > } > ``` > when b=input.getBytes(), the current 2.0 code results in getBytes() being > called n times, n being length of input. getBytes() involves memory copy is > thus expensive and causes a performance degradation. > Fix is to evaluate getBytes() before the for loop. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16699) Fix performance bug in hash aggregate on long string keys
[ https://issues.apache.org/jira/browse/SPARK-16699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qifan Pu updated SPARK-16699: - Description: In the following code in `VectorizedHashMapGenerator.scala`: ``` def hashBytes(b: String): String = { val hash = ctx.freshName("hash") s""" |int $result = 0; |for (int i = 0; i < $b.length; i++) { | ${genComputeHash(ctx, s"$b[i]", ByteType, hash)} | $result = ($result ^ (0x9e3779b9)) + $hash + ($result << 6) + ($result >>> 2); |} """.stripMargin } ``` when b=input.getBytes(), the current 2.0 code results in getBytes() being called n times, n being length of input. getBytes() involves memory copy is thus expensive and causes a performance degradation. Fix is to evaluate getBytes() before the for loop. was: In the following code in `VectorizedHashMapGenerator.scala`: ``` def hashBytes(b: String): String = { val hash = ctx.freshName("hash") val bytes = ctx.freshName("bytes") s""" |int $result = 0; |byte[] $bytes = $b; |for (int i = 0; i < $bytes.length; i++) { | ${genComputeHash(ctx, s"$bytes[i]", ByteType, hash)} | $result = ($result ^ (0x9e3779b9)) + $hash + ($result << 6) + ($result >>> 2); |} """.stripMargin } ``` when b=input.getBytes(), the current 2.0 code results in getBytes() being called n times, n being length of input. getBytes() involves memory copy is thus expensive and causes a performance degradation. Fix is to evaluate getBytes() before the for loop. > Fix performance bug in hash aggregate on long string keys > - > > Key: SPARK-16699 > URL: https://issues.apache.org/jira/browse/SPARK-16699 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Qifan Pu > Fix For: 2.0.0 > > > In the following code in `VectorizedHashMapGenerator.scala`: > ``` > def hashBytes(b: String): String = { > val hash = ctx.freshName("hash") > s""" > |int $result = 0; > |for (int i = 0; i < $b.length; i++) { > | ${genComputeHash(ctx, s"$b[i]", ByteType, hash)} > | $result = ($result ^ (0x9e3779b9)) + $hash + ($result << 6) + > ($result >>> 2); > |} >""".stripMargin > } > ``` > when b=input.getBytes(), the current 2.0 code results in getBytes() being > called n times, n being length of input. getBytes() involves memory copy is > thus expensive and causes a performance degradation. > Fix is to evaluate getBytes() before the for loop. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16699) Fix performance bug in hash aggregate on long string keys
Qifan Pu created SPARK-16699: Summary: Fix performance bug in hash aggregate on long string keys Key: SPARK-16699 URL: https://issues.apache.org/jira/browse/SPARK-16699 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.0 Reporter: Qifan Pu Fix For: 2.0.0 In the following code in `VectorizedHashMapGenerator.scala`: ``` def hashBytes(b: String): String = { val hash = ctx.freshName("hash") val bytes = ctx.freshName("bytes") s""" |int $result = 0; |byte[] $bytes = $b; |for (int i = 0; i < $bytes.length; i++) { | ${genComputeHash(ctx, s"$bytes[i]", ByteType, hash)} | $result = ($result ^ (0x9e3779b9)) + $hash + ($result << 6) + ($result >>> 2); |} """.stripMargin } ``` when b=input.getBytes(), the current 2.0 code results in getBytes() being called n times, n being length of input. getBytes() involves memory copy is thus expensive and causes a performance degradation. Fix is to evaluate getBytes() before the for loop. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16685) audit release docs are ambiguous
[ https://issues.apache.org/jira/browse/SPARK-16685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391168#comment-15391168 ] Patrick Wendell commented on SPARK-16685: - These scripts are pretty old and I'm not sure if anyone still uses them. I had written them a while back as sanity tests for some release builds. Today, those things are tested broadly by the community so I think this has become redundant. [~rxin] are these still used? If not, it might be good to remove them from the source repo. > audit release docs are ambiguous > > > Key: SPARK-16685 > URL: https://issues.apache.org/jira/browse/SPARK-16685 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 1.6.2 >Reporter: jay vyas >Priority: Minor > > The dev/audit-release tooling is ambiguous. > - should it run against a real cluster? if so when? > - what should be in the release repo? Just jars? tarballs? ( i assume jars > because its a .ivy, but not sure). > - > https://github.com/apache/spark/tree/master/dev/audit-release -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16698) json parsing regression - "." in keys
TobiasP created SPARK-16698: --- Summary: json parsing regression - "." in keys Key: SPARK-16698 URL: https://issues.apache.org/jira/browse/SPARK-16698 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: TobiasP The commit 83775bc78e183791f75a99cdfbcd68a67ca0d472 "\[SPARK-14158]\[SQL] implement buildReader for json data source" breaks parsing of json files with "." in keys. E.g. the test input for spark-solr https://github.com/lucidworks/spark-solr/blob/master/src/test/resources/test-data/events.json {noformat} scala> sqlContext.read.json("src/test/resources/test-data/events.json").collectAsList org.apache.spark.sql.AnalysisException: Unable to resolve params.title_s given [_version_, count_l, doc_id_s, flag_s, id, params.title_s, params.url_s, session_id_s, timestamp_tdt, type_s, tz_timestamp_txt, user_id_s]; at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:131) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:131) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:130) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:126) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.Iterator$class.foreach(Iterator.scala:742) at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94) at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) at org.apache.spark.sql.types.StructType.map(StructType.scala:94) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:126) at org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:80) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396) at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59) at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54) at org.apache.spark.sql.execution.SparkStrategies$SpecialLimits$.apply(SparkStrategies.scala:53) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396) at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59) at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:52) at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:50) at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:57) at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:57) at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2321) at org.apache.spark.sql.Dataset.collectAsList(Dataset.scala:2040) ... 
49 elided {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16589) Chained cartesian produces incorrect number of records
[ https://issues.apache.org/jira/browse/SPARK-16589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391108#comment-15391108 ] Maciej Szymkiewicz commented on SPARK-16589: [~holdenk] Makes sense. I was thinking more about design than other possible issues but it is probably better safe than sorry. It still should be fixed as fast as possible. It is really ugly bug and is easy to miss. I doubt there are many legitimate cases when one can do something like this though (I guess this is why it hasn't been reported before). > Chained cartesian produces incorrect number of records > -- > > Key: SPARK-16589 > URL: https://issues.apache.org/jira/browse/SPARK-16589 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.4.0, 1.5.0, 1.6.0, 2.0.0 >Reporter: Maciej Szymkiewicz > > Chaining cartesian calls in PySpark results in the number of records lower > than expected. It can be reproduced as follows: > {code} > rdd = sc.parallelize(range(10), 1) > rdd.cartesian(rdd).cartesian(rdd).count() > ## 355 > rdd.cartesian(rdd).cartesian(rdd).distinct().count() > ## 251 > {code} > It looks like it is related to serialization. If we reserialize after initial > cartesian: > {code} > rdd.cartesian(rdd)._reserialize(BatchedSerializer(PickleSerializer(), > 1)).cartesian(rdd).count() > ## 1000 > {code} > or insert identity map: > {code} > rdd.cartesian(rdd).map(lambda x: x).cartesian(rdd).count() > ## 1000 > {code} > it yields correct results. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12378) CREATE EXTERNAL TABLE AS SELECT EXPORT AWS S3 ERROR
[ https://issues.apache.org/jira/browse/SPARK-12378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15387889#comment-15387889 ] Chandana Sapparapu edited comment on SPARK-12378 at 7/24/16 3:46 PM: - UPDATE - For temporary table, I don't have this issue. For external table - the following syntax seems to work. FROM (select a.* ,b.* from TBL_A a , TBL_B b where a.id=b.id) INPUT insert overwrite TABLE MY_EXTERNAL_TABLE PARTITION (PARTITION_DT='2016-07-24' , SUB_PARTITION_DT='2016-07-24') select * where flg_1='N'; -- Environment: Spark 1.6.2 AWS EMR 4.7.2 Hive execution 1.2.1 Metastore: 1.0.0 Query : insert into table (managed) select a.*, b.* from external_table1 a LEFT OUTER JOIN external_table2 b on a.id=b.id; The statement succeeds and some data is written into this managed table, but it throws the following exception which stops the subsequent statements from getting executed. java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.sql.hive.client.Shim_v0_14.loadTable(HiveShim.scala:442) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadTable$1.apply$mcV$sp(ClientWrapper.scala:557) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadTable$1.apply(ClientWrapper.scala:557) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadTable$1.apply(ClientWrapper.scala:557) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:290) at org.apache.spark.sql.hive.client.ClientWrapper.liftedTree1$1(ClientWrapper.scala:237) at org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:236) at org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:279) at org.apache.spark.sql.hive.client.ClientWrapper.loadTable(ClientWrapper.scala:556) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:256) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:127) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:276) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) at org.apache.spark.sql.DataFrame.(DataFrame.scala:145) at org.apache.spark.sql.DataFrame.(DataFrame.scala:130) at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817) at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:311) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:226) at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to alter table. Invalid method name: 'alter_table_with_cascade' at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:50
[jira] [Commented] (SPARK-16695) compile spark with scala 2.11 is not working (with Sbt)
[ https://issues.apache.org/jira/browse/SPARK-16695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391088#comment-15391088 ] Sean Owen commented on SPARK-16695: --- `sbt -Dscala-2.11 package` works for 1.6.2 on my Mac. So does `dependencyTree`. The plugin is not in namespace net.virtualvoid but net.virtual-void. I imagine you have some other problem that's not Spark related? or at least, you haven't reported an error here. > compile spark with scala 2.11 is not working (with Sbt) > --- > > Key: SPARK-16695 > URL: https://issues.apache.org/jira/browse/SPARK-16695 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.6.2 > Environment: OS: Mac >Reporter: Eran Mizrahi > > tried to compile spark 1.6.2 with sbt for Scala 2.11, but got compilation > error. > it seems that there is a wrong import in file Project/SparkBuild.scala. > instead of "net.virtualvoid.sbt.graph.Plugin.graphSettings", > it should be "net.virtualvoid.sbt.graph. > DependencyGraphSettings.graphSettings". > Checkout source code of "sbt-dependency-graph" plugin: > https://github.com/jrudolph/sbt-dependency-graph/blob/master/src/main/scala/net/virtualvoid/sbt/graph/DependencyGraphSettings.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3246) Support weighted SVMWithSGD for classification of unbalanced dataset
[ https://issues.apache.org/jira/browse/SPARK-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391082#comment-15391082 ] Mohamed Baddar commented on SPARK-3246: --- [~sheridanrawlins] Working on it soon, most probably on 1st of August > Support weighted SVMWithSGD for classification of unbalanced dataset > > > Key: SPARK-3246 > URL: https://issues.apache.org/jira/browse/SPARK-3246 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 0.9.0, 1.0.2 >Reporter: mahesh bhole > > Please support weighted SVMWithSGD for binary classification of unbalanced > dataset.Though other options like undersampling or oversampling can be > used,It will be good if we can have a way to assign weights to minority > class. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16696) unused broadcast variables should call destroy instead of unpersist
[ https://issues.apache.org/jira/browse/SPARK-16696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-16696: --- Issue Type: Improvement (was: Bug) > unused broadcast variables should call destroy instead of unpersist > --- > > Key: SPARK-16696 > URL: https://issues.apache.org/jira/browse/SPARK-16696 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.0.1, 2.1.0 >Reporter: Weichen Xu > Original Estimate: 1m > Remaining Estimate: 1m > > Unused broadcast variables should call destroy() instead of unpersist() so > that the memory can released in time, even in driver-side. > currently, several algorithm in ML, such as KMeans, Word2Vec, has this > problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16697) redundant RDD computation in LDAOptimizer
[ https://issues.apache.org/jira/browse/SPARK-16697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16697: Assignee: Apache Spark > redundant RDD computation in LDAOptimizer > - > > Key: SPARK-16697 > URL: https://issues.apache.org/jira/browse/SPARK-16697 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.0.1, 2.1.0 >Reporter: Weichen Xu >Assignee: Apache Spark > Original Estimate: 24h > Remaining Estimate: 24h > > In mllib.clustering.LDAOptimizer > the submitMiniBatch method, > the stats: RDD do not persist but the following code will use it twice. > so it cause redundant computation on it. > and there is another problem, > the expElogbetaBc broadcast variable is unpersist too early, > and the next statement > ` > val gammat: BDM[Double] = breeze.linalg.DenseMatrix.vertcat(val gammat: > BDM[Double] = breeze.linalg.DenseMatrix.vertcat( >stats.map(_._2).flatMap(list => list).collect().map(_.toDenseMatrix): > _*) > ` > will re-compute the stats RDD, it will use expElogbetaBc broadcast variable > again, > so the expElogbetaBc broadcast variable will be broadcast again. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
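A sketch of the reordering the issue asks for, using the local names from LDAOptimizer.submitMiniBatch (`stats`, `expElogbetaBc`, `gammat`) as they appear in the description; the storage level and the exact placement of the calls are illustrative assumptions, not the actual patch.
{code}
import breeze.linalg.{DenseMatrix => BDM}
import org.apache.spark.storage.StorageLevel

// Persist stats before the first job that consumes it, so the second traversal below does
// not recompute the lineage (which would also re-broadcast expElogbetaBc).
stats.persist(StorageLevel.MEMORY_AND_DISK)

// ... the first use of stats (an aggregation over its first component) happens here ...

val gammat: BDM[Double] = breeze.linalg.DenseMatrix.vertcat(
  stats.map(_._2).flatMap(list => list).collect().map(_.toDenseMatrix): _*)

// Release resources only after the last job that reads them.
stats.unpersist()
expElogbetaBc.unpersist()
{code}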
[jira] [Commented] (SPARK-16697) redundant RDD computation in LDAOptimizer
[ https://issues.apache.org/jira/browse/SPARK-16697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391067#comment-15391067 ] Apache Spark commented on SPARK-16697: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/14335 > redundant RDD computation in LDAOptimizer > - > > Key: SPARK-16697 > URL: https://issues.apache.org/jira/browse/SPARK-16697 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.0.1, 2.1.0 >Reporter: Weichen Xu > Original Estimate: 24h > Remaining Estimate: 24h > > In mllib.clustering.LDAOptimizer > the submitMiniBatch method, > the stats: RDD do not persist but the following code will use it twice. > so it cause redundant computation on it. > and there is another problem, > the expElogbetaBc broadcast variable is unpersist too early, > and the next statement > ` > val gammat: BDM[Double] = breeze.linalg.DenseMatrix.vertcat(val gammat: > BDM[Double] = breeze.linalg.DenseMatrix.vertcat( >stats.map(_._2).flatMap(list => list).collect().map(_.toDenseMatrix): > _*) > ` > will re-compute the stats RDD, it will use expElogbetaBc broadcast variable > again, > so the expElogbetaBc broadcast variable will be broadcast again. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16697) redundant RDD computation in LDAOptimizer
[ https://issues.apache.org/jira/browse/SPARK-16697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16697: Assignee: (was: Apache Spark) > redundant RDD computation in LDAOptimizer > - > > Key: SPARK-16697 > URL: https://issues.apache.org/jira/browse/SPARK-16697 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.0.1, 2.1.0 >Reporter: Weichen Xu > Original Estimate: 24h > Remaining Estimate: 24h > > In mllib.clustering.LDAOptimizer > the submitMiniBatch method, > the stats: RDD do not persist but the following code will use it twice. > so it cause redundant computation on it. > and there is another problem, > the expElogbetaBc broadcast variable is unpersist too early, > and the next statement > ` > val gammat: BDM[Double] = breeze.linalg.DenseMatrix.vertcat(val gammat: > BDM[Double] = breeze.linalg.DenseMatrix.vertcat( >stats.map(_._2).flatMap(list => list).collect().map(_.toDenseMatrix): > _*) > ` > will re-compute the stats RDD, it will use expElogbetaBc broadcast variable > again, > so the expElogbetaBc broadcast variable will be broadcast again. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16697) redundant RDD computation in LDAOptimizer
[ https://issues.apache.org/jira/browse/SPARK-16697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-16697: --- Description: In mllib.clustering.LDAOptimizer the submitMiniBatch method, the stats: RDD do not persist but the following code will use it twice. so it cause redundant computation on it. and there is another problem, the expElogbetaBc broadcast variable is unpersist too early, and the next statement ` val gammat: BDM[Double] = breeze.linalg.DenseMatrix.vertcat(val gammat: BDM[Double] = breeze.linalg.DenseMatrix.vertcat( stats.map(_._2).flatMap(list => list).collect().map(_.toDenseMatrix): _*) ` will re-compute the stats RDD, it will use expElogbetaBc broadcast variable again, so the expElogbetaBc broadcast variable will be broadcast again. was: In mllib.clustering.LDAOptimizer the submitMiniBatch method, the stats: RDD do not persist but the following code will use it twice. so it cause redundant computation on it. > redundant RDD computation in LDAOptimizer > - > > Key: SPARK-16697 > URL: https://issues.apache.org/jira/browse/SPARK-16697 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.0.1, 2.1.0 >Reporter: Weichen Xu > Original Estimate: 24h > Remaining Estimate: 24h > > In mllib.clustering.LDAOptimizer > the submitMiniBatch method, > the stats: RDD do not persist but the following code will use it twice. > so it cause redundant computation on it. > and there is another problem, > the expElogbetaBc broadcast variable is unpersist too early, > and the next statement > ` > val gammat: BDM[Double] = breeze.linalg.DenseMatrix.vertcat(val gammat: > BDM[Double] = breeze.linalg.DenseMatrix.vertcat( >stats.map(_._2).flatMap(list => list).collect().map(_.toDenseMatrix): > _*) > ` > will re-compute the stats RDD, it will use expElogbetaBc broadcast variable > again, > so the expElogbetaBc broadcast variable will be broadcast again. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16697) redundant RDD computation in LDAOptimizer
Weichen Xu created SPARK-16697: -- Summary: redundant RDD computation in LDAOptimizer Key: SPARK-16697 URL: https://issues.apache.org/jira/browse/SPARK-16697 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 2.0.1, 2.1.0 Reporter: Weichen Xu In the submitMiniBatch method of mllib.clustering.LDAOptimizer, the stats RDD is not persisted but the following code uses it twice, so it causes redundant computation on it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-16696) unused broadcast variables should call destroy instead of unpersist
[ https://issues.apache.org/jira/browse/SPARK-16696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-16696: --- Description: Unused broadcast variables should call destroy() instead of unpersist() so that the memory can released in time, even in driver-side. currently, several algorithm in ML, such as KMeans, Word2Vec, has this problem. was: Unused broadcast variables should call destroy() instead of unpersist() so that the memory can released in time, even in driver-side. currently, several algorithm in ML, such as KMeans, LDA, Word2Vec, has this problem. Several places unused broadcast vars even neither call unpersit() nor call destroy() > unused broadcast variables should call destroy instead of unpersist > --- > > Key: SPARK-16696 > URL: https://issues.apache.org/jira/browse/SPARK-16696 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.1, 2.1.0 >Reporter: Weichen Xu > Original Estimate: 1m > Remaining Estimate: 1m > > Unused broadcast variables should call destroy() instead of unpersist() so > that the memory can released in time, even in driver-side. > currently, several algorithm in ML, such as KMeans, Word2Vec, has this > problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16696) unused broadcast variables should call destroy instead of unpersist
[ https://issues.apache.org/jira/browse/SPARK-16696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391050#comment-15391050 ] Apache Spark commented on SPARK-16696: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/14333 > unused broadcast variables should call destroy instead of unpersist > --- > > Key: SPARK-16696 > URL: https://issues.apache.org/jira/browse/SPARK-16696 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.1, 2.1.0 >Reporter: Weichen Xu > Original Estimate: 1m > Remaining Estimate: 1m > > Unused broadcast variables should call destroy() instead of unpersist() so > that the memory can released in time, even in driver-side. > currently, several algorithm in ML, such as KMeans, LDA, Word2Vec, has this > problem. Several places unused broadcast vars even neither call unpersit() > nor call destroy() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16696) unused broadcast variables should call destroy instead of unpersist
[ https://issues.apache.org/jira/browse/SPARK-16696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16696: Assignee: (was: Apache Spark) > unused broadcast variables should call destroy instead of unpersist > --- > > Key: SPARK-16696 > URL: https://issues.apache.org/jira/browse/SPARK-16696 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.1, 2.1.0 >Reporter: Weichen Xu > Original Estimate: 1m > Remaining Estimate: 1m > > Unused broadcast variables should call destroy() instead of unpersist() so > that the memory can released in time, even in driver-side. > currently, several algorithm in ML, such as KMeans, LDA, Word2Vec, has this > problem. Several places unused broadcast vars even neither call unpersit() > nor call destroy() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16696) unused broadcast variables should call destroy instead of unpersist
[ https://issues.apache.org/jira/browse/SPARK-16696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16696: Assignee: Apache Spark > unused broadcast variables should call destroy instead of unpersist > --- > > Key: SPARK-16696 > URL: https://issues.apache.org/jira/browse/SPARK-16696 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.1, 2.1.0 >Reporter: Weichen Xu >Assignee: Apache Spark > Original Estimate: 1m > Remaining Estimate: 1m > > Unused broadcast variables should call destroy() instead of unpersist() so > that the memory can released in time, even in driver-side. > currently, several algorithm in ML, such as KMeans, LDA, Word2Vec, has this > problem. Several places unused broadcast vars even neither call unpersit() > nor call destroy() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16696) unused broadcast variables should call destroy instead of unpersist
Weichen Xu created SPARK-16696: -- Summary: unused broadcast variables should call destroy instead of unpersist Key: SPARK-16696 URL: https://issues.apache.org/jira/browse/SPARK-16696 Project: Spark Issue Type: Bug Components: ML, MLlib Affects Versions: 2.0.1, 2.1.0 Reporter: Weichen Xu Unused broadcast variables should call destroy() instead of unpersist() so that the memory can be released in time, even on the driver side. Currently, several algorithms in ML, such as KMeans, LDA, and Word2Vec, have this problem. In several places, unused broadcast variables call neither unpersist() nor destroy(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
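For context on the distinction the issue draws, here is a small self-contained sketch; the app name and sample data are made up. unpersist() only drops the executor-side copies and the broadcast can still be re-sent by a later job, while destroy() also releases the driver-side copy and makes further use of the variable an error, which is why it frees memory sooner.
{code}
import org.apache.spark.sql.SparkSession

object BroadcastCleanupSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("broadcast-cleanup-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Broadcast some small lookup data (here: cluster centers) to the executors.
    val centers = sc.broadcast(Array(1.0, 2.0, 3.0))

    val assignments = sc.parallelize(Seq(0.9, 2.1, 3.2))
      .map(x => centers.value.minBy(c => math.abs(c - x)))
      .collect()
    println(assignments.mkString(", "))

    // centers.unpersist() would only drop the executor-side copies; the variable could still
    // be re-broadcast by a later job. Once no future job reads it, destroy() also frees the
    // driver-side copy, which is the behaviour the issue asks for.
    centers.destroy()

    spark.stop()
  }
}
{code}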
[jira] [Created] (SPARK-16695) compile spark with scala 2.11 is not working (with Sbt)
Eran Mizrahi created SPARK-16695: Summary: compile spark with scala 2.11 is not working (with Sbt) Key: SPARK-16695 URL: https://issues.apache.org/jira/browse/SPARK-16695 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.6.2 Environment: OS: Mac Reporter: Eran Mizrahi Tried to compile Spark 1.6.2 with sbt for Scala 2.11, but got a compilation error. It seems that there is a wrong import in project/SparkBuild.scala: instead of "net.virtualvoid.sbt.graph.Plugin.graphSettings", it should be "net.virtualvoid.sbt.graph.DependencyGraphSettings.graphSettings". Check out the source code of the "sbt-dependency-graph" plugin: https://github.com/jrudolph/sbt-dependency-graph/blob/master/src/main/scala/net/virtualvoid/sbt/graph/DependencyGraphSettings.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
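A hedged sketch of the change being suggested (not a verified patch; the surrounding build definitions are omitted):
{code}
// Hypothetical excerpt from project/SparkBuild.scala:

// before -- compiles only against older sbt-dependency-graph releases:
// import net.virtualvoid.sbt.graph.Plugin.graphSettings

// after -- the object was moved in newer plugin versions (see the link above):
import net.virtualvoid.sbt.graph.DependencyGraphSettings.graphSettings
{code}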
[jira] [Updated] (SPARK-16416) Logging in shutdown hook does not work properly with Log4j 2.x
[ https://issues.apache.org/jira/browse/SPARK-16416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-16416: -- Assignee: Mikael Ståldal > Logging in shutdown hook does not work properly with Log4j 2.x > -- > > Key: SPARK-16416 > URL: https://issues.apache.org/jira/browse/SPARK-16416 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2 >Reporter: Mikael Ståldal >Assignee: Mikael Ståldal >Priority: Minor > Fix For: 2.1.0 > > > Spark registers some shutdown hooks, and they log messages during shutdown: > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/ShutdownHookManager.scala#L58 > Since the {{Logging}} trait creates SLF4J loggers lazily: > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala#L47 > a SLF4J logger is created during the execution of the shutdown hook. > This does not work when Log4j 2.x is used as SLF4J implementation: > https://issues.apache.org/jira/browse/LOG4J2-1222 > Even though Log4j 2.6 handles this more gracefully than before, it still does > emit a warning and will not be able to process the log message properly. > Proposed solution: make sure to eagerly create the SLF4J logger to be used in > shutdown hooks when registering the hook. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16416) Logging in shutdown hook does not work properly with Log4j 2.x
[ https://issues.apache.org/jira/browse/SPARK-16416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16416. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14320 [https://github.com/apache/spark/pull/14320] > Logging in shutdown hook does not work properly with Log4j 2.x > -- > > Key: SPARK-16416 > URL: https://issues.apache.org/jira/browse/SPARK-16416 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2 >Reporter: Mikael Ståldal >Priority: Minor > Fix For: 2.1.0 > > > Spark registers some shutdown hooks, and they log messages during shutdown: > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/ShutdownHookManager.scala#L58 > Since the {{Logging}} trait creates SLF4J loggers lazily: > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala#L47 > a SLF4J logger is created during the execution of the shutdown hook. > This does not work when Log4j 2.x is used as SLF4J implementation: > https://issues.apache.org/jira/browse/LOG4J2-1222 > Even though Log4j 2.6 handles this more gracefully than before, it still does > emit a warning and will not be able to process the log message properly. > Proposed solution: make sure to eagerly create the SLF4J logger to be used in > shutdown hooks when registering the hook. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
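For illustration, a minimal sketch of the "eager logger" idea described above; the names here are placeholders rather than the actual Spark code:
{code}
import org.slf4j.{Logger, LoggerFactory}

object ShutdownHookLoggingSketch {
  // Created eagerly at initialization time, instead of lazily inside the hook.
  private val log: Logger = LoggerFactory.getLogger(getClass)

  def registerHook(): Unit = {
    log.debug("logger initialized before the JVM starts shutting down")
    Runtime.getRuntime.addShutdownHook(new Thread(new Runnable {
      // By the time this runs, the logger already exists, so Log4j 2.x does not
      // have to create it during shutdown.
      override def run(): Unit = log.info("Shutdown hook called")
    }))
  }
}
{code}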
[jira] [Updated] (SPARK-16667) Spark driver executor dont release unused memory
[ https://issues.apache.org/jira/browse/SPARK-16667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-16667: -- Target Version/s: (was: 1.6.0) > Spark driver executor dont release unused memory > > > Key: SPARK-16667 > URL: https://issues.apache.org/jira/browse/SPARK-16667 > Project: Spark > Issue Type: Bug > Components: GraphX, Spark Core >Affects Versions: 1.6.0 > Environment: Ubuntu wily 64 bits > java 1.8 > 3 slaves (4GB) and 1 master (2GB) virtual machines in VMware over an i7 4th generation > with 16 GB RAM >Reporter: Luis Angel Hernández Acosta > > I'm running a Spark app in a standalone cluster. My app creates a SparkContext and > performs many calculations with GraphX over time. To calculate, my app creates a new > Java thread and waits for its completion signal. Between calculations, memory grows > by 50-100 MB. I use a thread to make sure that any object created for a calculation > is destroyed after the calculation ends, but memory keeps growing. I tried stopping > the SparkContext: all executor memory allocated by the app is freed, but my driver's > memory still grows by the same 50-100 MB. > My graph calculation includes HDFS serialization of RDDs and loading the graph from > HDFS. > Spark env: > export SPARK_MASTER_IP=master > export SPARK_WORKER_CORES=4 > export SPARK_WORKER_MEMORY=2919m > export SPARK_WORKER_INSTANCES=1 > export SPARK_DAEMON_MEMORY=256m > export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true > -Dspark.worker.cleanup.interval=10" > Those are my only configurations -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16692) multilabel classification to DataFrame, ML
[ https://issues.apache.org/jira/browse/SPARK-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-16692: -- Target Version/s: (was: 1.6.0) > multilabel classification to DataFrame, ML > --- > > Key: SPARK-16692 > URL: https://issues.apache.org/jira/browse/SPARK-16692 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Weizhi Li >Priority: Minor > Original Estimate: 1h > Remaining Estimate: 1h > > For multilabel evaluation, there is a method in MLlib named > MultilabelMetrics. A multilabel classification problem involves mapping each > sample in a dataset to a set of class labels. In this type of classification > problem, the labels are not mutually exclusive. For example, when classifying > a set of news articles into topics, a single article might be both science > and politics. > Add this method to support DataFrames in ML. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
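For reference, a small sketch of the existing RDD-based API that this issue proposes to expose for DataFrames in ML; the data is made up and an existing SparkContext {{sc}} is assumed:
{code}
import org.apache.spark.mllib.evaluation.MultilabelMetrics

// Each record is (predicted labels, true labels); values are illustrative only.
val predictionAndLabels = sc.parallelize(Seq(
  (Array(0.0, 1.0), Array(0.0, 2.0)),
  (Array(0.0, 2.0), Array(0.0, 1.0)),
  (Array(0.0),      Array(0.0))
))

val metrics = new MultilabelMetrics(predictionAndLabels)
println(s"Micro F1:        ${metrics.microF1Measure}")
println(s"Hamming loss:    ${metrics.hammingLoss}")
println(s"Subset accuracy: ${metrics.subsetAccuracy}")
{code}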
[jira] [Updated] (SPARK-16463) Support `truncate` option in Overwrite mode for JDBC DataFrameWriter
[ https://issues.apache.org/jira/browse/SPARK-16463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-16463: -- Assignee: Dongjoon Hyun > Support `truncate` option in Overwrite mode for JDBC DataFrameWriter > - > > Key: SPARK-16463 > URL: https://issues.apache.org/jira/browse/SPARK-16463 > Project: Spark > Issue Type: Improvement >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.1.0 > > > This issue adds a boolean option, `truncate`, for SaveMode.Overwrite of the JDBC > DataFrameWriter. If this option is `true`, it uses `TRUNCATE TABLE` instead of > `DROP TABLE`. > - Without CREATE/DROP privileges, we can save a dataframe to the database; sometimes > these privileges are not granted for security reasons. > - It will keep the existing table information, so users can add and keep some > additional CONSTRAINTs for the table. > - Sometimes, TRUNCATE is faster than the combination of DROP/CREATE. > This issue is different from SPARK-16410, which aims to use `TRUNCATE` only > for JDBC sources. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16410) DataFrameWriter's jdbc method drops table in overwrite mode
[ https://issues.apache.org/jira/browse/SPARK-16410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16410. --- Resolution: Duplicate > DataFrameWriter's jdbc method drops table in overwrite mode > --- > > Key: SPARK-16410 > URL: https://issues.apache.org/jira/browse/SPARK-16410 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1, 1.6.2 >Reporter: Ian Hellstrom > > According to the [API > documentation|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter], > the write mode {{overwrite}} should _overwrite the existing data_, which > suggests that the data is removed, i.e. the table is truncated. > However, that is not what happens in the [source > code|https://github.com/apache/spark/blob/0ad6ce7e54b1d8f5946dde652fa5341d15059158/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L421]: > {code} > if (mode == SaveMode.Overwrite && tableExists) { > JdbcUtils.dropTable(conn, table) > tableExists = false > } > {code} > This clearly shows that the table is first dropped and then recreated. This > causes two major issues: > * Existing indexes, partitioning schemes, etc. are completely lost. > * The case of identifiers may be changed without the user understanding why. > In my opinion, the table should be truncated, not dropped. Overwriting data > is a DML operation and should not cause DDL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16463) Support `truncate` option in Overwrite mode for JDBC DataFrameWriter
[ https://issues.apache.org/jira/browse/SPARK-16463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16463. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14086 [https://github.com/apache/spark/pull/14086] > Support `truncate` option in Overwrite mode for JDBC DataFrameWriter > - > > Key: SPARK-16463 > URL: https://issues.apache.org/jira/browse/SPARK-16463 > Project: Spark > Issue Type: Improvement >Reporter: Dongjoon Hyun >Priority: Minor > Fix For: 2.1.0 > > > This issue adds a boolean option, `truncate`, for SaveMode.Overwrite of the JDBC > DataFrameWriter. If this option is `true`, it uses `TRUNCATE TABLE` instead of > `DROP TABLE`. > - Without CREATE/DROP privileges, we can save a dataframe to the database; sometimes > these privileges are not granted for security reasons. > - It will keep the existing table information, so users can add and keep some > additional CONSTRAINTs for the table. > - Sometimes, TRUNCATE is faster than the combination of DROP/CREATE. > This issue is different from SPARK-16410, which aims to use `TRUNCATE` only > for JDBC sources. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
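A hedged usage sketch of the new option; the connection details and the {{df}} DataFrame are made up for illustration:
{code}
import java.util.Properties
import org.apache.spark.sql.SaveMode

val props = new Properties()
props.setProperty("user", "spark")
props.setProperty("password", "secret")

// With truncate=true, Overwrite issues TRUNCATE TABLE and reuses the existing
// table definition instead of DROP TABLE followed by CREATE TABLE.
df.write
  .mode(SaveMode.Overwrite)
  .option("truncate", "true")
  .jdbc("jdbc:postgresql://dbhost:5432/testdb", "public.events", props)
{code}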
[jira] [Commented] (SPARK-16685) audit release docs are ambiguous
[ https://issues.apache.org/jira/browse/SPARK-16685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15390990#comment-15390990 ] Sean Owen commented on SPARK-16685: --- [~pwendell] I think you put this in place; do you know if {{audit-release}} is used anymore? It isn't invoked by any of the Spark machinery as far as I can tell, but maybe it is something Jenkins runs or that you had run in the past with a release. [~rxin] has touched this recently. > audit release docs are ambiguous > > > Key: SPARK-16685 > URL: https://issues.apache.org/jira/browse/SPARK-16685 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 1.6.2 >Reporter: jay vyas >Priority: Minor > > The dev/audit-release tooling is ambiguous. > - Should it run against a real cluster? If so, when? > - What should be in the release repo? Just jars? Tarballs? (I assume jars > because it's a .ivy, but not sure.) > - > https://github.com/apache/spark/tree/master/dev/audit-release -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16541) SparkTC application could not shutdown successfully
[ https://issues.apache.org/jira/browse/SPARK-16541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16541. --- Resolution: Cannot Reproduce > SparkTC application could not shutdown successfully > > > Key: SPARK-16541 > URL: https://issues.apache.org/jira/browse/SPARK-16541 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Yesha Vora > > SparkTC application in yarn-client mode was stuck at 10% progress. > {code} spark-submit --class org.apache.spark.examples.SparkTC --master > yarn-client spark-examples-assembly_*.jar {code} > It seems like SparkTC application tasks finished and printed "TC has 6254 > edges.". after that while shutting down, spark application kept getting > "ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event > SparkListenerExecutorMetricsUpdate" > {code} > 16/07/13 08:43:37 INFO DAGScheduler: ResultStage 283 (count at > SparkTC.scala:71) finished in 42.357 s > 16/07/13 08:43:37 INFO DAGScheduler: Job 13 finished: count at > SparkTC.scala:71, took 43.137408 s > TC has 6254 edges. > 16/07/13 08:43:37 INFO ServerConnector: Stopped > ServerConnector@5e0054a2{HTTP/1.1}{0.0.0.0:4040} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@7350a22{/stages/stage/kill,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@54d56a49{/api,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@2fd52e57{/,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@7737ff3{/static,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@499d9067{/executors/threadDump/json,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@40d0c2af{/executors/threadDump,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@44ce4013{/executors/json,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@59c9a28a{/executors,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@2e784443{/environment/json,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@10240ba4{/environment,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@4ee2dd22{/storage/rdd/json,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@79ab14cd{/storage/rdd,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@731d1285{/storage/json,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@72e46ea8{/storage,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@266dcdd5{/stages/pool/json,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@17ee6dd9{/stages/pool,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@717867ea{/stages/stage/json,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@74aaadcc{/stages/stage,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@14f35a42{/stages/json,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO 
ContextHandler: Stopped > o.s.j.s.ServletContextHandler@27ec74f8{/stages,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@148ad9f9{/jobs/job/json,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@14445e4c{/jobs/job,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@6d1557ff{/jobs/json,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO ContextHandler: Stopped > o.s.j.s.ServletContextHandler@aca62b1{/jobs,null,UNAVAILABLE} > 16/07/13 08:43:37 INFO SparkUI: Stopped Spark web UI at http://xx.xx.xx:4040 > 16/07/13 08:43:37 ERROR LiveListenerBus: SparkListenerBus has already > stopped! Dropping event SparkListenerExecutorMetricsUpdate(2,WrappedArray()) > 16/07/13 08:43:37 ERROR LiveListenerBus: SparkListenerBus has already > stopped! Dropping event SparkListenerExecutorMetricsUpdate(1,WrappedArray()) > 16/07/13 08:43:47 ERROR LiveListenerBus: SparkListenerBus has already > stopped! Dropping event SparkListenerExecutorMetricsUpdate(2,WrappedArray()) > 16/07/13 08:43:47 ERROR LiveListenerBus: SparkListenerBus has already > stopped! Dropping event SparkListenerExecutorMetricsUpdate(1,WrappedArray()) >
[jira] [Updated] (SPARK-16664) Spark 1.6.2 - Persist call on Data frames with more than 200 columns is wiping out the data.
[ https://issues.apache.org/jira/browse/SPARK-16664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-16664: -- Target Version/s: 2.0.1 > Spark 1.6.2 - Persist call on Data frames with more than 200 columns is > wiping out the data. > > > Key: SPARK-16664 > URL: https://issues.apache.org/jira/browse/SPARK-16664 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2 >Reporter: Satish Kolli >Priority: Blocker > > Calling persist on a data frame with more than 200 columns is removing the > data from the data frame. This is an issue in Spark 1.6.2. Works with out any > issues in Spark 1.6.1 > Following test case demonstrates problem. Please let me know if you need any > additional information. Thanks. > {code} > import org.apache.spark._ > import org.apache.spark.rdd.RDD > import org.apache.spark.sql.types._ > import org.apache.spark.sql.{Row, SQLContext} > import org.scalatest.FunSuite > class TestSpark162_1 extends FunSuite { > test("test data frame with 200 columns") { > val sparkConfig = new SparkConf() > val parallelism = 5 > sparkConfig.set("spark.default.parallelism", s"$parallelism") > sparkConfig.set("spark.sql.shuffle.partitions", s"$parallelism") > val sc = new SparkContext(s"local[3]", "TestNestedJson", sparkConfig) > val sqlContext = new SQLContext(sc) > // create dataframe with 200 columns and fake 200 values > val size = 200 > val rdd: RDD[Seq[Long]] = sc.parallelize(Seq(Seq.range(0, size))) > val rowRdd: RDD[Row] = rdd.map(d => Row.fromSeq(d)) > val schemas = List.range(0, size).map(a => StructField("name"+ a, > LongType, true)) > val testSchema = StructType(schemas) > val testDf = sqlContext.createDataFrame(rowRdd, testSchema) > // test value > assert(testDf.persist.take(1).apply(0).toSeq(100).asInstanceOf[Long] == > 100) > sc.stop() > } > test("test data frame with 201 columns") { > val sparkConfig = new SparkConf() > val parallelism = 5 > sparkConfig.set("spark.default.parallelism", s"$parallelism") > sparkConfig.set("spark.sql.shuffle.partitions", s"$parallelism") > val sc = new SparkContext(s"local[3]", "TestNestedJson", sparkConfig) > val sqlContext = new SQLContext(sc) > // create dataframe with 201 columns and fake 201 values > val size = 201 > val rdd: RDD[Seq[Long]] = sc.parallelize(Seq(Seq.range(0, size))) > val rowRdd: RDD[Row] = rdd.map(d => Row.fromSeq(d)) > val schemas = List.range(0, size).map(a => StructField("name"+ a, > LongType, true)) > val testSchema = StructType(schemas) > val testDf = sqlContext.createDataFrame(rowRdd, testSchema) > // test value > assert(testDf.persist.take(1).apply(0).toSeq(100).asInstanceOf[Long] == > 100) > sc.stop() > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16676) Spark jobs stay in pending
[ https://issues.apache.org/jira/browse/SPARK-16676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16676. --- Resolution: Not A Problem If your executors aren't starting, then you have another earlier problem. You'd have to look at your logs to see why. For example, maybe you don't have enough resources available. > Spark jobs stay in pending > -- > > Key: SPARK-16676 > URL: https://issues.apache.org/jira/browse/SPARK-16676 > Project: Spark > Issue Type: Bug > Components: MLlib, Spark Shell >Affects Versions: 1.5.2 > Environment: Mac OS X Yosemite, Terminal, Spark-shell standalone >Reporter: Joe Chong > Attachments: Spark UI stays @ pending.png > > > I've been having issues executing certain Scala statements within the > Spark-Shell. These statements are obtained through tutorial/blog written by > Carol McDonald in MapR. > The import statements, reading text files into DataFrames are OK. However, > when I try to do df.show(), the execution hits a road block. Checking the > Spark UI job, I see that the Stage's active, however, 1 of its dependent job > stays in Pending without any movement. The logs are as below. > scala> fltCountsql.show() > 16/07/22 11:40:16 INFO spark.SparkContext: Starting job: show at :46 > 16/07/22 11:40:16 INFO scheduler.DAGScheduler: Registering RDD 31 (show at > :46) > 16/07/22 11:40:16 INFO scheduler.DAGScheduler: Got job 4 (show at > :46) with 200 output partitions > 16/07/22 11:40:16 INFO scheduler.DAGScheduler: Final stage: ResultStage > 8(show at :46) > 16/07/22 11:40:16 INFO scheduler.DAGScheduler: Parents of final stage: > List(ShuffleMapStage 7) > 16/07/22 11:40:16 INFO scheduler.DAGScheduler: Missing parents: > List(ShuffleMapStage 7) > 16/07/22 11:40:16 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 7 > (MapPartitionsRDD[31] at show at :46), which has no missing parents > 16/07/22 11:40:16 INFO storage.MemoryStore: ensureFreeSpace(18128) called > with curMem=115755879, maxMem=2778495713 > 16/07/22 11:40:16 INFO storage.MemoryStore: Block broadcast_5 stored as > values in memory (estimated size 17.7 KB, free 2.5 GB) > 16/07/22 11:40:16 INFO storage.MemoryStore: ensureFreeSpace(7527) called with > curMem=115774007, maxMem=2778495713 > 16/07/22 11:40:16 INFO storage.MemoryStore: Block broadcast_5_piece0 stored > as bytes in memory (estimated size 7.4 KB, free 2.5 GB) > 16/07/22 11:40:16 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in > memory on localhost:61408 (size: 7.4 KB, free: 2.5 GB) > 16/07/22 11:40:16 INFO spark.SparkContext: Created broadcast 5 from broadcast > at DAGScheduler.scala:861 > 16/07/22 11:40:16 INFO scheduler.DAGScheduler: Submitting 2 missing tasks > from ShuffleMapStage 7 (MapPartitionsRDD[31] at show at :46) > 16/07/22 11:40:16 INFO scheduler.TaskSchedulerImpl: Adding task set 7.0 with > 2 tasks > 16/07/22 11:40:16 INFO scheduler.TaskSetManager: Starting task 0.0 in stage > 7.0 (TID 4, localhost, PROCESS_LOCAL, 2156 bytes) > 16/07/22 11:40:16 INFO executor.Executor: Running task 0.0 in stage 7.0 (TID > 4) > 16/07/22 11:40:16 INFO storage.BlockManager: Found block rdd_2_0 locally > 16/07/22 11:40:17 INFO executor.Executor: Finished task 0.0 in stage 7.0 (TID > 4). 
2738 bytes result sent to driver > 16/07/22 11:40:17 INFO scheduler.TaskSetManager: Finished task 0.0 in stage > 7.0 (TID 4) in 920 ms on localhost (1/2) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16573) executor stderr processing tools
[ https://issues.apache.org/jira/browse/SPARK-16573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16573. --- Resolution: Won't Fix > executor stderr processing tools > - > > Key: SPARK-16573 > URL: https://issues.apache.org/jira/browse/SPARK-16573 > Project: Spark > Issue Type: Improvement > Components: Input/Output > Environment: spark 1.6.0 >Reporter: Norman He > Original Estimate: 672h > Remaining Estimate: 672h > > 1) From each executor, I can view its stderr. > 2) I would like to filter the stderr of every executor (say I am interested in > the #18 output of each executor) and sum up the numbers in them by grepping a > certain field. > 3) TensorBoard has this kind of capability. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16574) Distribute computing to each node based on certain hints
[ https://issues.apache.org/jira/browse/SPARK-16574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16574. --- Resolution: Not A Problem > Distribute computing to each node based on certain hints > > > Key: SPARK-16574 > URL: https://issues.apache.org/jira/browse/SPARK-16574 > Project: Spark > Issue Type: Wish >Reporter: Norman He > > 1) I have a gpuWorkers RDD like the following (each node has 2 GPUs): > val nodes = 10 > val gpuCount = 2 > val cross: Array[(Int, Int)] = for( x <- Array.range(0, nodes); y <- > Array.range(0, gpuCount) ) yield (x, y) > var gpuWorkers: RDD[(Int, Int)] = sc.parallelize(cross, nodes * gpuCount) > 2) When the executors run, I would somehow like to distribute code to each node > based on the GPU index (y) from cross so that both GPUs on each machine can be > used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
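For what it's worth, a rough sketch of how the quoted (node, gpuIndex) pairs could drive per-task GPU selection. Spark gives no guarantee about which host a partition runs on unless locality hints are supplied, and {{selectGpu}} below is a hypothetical placeholder, not a Spark API:
{code}
// One partition per (node, gpuIndex) pair, so each task sees exactly one pair.
val results = gpuWorkers.map { case (node, gpuIndex) =>
  // selectGpu(gpuIndex)  // hypothetical: bind this task to local GPU #gpuIndex
  (node, gpuIndex, java.net.InetAddress.getLocalHost.getHostName)
}
results.collect().foreach(println)
{code}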
[jira] [Resolved] (SPARK-16601) Spark2.0 fail in creating table using sql statement "create table `db.tableName` xxx" while spark1.6 supports
[ https://issues.apache.org/jira/browse/SPARK-16601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16601. --- Resolution: Not A Problem > Spark2.0 fail in creating table using sql statement "create table > `db.tableName` xxx" while spark1.6 supports > - > > Key: SPARK-16601 > URL: https://issues.apache.org/jira/browse/SPARK-16601 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: marymwu >Priority: Minor > Attachments: error log.png > > > Spark2.0 fail in creating table using sql statement "create table > `db.tableName` xxx" while spark1.6 supports. > error log is attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
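For context, a hedged illustration of the behavior difference (assuming an existing SparkSession {{spark}}): in Spark 2.0 the backticks quote a single identifier, so `db.tableName` becomes one name containing a dot, whereas quoting the parts separately expresses the usual database-plus-table intent:
{code}
spark.sql("CREATE TABLE `db`.`tableName` (id INT)")
// spark.sql("CREATE TABLE `db.tableName` (id INT)")  // parsed differently in 2.0 than in 1.6
{code}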
[jira] [Assigned] (SPARK-16694) Use for/foreach rather than map for Unit expressions whose side effects are required
[ https://issues.apache.org/jira/browse/SPARK-16694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16694: Assignee: Apache Spark (was: Sean Owen) > Use for/foreach rather than map for Unit expressions whose side effects are > required > > > Key: SPARK-16694 > URL: https://issues.apache.org/jira/browse/SPARK-16694 > Project: Spark > Issue Type: Improvement > Components: Examples, MLlib, Spark Core, SQL, Streaming >Reporter: Sean Owen >Assignee: Apache Spark >Priority: Minor > > {{map}} is misused in many places where {{foreach}} is intended. This caused > a bug in https://issues.apache.org/jira/browse/SPARK-16664 and might be a > latent bug elsewhere; it's also easy to find with IJ inspections. Worth > patching up. > To illustrate the general problem, {{map}} happens to work in Scala where the > collection isn't lazy, but will fail to execute the code when it is. {{map}} > also causes a collection of {{Unit}} to be created pointlessly. > {code} > scala> val foo = Seq(1,2,3) > foo: Seq[Int] = List(1, 2, 3) > scala> foo.map(println) > 1 > 2 > 3 > res0: Seq[Unit] = List((), (), ()) > scala> foo.view.map(println) > res1: scala.collection.SeqView[Unit,Seq[_]] = SeqViewM(...) > scala> foo.view.foreach(println) > 1 > 2 > 3 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16694) Use for/foreach rather than map for Unit expressions whose side effects are required
[ https://issues.apache.org/jira/browse/SPARK-16694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16694: Assignee: Sean Owen (was: Apache Spark) > Use for/foreach rather than map for Unit expressions whose side effects are > required > > > Key: SPARK-16694 > URL: https://issues.apache.org/jira/browse/SPARK-16694 > Project: Spark > Issue Type: Improvement > Components: Examples, MLlib, Spark Core, SQL, Streaming >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > > {{map}} is misused in many places where {{foreach}} is intended. This caused > a bug in https://issues.apache.org/jira/browse/SPARK-16664 and might be a > latent bug elsewhere; it's also easy to find with IJ inspections. Worth > patching up. > To illustrate the general problem, {{map}} happens to work in Scala where the > collection isn't lazy, but will fail to execute the code when it is. {{map}} > also causes a collection of {{Unit}} to be created pointlessly. > {code} > scala> val foo = Seq(1,2,3) > foo: Seq[Int] = List(1, 2, 3) > scala> foo.map(println) > 1 > 2 > 3 > res0: Seq[Unit] = List((), (), ()) > scala> foo.view.map(println) > res1: scala.collection.SeqView[Unit,Seq[_]] = SeqViewM(...) > scala> foo.view.foreach(println) > 1 > 2 > 3 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16694) Use for/foreach rather than map for Unit expressions whose side effects are required
[ https://issues.apache.org/jira/browse/SPARK-16694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15390982#comment-15390982 ] Apache Spark commented on SPARK-16694: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/14332 > Use for/foreach rather than map for Unit expressions whose side effects are > required > > > Key: SPARK-16694 > URL: https://issues.apache.org/jira/browse/SPARK-16694 > Project: Spark > Issue Type: Improvement > Components: Examples, MLlib, Spark Core, SQL, Streaming >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > > {{map}} is misused in many places where {{foreach}} is intended. This caused > a bug in https://issues.apache.org/jira/browse/SPARK-16664 and might be a > latent bug elsewhere; it's also easy to find with IJ inspections. Worth > patching up. > To illustrate the general problem, {{map}} happens to work in Scala where the > collection isn't lazy, but will fail to execute the code when it is. {{map}} > also causes a collection of {{Unit}} to be created pointlessly. > {code} > scala> val foo = Seq(1,2,3) > foo: Seq[Int] = List(1, 2, 3) > scala> foo.map(println) > 1 > 2 > 3 > res0: Seq[Unit] = List((), (), ()) > scala> foo.view.map(println) > res1: scala.collection.SeqView[Unit,Seq[_]] = SeqViewM(...) > scala> foo.view.foreach(println) > 1 > 2 > 3 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16694) Use for/foreach rather than map for Unit expressions whose side effects are required
Sean Owen created SPARK-16694: - Summary: Use for/foreach rather than map for Unit expressions whose side effects are required Key: SPARK-16694 URL: https://issues.apache.org/jira/browse/SPARK-16694 Project: Spark Issue Type: Improvement Components: Examples, MLlib, Spark Core, SQL, Streaming Reporter: Sean Owen Assignee: Sean Owen Priority: Minor {{map}} is misused in many places where {{foreach}} is intended. This caused a bug in https://issues.apache.org/jira/browse/SPARK-16664 and might be a latent bug elsewhere; it's also easy to find with IJ inspections. Worth patching up. To illustrate the general problem, {{map}} happens to work in Scala where the collection isn't lazy, but will fail to execute the code when it is. {{map}} also causes a collection of {{Unit}} to be created pointlessly. {code} scala> val foo = Seq(1,2,3) foo: Seq[Int] = List(1, 2, 3) scala> foo.map(println) 1 2 3 res0: Seq[Unit] = List((), (), ()) scala> foo.view.map(println) res1: scala.collection.SeqView[Unit,Seq[_]] = SeqViewM(...) scala> foo.view.foreach(println) 1 2 3 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org