[jira] [Created] (SPARK-13582) Improve performance of parquet reader with dictionary encoding

2016-02-29 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13582:
--

 Summary: Improve performance of parquet reader with dictionary 
encoding
 Key: SPARK-13582
 URL: https://issues.apache.org/jira/browse/SPARK-13582
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu


Right now, we replace the ids with values from a dictionary before accessing a 
column. We could defer that lookup: when some rows are filtered out, we would 
never need to consult the dictionary for those rows.
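
As an illustration, here is a minimal standalone sketch of the idea (the class and 
method names are made up for this example; this is not the actual reader code):

{code}
// Hypothetical illustration of deferred dictionary decoding: keep the ids and
// the dictionary, and decode a value only when a surviving row reads it.
class DictionaryColumn(ids: Array[Int], dictionary: Array[String]) {
  // Called only for rows that survive filtering, so filtered-out rows never
  // pay for a dictionary lookup.
  def getString(row: Int): String = dictionary(ids(row))
}

object DeferredDecodingDemo {
  def main(args: Array[String]): Unit = {
    val dict = Array("apple", "banana", "cherry")
    val ids = Array(2, 0, 1, 0, 2)
    val col = new DictionaryColumn(ids, dict)
    // Suppose a predicate keeps only rows 0 and 4: only two lookups happen.
    Seq(0, 4).foreach(r => println(col.getString(r)))
  }
}
{code}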



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13541) Flaky test: ParquetHadoopFsRelationSuite.test all data types - ByteType

2016-02-28 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171216#comment-15171216
 ] 

Davies Liu commented on SPARK-13541:



[~nongli]  Since the test failed in the vectorized parquet reader, I assigned 
this to you.

> Flaky test: ParquetHadoopFsRelationSuite.test all data types - ByteType
> ---
>
> Key: SPARK-13541
> URL: https://issues.apache.org/jira/browse/SPARK-13541
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Davies Liu
>Assignee: Nong Li
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52140/testReport/org.apache.spark.sql.sources/ParquetHadoopFsRelationSuite/test_all_data_types___ByteType/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-13415) Visualize subquery in SQL web UI

2016-02-28 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-13415:
---
Comment: was deleted

(was: 
[~nongli]  Since the test failed in the vectorized parquet reader, I assigned 
this to you.)

> Visualize subquery in SQL web UI
> 
>
> Key: SPARK-13415
> URL: https://issues.apache.org/jira/browse/SPARK-13415
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Davies Liu
>
> Right now, uncorrelated scalar subqueries are not shown in the SQL tab.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13415) Visualize subquery in SQL web UI

2016-02-28 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171215#comment-15171215
 ] 

Davies Liu commented on SPARK-13415:



[~nongli]  Since the test failed in the vectorized parquet reader, I assigned 
this to you.

> Visualize subquery in SQL web UI
> 
>
> Key: SPARK-13415
> URL: https://issues.apache.org/jira/browse/SPARK-13415
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Davies Liu
>
> Right now, uncorrelated scalar subqueries are not shown in the SQL tab.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13541) Flaky test: ParquetHadoopFsRelationSuite.test all data types - ByteType

2016-02-28 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13541:
--

 Summary: Flaky test: ParquetHadoopFsRelationSuite.test all data 
types - ByteType
 Key: SPARK-13541
 URL: https://issues.apache.org/jira/browse/SPARK-13541
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Davies Liu
Assignee: Nong Li


https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52140/testReport/org.apache.spark.sql.sources/ParquetHadoopFsRelationSuite/test_all_data_types___ByteType/




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13415) Visualize subquery in SQL web UI

2016-02-28 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-13415:
--

Assignee: Davies Liu

> Visualize subquery in SQL web UI
> 
>
> Key: SPARK-13415
> URL: https://issues.apache.org/jira/browse/SPARK-13415
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Right now, uncorrelated scalar subqueries are not shown in the SQL tab.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13523) Reuse the exchanges in a query

2016-02-26 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13523:
--

 Summary: Reuse the exchanges in a query
 Key: SPARK-13523
 URL: https://issues.apache.org/jira/browse/SPARK-13523
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Davies Liu


In an exchange, the RDD is materialized (shuffled or collected), so it is a good 
point at which to eliminate common parts of a query.

In some TPCDS queries (for example, Q64), the same exchange (ShuffleExchange or 
BroadcastExchange) could be used multiple times; we should re-use it to avoid 
the duplicated work and reduce the memory needed for broadcasts.
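
A standalone sketch of the idea (toy plan classes, not Spark's actual rule): key 
each exchange by a canonical form and replace later duplicates with a reference to 
the first occurrence, so it is materialized only once.

{code}
// Toy plan nodes; a real version would canonicalize plans so that cosmetic
// differences such as exprIds do not prevent a match.
sealed trait Plan { def children: Seq[Plan] }
case class Scan(table: String) extends Plan { val children = Nil }
case class Exchange(child: Plan) extends Plan { val children = Seq(child) }
case class Reused(original: Exchange) extends Plan { val children = Nil }
case class Join(left: Plan, right: Plan) extends Plan { val children = Seq(left, right) }

object ReuseExchangeSketch {
  def apply(plan: Plan,
            seen: collection.mutable.Map[Plan, Exchange] = collection.mutable.Map()): Plan =
    plan match {
      case e: Exchange =>
        seen.get(e) match {
          case Some(first) => Reused(first)   // same exchange seen before: reuse it
          case None        => seen(e) = e; e
        }
      case Join(l, r) => Join(apply(l, seen), apply(r, seen))
      case other      => other
    }

  def main(args: Array[String]): Unit = {
    val plan = Join(Exchange(Scan("store_sales")), Exchange(Scan("store_sales")))
    println(ReuseExchangeSketch(plan))
    // Join(Exchange(Scan(store_sales)),Reused(Exchange(Scan(store_sales))))
  }
}
{code}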



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13465) Add a task failure listener to TaskContext

2016-02-26 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13465.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11340
[https://github.com/apache/spark/pull/11340]

> Add a task failure listener to TaskContext
> --
>
> Key: SPARK-13465
> URL: https://issues.apache.org/jira/browse/SPARK-13465
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> TaskContext supports task completion callback, which gets called regardless 
> of task failures. However, there is no way for the listener to know if there 
> is an error. This ticket proposes adding a new listener that gets called when 
> a task fails.
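
For reference, a rough usage sketch (addTaskFailureListener is the method this 
ticket ended up adding for 2.0.0, but verify the exact signature against your 
Spark version):

{code}
import org.apache.spark.{SparkConf, SparkContext, TaskContext}

object TaskFailureListenerSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sketch").setMaster("local[2]"))
    try {
      sc.parallelize(1 to 10).foreach { i =>
        TaskContext.get().addTaskFailureListener { (ctx: TaskContext, error: Throwable) =>
          // Runs only when the task fails, unlike a completion listener,
          // which runs regardless of success or failure.
          System.err.println(s"task ${ctx.taskAttemptId()} failed: ${error.getMessage}")
        }
        if (i == 7) throw new RuntimeException("boom")
      }
    } catch {
      case e: Exception => println(s"job failed as expected: ${e.getMessage}")
    } finally {
      sc.stop()
    }
  }
}
{code}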



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13499) Optimize vectorized parquet reader for dictionary encoded data and RLE decoding

2016-02-26 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13499.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11375
[https://github.com/apache/spark/pull/11375]

> Optimize vectorized parquet reader for dictionary encoded data and RLE 
> decoding
> ---
>
> Key: SPARK-13499
> URL: https://issues.apache.org/jira/browse/SPARK-13499
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Nong Li
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12313) getPartitionsByFilter doesnt handle predicates on all / multiple Partition Columns

2016-02-26 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-12313.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11328
[https://github.com/apache/spark/pull/11328]

> getPartitionsByFilter doesnt handle predicates on all / multiple Partition 
> Columns
> --
>
> Key: SPARK-12313
> URL: https://issues.apache.org/jira/browse/SPARK-12313
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Gobinathan SP
>Priority: Critical
> Fix For: 2.0.0
>
>
> When spark.sql.hive.metastorePartitionPruning is enabled, 
> getPartitionsByFilter is used.
> For a table partitioned by p1 and p2, when hc.sql("select col 
> from tabl1 where p1='p1V' and p2= 'p2V' ") is triggered,
> the HiveShim identifies the predicates and ConvertFilters returns p1='p1V' 
> and col2= 'p2V'. The same is passed to the getPartitionsByFilter method as 
> the filter string.
> In these cases the partitions are not returned from Hive's 
> getPartitionsByFilter method. As a result, the number of rows returned by 
> the SQL is always zero.
> However, a filter on a single column always works; probably it doesn't come 
> through this route.
> I'm using Oracle for Metastore V0.13.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13476) Generate does not always output UnsafeRow

2016-02-25 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13476.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11354
[https://github.com/apache/spark/pull/11354]

> Generate does not always output UnsafeRow
> -
>
> Key: SPARK-13476
> URL: https://issues.apache.org/jira/browse/SPARK-13476
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> Generate does not output UnsafeRow when join is true



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13376) Improve column pruning

2016-02-25 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13376.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11354
[https://github.com/apache/spark/pull/11354]

> Improve column pruning
> --
>
> Key: SPARK-13376
> URL: https://issues.apache.org/jira/browse/SPARK-13376
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> Column pruning could help to skip columns that are not used by any 
> following operators.
> The current implementation only works with a few logical plans; we should 
> improve it to support all of them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13250) Make vectorized parquet reader work as the build side of a broadcast join

2016-02-24 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13250.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11141
[https://github.com/apache/spark/pull/11141]

> Make vectorized parquet reader work as the build side of a broadcast join
> -
>
> Key: SPARK-13250
> URL: https://issues.apache.org/jira/browse/SPARK-13250
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Nong Li
> Fix For: 2.0.0
>
>
> The issue is that the build side requires unsafe rows in certain 
> optimizations. The vectorized parquet reader explicitly does not want to 
> produce unsafe rows in general.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13476) Generate does not always output UnsafeRow

2016-02-24 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13476:
--

 Summary: Generate does not always output UnsafeRow
 Key: SPARK-13476
 URL: https://issues.apache.org/jira/browse/SPARK-13476
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu


Generate does not output UnsafeRow when join is true



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13467) abstract python function to simplify pyspark code

2016-02-24 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13467.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11342
[https://github.com/apache/spark/pull/11342]

> abstract python function to simplify pyspark code
> -
>
> Key: SPARK-13467
> URL: https://issues.apache.org/jira/browse/SPARK-13467
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Priority: Trivial
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13431) Maven build fails due to: Method code too large! in Catalyst

2016-02-23 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13431.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11331
[https://github.com/apache/spark/pull/11331]

> Maven build fails due to: Method code too large! in Catalyst
> 
>
> Key: SPARK-13431
> URL: https://issues.apache.org/jira/browse/SPARK-13431
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Stavros Kontopoulos
>Priority: Blocker
> Fix For: 2.0.0
>
>
> Cannot build the project when running the normal build commands, e.g.:
> {code}
> build/mvn -Phadoop-2.6 -Dhadoop.version=2.6.0  clean package
> ./make-distribution.sh --name test --tgz -Phadoop-2.6 
> {code}
> Integration builds are also failing: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/229/console
> https://ci.typesafe.com/job/mit-docker-test-zk-ref/12/console
> It looks like this is the commit that introduced the issue:
> https://github.com/apache/spark/commit/7925071280bfa1570435bde3e93492eaf2167d56



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13376) Improve column pruning

2016-02-23 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13376.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11256
[https://github.com/apache/spark/pull/11256]

> Improve column pruning
> --
>
> Key: SPARK-13376
> URL: https://issues.apache.org/jira/browse/SPARK-13376
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> Column pruning could help to skip columns that are not used by any 
> following operators.
> The current implementation only works with a few logical plans; we should 
> improve it to support all of them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13463) Support Column pruning for Dataset logical plan

2016-02-23 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13463:
--

 Summary: Support Column pruning for Dataset logical plan
 Key: SPARK-13463
 URL: https://issues.apache.org/jira/browse/SPARK-13463
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu


Column pruning may not work with some logical plans for Dataset; we need to check 
that and make sure they all work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13373) Generate code for sort merge join

2016-02-23 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13373.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11248
[https://github.com/apache/spark/pull/11248]

> Generate code for sort merge join
> -
>
> Key: SPARK-13373
> URL: https://issues.apache.org/jira/browse/SPARK-13373
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13431) Maven build fails due to: Method code too large! in Catalyst

2016-02-23 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15159682#comment-15159682
 ] 

Davies Liu commented on SPARK-13431:


https://github.com/apache/spark/pull/11331

> Maven build fails due to: Method code too large! in Catalyst
> 
>
> Key: SPARK-13431
> URL: https://issues.apache.org/jira/browse/SPARK-13431
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Stavros Kontopoulos
>Priority: Blocker
>
> Cannot build the project when running the normal build commands, e.g.:
> {code}
> build/mvn -Phadoop-2.6 -Dhadoop.version=2.6.0  clean package
> ./make-distribution.sh --name test --tgz -Phadoop-2.6 
> {code}
> Integration builds are also failing: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/229/console
> https://ci.typesafe.com/job/mit-docker-test-zk-ref/12/console
> It looks like this is the commit that introduced the issue:
> https://github.com/apache/spark/commit/7925071280bfa1570435bde3e93492eaf2167d56



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13431) Maven build fails due to: Method code too large! in Catalyst

2016-02-23 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15159670#comment-15159670
 ] 

Davies Liu commented on SPARK-13431:


I'd like to split ExpressionParser.g, or we can't touch it anymore (it may break 
the sbt build next time), unless we can switch to ANTLR4 soon.

> Maven build fails due to: Method code too large! in Catalyst
> 
>
> Key: SPARK-13431
> URL: https://issues.apache.org/jira/browse/SPARK-13431
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Stavros Kontopoulos
>Priority: Blocker
>
> Cannot build the project when running the normal build commands, e.g.:
> {code}
> build/mvn -Phadoop-2.6 -Dhadoop.version=2.6.0  clean package
> ./make-distribution.sh --name test --tgz -Phadoop-2.6 
> {code}
> Integration builds are also failing: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/229/console
> https://ci.typesafe.com/job/mit-docker-test-zk-ref/12/console
> It looks like this is the commit that introduced the issue:
> https://github.com/apache/spark/commit/7925071280bfa1570435bde3e93492eaf2167d56



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13329) Considering output for statistics of logical plan

2016-02-23 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13329.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11210
[https://github.com/apache/spark/pull/11210]

> Considering output for statistics of logical plan
> -
>
> Key: SPARK-13329
> URL: https://issues.apache.org/jira/browse/SPARK-13329
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> The current implementation of statistics for UnaryNode does not consider the 
> output (for example, of a Project); we should consider it to make a better 
> estimate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13358) Retrieve grep path when doing Benchmark

2016-02-23 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13358.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11231
[https://github.com/apache/spark/pull/11231]

> Retrieve grep path when doing Benchmark
> ---
>
> Key: SPARK-13358
> URL: https://issues.apache.org/jira/browse/SPARK-13358
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Priority: Minor
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13306) Uncorrelated scalar subquery

2016-02-20 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13306.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11190
[https://github.com/apache/spark/pull/11190]

> Uncorrelated scalar subquery
> 
>
> Key: SPARK-13306
> URL: https://issues.apache.org/jira/browse/SPARK-13306
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> A scalar subquery is a subquery that generates only a single row and a single 
> column, and can be used as part of an expression.
> An uncorrelated scalar subquery is one that has no reference to an outer 
> table.
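
For example (illustrative only; sqlContext and the tables are assumed to exist in 
the session), an uncorrelated scalar subquery produces one value that the outer 
query can use wherever an expression is allowed:

{code}
// The inner query returns exactly one row and one column and does not
// reference the outer query, so it is an uncorrelated scalar subquery.
val enriched = sqlContext.sql(
  """SELECT key,
    |       value,
    |       (SELECT max(amount) FROM payments) AS global_max
    |FROM   orders
  """.stripMargin)
enriched.show()
{code}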



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13310) Missing Sorting Columns in Generate

2016-02-20 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13310.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11198
[https://github.com/apache/spark/pull/11198]

> Missing Sorting Columns in Generate
> ---
>
> Key: SPARK-13310
> URL: https://issues.apache.org/jira/browse/SPARK-13310
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
> Fix For: 2.0.0
>
>
> {code}
>   // case 1: missing sort columns are resolvable if join is true
>   sql("SELECT explode(a) AS val, b FROM data WHERE b < 2 order by val, c")
>   // case 2: missing sort columns are not resolvable if join is false. 
> Thus, issue a message in this case
>   sql("SELECT explode(a) AS val FROM data order by val, c")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13415) Visualize subquery on SQL tab

2016-02-20 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13415:
--

 Summary: Visualize subquery on SQL tab
 Key: SPARK-13415
 URL: https://issues.apache.org/jira/browse/SPARK-13415
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu


Right now, uncorrelated scalar subqueries are not shown in the SQL tab.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13408) Exception in resultHandler will shutdown SparkContext

2016-02-19 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13408.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11280
[https://github.com/apache/spark/pull/11280]

> Exception in resultHandler will shutdown SparkContext
> -
>
> Key: SPARK-13408
> URL: https://issues.apache.org/jira/browse/SPARK-13408
> Project: Spark
>  Issue Type: Bug
>Reporter: Davies Liu
>Assignee: Shixiong Zhu
> Fix For: 2.0.0
>
>
> {code}
> davies@localhost:~/work/spark$ bin/spark-submit 
> python/pyspark/sql/dataframe.py
> NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes 
> ahead of assembly.
> 16/02/19 12:46:24 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 16/02/19 12:46:24 WARN Utils: Your hostname, localhost resolves to a loopback 
> address: 127.0.0.1; using 192.168.0.143 instead (on interface en0)
> 16/02/19 12:46:24 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> **
> File 
> "/Users/davies/work/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", 
> line 554, in pyspark.sql.dataframe.DataFrame.alias
> Failed example:
> joined_df.select(col("df_as1.name"), col("df_as2.name"), 
> col("df_as2.age")).collect()
> Differences (ndiff with -expected +actual):
> - [Row(name=u'Bob', name=u'Bob', age=5), Row(name=u'Alice', 
> name=u'Alice', age=2)]
> + [Row(name=u'Alice', name=u'Alice', age=2), Row(name=u'Bob', 
> name=u'Bob', age=5)]
> org.apache.spark.SparkDriverExecutionException: Execution error
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1189)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1658)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1620)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1609)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148)
>   at scala.math.Ordering$$anon$4.compare(Ordering.scala:111)
>   at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:649)
>   at java.util.PriorityQueue.siftUp(PriorityQueue.java:627)
>   at java.util.PriorityQueue.offer(PriorityQueue.java:329)
>   at 
> org.apache.spark.util.BoundedPriorityQueue.$plus$eq(BoundedPriorityQueue.scala:47)
>   at 
> org.apache.spark.util.BoundedPriorityQueue$$anonfun$$plus$plus$eq$1.apply(BoundedPriorityQueue.scala:41)
>   at 
> org.apache.spark.util.BoundedPriorityQueue$$anonfun$$plus$plus$eq$1.apply(BoundedPriorityQueue.scala:41)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at 
> org.apache.spark.util.BoundedPriorityQueue.foreach(BoundedPriorityQueue.scala:31)
>   at 
> org.apache.spark.util.BoundedPriorityQueue.$plus$plus$eq(BoundedPriorityQueue.scala:41)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$apply$46.apply(RDD.scala:1319)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$apply$46.apply(RDD.scala:1318)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:932)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:929)
>   at 
> org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:57)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1185)
>   ... 4 more
> org.apache.spark.SparkDriverExecutionException: Execution error
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1189)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1658)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1620)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1609)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> Caused by: java.lang.NullPointerException
>   at 
> 

[jira] [Resolved] (SPARK-13304) Broadcast join with two ints could be very slow

2016-02-19 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13304.

   Resolution: Fixed
 Assignee: Davies Liu
Fix Version/s: 2.0.0

Fixed by https://github.com/apache/spark/pull/11130

> Broadcast join with two ints could be very slow
> ---
>
> Key: SPARK-13304
> URL: https://issues.apache.org/jira/browse/SPARK-13304
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> If the two join columns have the same value, their combined hash code will be (a 
> ^ b), which is 0, and then the HashMap will be very, very slow.
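
A tiny worked example of the problem (plain Scala, not the actual join code; the 
mixing function is just one illustrative alternative):

{code}
object XorHashDemo {
  def xorHash(a: Int, b: Int): Int = a ^ b        // collapses to 0 when a == b
  def mixedHash(a: Int, b: Int): Int = a * 31 + b // keeps equal pairs spread out

  def main(args: Array[String]): Unit = {
    val keys = (1 to 5).map(i => (i, i))  // join keys where both columns are equal
    println(keys.map { case (a, b) => xorHash(a, b) })    // Vector(0, 0, 0, 0, 0)
    println(keys.map { case (a, b) => mixedHash(a, b) })  // all distinct values
    // With xorHash every key lands in the same bucket, so the hash map
    // degenerates into one long collision chain.
  }
}
{code}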



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13409) Log the stacktrace when stopping a SparkContext

2016-02-19 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15155439#comment-15155439
 ] 

Davies Liu commented on SPARK-13409:


[~rxin] I think we should remember the stacktrace and put it in the error message 
raised when the stopped SparkContext is accessed; this would be much more useful 
than just logging it (the log is hard to find).



We already did similar things when creating a SparkContext.

> Log the stacktrace when stopping a SparkContext
> ---
>
> Key: SPARK-13409
> URL: https://issues.apache.org/jira/browse/SPARK-13409
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Davies Liu
>
> Sometimes we see a stopped SparkContext but have no idea what stopped it; we 
> should log that for troubleshooting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13213) BroadcastNestedLoopJoin is very slow

2016-02-19 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15155438#comment-15155438
 ] 

Davies Liu commented on SPARK-13213:


It depends, I'm open to any reasonable solution.

> BroadcastNestedLoopJoin is very slow
> 
>
> Key: SPARK-13213
> URL: https://issues.apache.org/jira/browse/SPARK-13213
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>
> Since we have improved the performance of CartesianProduct, which should be 
> faster and more robust than BroadcastNestedLoopJoin, we should do a 
> CartesianProduct instead of a BroadcastNestedLoopJoin, especially when the 
> broadcasted table is not that small.
> Today we hit a query that took a very long time and still had not finished; once 
> we decreased the threshold for broadcast (disabling BroadcastNestedLoopJoin), it 
> finished in seconds.
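
A sketch of the workaround mentioned above (spark.sql.autoBroadcastJoinThreshold 
is a standard Spark SQL setting; the query below is hypothetical):

{code}
// Setting the threshold to -1 disables automatic broadcasting, so the planner
// falls back to a non-broadcast strategy for joins it would otherwise broadcast.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")

// Re-run the slow non-equi join; it should no longer pick a broadcast
// nested loop join.
val result = sqlContext.sql("SELECT * FROM a JOIN b ON a.x < b.y")
result.count()
{code}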



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12567) Add aes_{encrypt,decrypt} UDFs

2016-02-19 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-12567.

   Resolution: Fixed
 Assignee: Kai Jiang
Fix Version/s: 2.0.0

> Add aes_{encrypt,decrypt} UDFs
> --
>
> Key: SPARK-12567
> URL: https://issues.apache.org/jira/browse/SPARK-12567
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Kai Jiang
>Assignee: Kai Jiang
> Fix For: 2.0.0
>
>
> AES (Advanced Encryption Standard) algorithm.
> Add aes_encrypt and aes_decrypt UDFs.
> Ref:
> [Hive|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Misc.Functions]
> [MySQL|https://dev.mysql.com/doc/refman/5.5/en/encryption-functions.html#function_aes-decrypt]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12594) Outer Join Elimination by Filter Condition

2016-02-19 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-12594.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10567
[https://github.com/apache/spark/pull/10567]

> Outer Join Elimination by Filter Condition
> --
>
> Key: SPARK-12594
> URL: https://issues.apache.org/jira/browse/SPARK-12594
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer, SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>Priority: Critical
> Fix For: 2.0.0
>
>
> Outer joins can be eliminated if the predicates in the filter condition 
> restrict the result set so that all null-supplying rows are eliminated:
> - full outer -> inner if both sides have such predicates
> - left outer -> inner if the right side has such predicates
> - right outer -> inner if the left side has such predicates
> - full outer -> left outer if only the left side has such predicates
> - full outer -> right outer if only the right side has such predicates
> If applicable, this can greatly improve the performance. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13409) Remember the stacktrace when stopping a SparkContext

2016-02-19 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-13409:
---
Component/s: Spark Core

> Remember the stacktrace when stopping a SparkContext
> 
>
> Key: SPARK-13409
> URL: https://issues.apache.org/jira/browse/SPARK-13409
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Davies Liu
>
> Sometimes we see a stopped SparkContext but have no idea what stopped it; we 
> should remember that for troubleshooting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13409) Remember the stacktrace when stopping a SparkContext

2016-02-19 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13409:
--

 Summary: Remember the stacktrace when stopping a SparkContext
 Key: SPARK-13409
 URL: https://issues.apache.org/jira/browse/SPARK-13409
 Project: Spark
  Issue Type: Bug
Reporter: Davies Liu


Sometimes we see a stopped SparkContext but have no idea what stopped it; we 
should remember that for troubleshooting.
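
A standalone sketch of the idea (the class and method names below are made up, 
not Spark's actual implementation): capture a stack trace at stop() time and 
attach it to the error raised when the stopped context is used again.

{code}
class StoppableContext {
  @volatile private var stopSite: Option[Exception] = None

  def stop(): Unit = {
    // Creating an exception records the current stack trace without throwing it.
    stopSite = Some(new Exception("The context was stopped here:"))
  }

  def assertNotStopped(): Unit = stopSite.foreach { site =>
    // Attach where stop() was called as the cause, so the user sees both traces.
    throw new IllegalStateException("Cannot call methods on a stopped context.", site)
  }
}

object StopTraceDemo {
  def main(args: Array[String]): Unit = {
    val ctx = new StoppableContext
    ctx.stop()
    ctx.assertNotStopped()  // the cause of this error shows who stopped the context
  }
}
{code}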



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13408) Exception in resultHandler will shutdown SparkContext

2016-02-19 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13408:
--

 Summary: Exception in resultHandler will shutdown SparkContext
 Key: SPARK-13408
 URL: https://issues.apache.org/jira/browse/SPARK-13408
 Project: Spark
  Issue Type: Bug
Reporter: Davies Liu
Assignee: Shixiong Zhu


{code}
davies@localhost:~/work/spark$ bin/spark-submit python/pyspark/sql/dataframe.py
NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes 
ahead of assembly.
16/02/19 12:46:24 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
16/02/19 12:46:24 WARN Utils: Your hostname, localhost resolves to a loopback 
address: 127.0.0.1; using 192.168.0.143 instead (on interface en0)
16/02/19 12:46:24 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another 
address
**
File 
"/Users/davies/work/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", 
line 554, in pyspark.sql.dataframe.DataFrame.alias
Failed example:
joined_df.select(col("df_as1.name"), col("df_as2.name"), 
col("df_as2.age")).collect()
Differences (ndiff with -expected +actual):
- [Row(name=u'Bob', name=u'Bob', age=5), Row(name=u'Alice', name=u'Alice', 
age=2)]
+ [Row(name=u'Alice', name=u'Alice', age=2), Row(name=u'Bob', name=u'Bob', 
age=5)]
org.apache.spark.SparkDriverExecutionException: Execution error
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1189)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1658)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1620)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1609)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Caused by: java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157)
at 
org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148)
at scala.math.Ordering$$anon$4.compare(Ordering.scala:111)
at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:649)
at java.util.PriorityQueue.siftUp(PriorityQueue.java:627)
at java.util.PriorityQueue.offer(PriorityQueue.java:329)
at 
org.apache.spark.util.BoundedPriorityQueue.$plus$eq(BoundedPriorityQueue.scala:47)
at 
org.apache.spark.util.BoundedPriorityQueue$$anonfun$$plus$plus$eq$1.apply(BoundedPriorityQueue.scala:41)
at 
org.apache.spark.util.BoundedPriorityQueue$$anonfun$$plus$plus$eq$1.apply(BoundedPriorityQueue.scala:41)
at scala.collection.Iterator$class.foreach(Iterator.scala:742)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at 
org.apache.spark.util.BoundedPriorityQueue.foreach(BoundedPriorityQueue.scala:31)
at 
org.apache.spark.util.BoundedPriorityQueue.$plus$plus$eq(BoundedPriorityQueue.scala:41)
at 
org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$apply$46.apply(RDD.scala:1319)
at 
org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$apply$46.apply(RDD.scala:1318)
at 
org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:932)
at 
org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:929)
at 
org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:57)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1185)
... 4 more
org.apache.spark.SparkDriverExecutionException: Execution error
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1189)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1658)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1620)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1609)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Caused by: java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157)
at 
org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148)
at scala.math.Ordering$$anon$4.compare(Ordering.scala:111)
at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:649)
at java.util.PriorityQueue.siftUp(PriorityQueue.java:627)
at 

[jira] [Created] (SPARK-13406) NPE in LazilyGeneratedOrdering

2016-02-19 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13406:
--

 Summary: NPE in LazilyGeneratedOrdering
 Key: SPARK-13406
 URL: https://issues.apache.org/jira/browse/SPARK-13406
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu
Assignee: Josh Rosen


{code}
File 
"/Users/davies/work/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", 
line ?, in pyspark.sql.dataframe.DataFrameStatFunctions.sampleBy
Failed example:
sampled.groupBy("key").count().orderBy("key").show()
Exception raised:
Traceback (most recent call last):
  File "//anaconda/lib/python2.7/doctest.py", line 1315, in __run
compileflags, 1) in test.globs
  File "", line 1, in 
sampled.groupBy("key").count().orderBy("key").show()
  File 
"/Users/davies/work/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", 
line 217, in show
print(self._jdf.showString(n, truncate))
  File 
"/Users/davies/work/spark/python/lib/py4j-0.9.1-src.zip/py4j/java_gateway.py", 
line 835, in __call__
answer, self.gateway_client, self.target_id, self.name)
  File 
"/Users/davies/work/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 
45, in deco
return f(*a, **kw)
  File 
"/Users/davies/work/spark/python/lib/py4j-0.9.1-src.zip/py4j/protocol.py", line 
310, in get_return_value
format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o681.showString.
: org.apache.spark.SparkDriverExecutionException: Execution error
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1189)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1658)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1620)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1609)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:623)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1782)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:937)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:323)
at org.apache.spark.rdd.RDD.reduce(RDD.scala:919)
at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1.apply(RDD.scala:1318)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:323)
at org.apache.spark.rdd.RDD.takeOrdered(RDD.scala:1305)
at 
org.apache.spark.sql.execution.TakeOrderedAndProject.executeCollect(limit.scala:94)
at 
org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:157)
at 
org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1520)
at 
org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1520)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53)
at 
org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1769)
at 
org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1519)
at 
org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1526)
at 
org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1396)
at 
org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1395)
at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:1782)
at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1395)
at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1477)
at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:167)
at sun.reflect.GeneratedMethodAccessor63.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:290)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at 

[jira] [Created] (SPARK-13404) Create the variables for input when it's used

2016-02-19 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13404:
--

 Summary: Create the variables for input when it's used
 Key: SPARK-13404
 URL: https://issues.apache.org/jira/browse/SPARK-13404
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu


Right now, we create the variables in the first operator (usually 
InputAdapter); they are wasted if most of the rows are filtered out 
immediately afterwards.

We should defer creating them until they are used by the following operators.
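
A conceptual before/after sketch (hand-written Scala standing in for the 
generated code; the row layout and consume are illustrative):

{code}
object DeferredVariablesSketch {
  def consume(name: String, age: Int): Unit = println(s"$name, $age")

  // Eager: every column variable is created up front, even though rows that
  // fail the filter never use `name`.
  def processEager(row: (String, Int)): Unit = {
    val name = row._1
    val age  = row._2
    if (age > 21) consume(name, age)
  }

  // Deferred: evaluate only what the predicate needs; `name` is created only
  // for rows that survive the filter.
  def processDeferred(row: (String, Int)): Unit = {
    val age = row._2
    if (age > 21) {
      val name = row._1
      consume(name, age)
    }
  }

  def main(args: Array[String]): Unit = {
    Seq(("Alice", 30), ("Bob", 18)).foreach(processDeferred)
  }
}
{code}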



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13237) Generate broadcast outer join

2016-02-18 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13237.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11130
[https://github.com/apache/spark/pull/11130]

> Generate broadcast outer join
> -
>
> Key: SPARK-13237
> URL: https://issues.apache.org/jira/browse/SPARK-13237
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13376) Improve column pruning

2016-02-18 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13376:
--

 Summary: Improve column pruning
 Key: SPARK-13376
 URL: https://issues.apache.org/jira/browse/SPARK-13376
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu


Column pruning could help to skip columns that are not used by any following 
operators.

The current implementation only works with a few logical plans; we should 
improve it to support all of them.
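
A standalone toy sketch of what column pruning does (simplified plan classes, not 
Catalyst code): keep only the columns that some operator above actually needs.

{code}
sealed trait LogicalPlan { def output: Seq[String] }
case class Relation(output: Seq[String]) extends LogicalPlan
case class Filter(condition: String, references: Seq[String], child: LogicalPlan)
  extends LogicalPlan { def output: Seq[String] = child.output }

object ColumnPruningSketch {
  /** Drop columns that neither the parents nor the operator itself need. */
  def prune(plan: LogicalPlan, required: Seq[String]): LogicalPlan = plan match {
    case f @ Filter(_, refs, child) =>
      // The filter needs its own references in addition to the parent's columns.
      f.copy(child = prune(child, (required ++ refs).distinct))
    case Relation(cols) =>
      Relation(cols.filter(required.contains))
  }

  def main(args: Array[String]): Unit = {
    val plan = Filter("age > 21", Seq("age"),
      Relation(Seq("name", "age", "address", "phone")))
    // Only "name" is selected above the filter, so "address" and "phone" can go.
    println(prune(plan, required = Seq("name")))
    // Filter(age > 21,List(age),Relation(List(name, age)))
  }
}
{code}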



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13373) Generate code for sort merge join

2016-02-17 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13373:
--

 Summary: Generate code for sort merge join
 Key: SPARK-13373
 URL: https://issues.apache.org/jira/browse/SPARK-13373
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-13354) Push filter throughout outer join when the condition can filter out empty row

2016-02-17 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu closed SPARK-13354.
--
Resolution: Duplicate

> Push filter throughout outer join when the condition can filter out empty row 
> --
>
> Key: SPARK-13354
> URL: https://issues.apache.org/jira/browse/SPARK-13354
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> For a query
> {code}
> select * from a left outer join b on a.a = b.a where b.b > 10
> {code}
> The condition `b.b > 10` will filter out every row whose b part is 
> empty.
> In this case, we should use an inner join and push the filter down into b.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13354) Push filter throughout outer join when the condition can filter out empty row

2016-02-16 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13354:
--

 Summary: Push filter throughout outer join when the condition can 
filter out empty row 
 Key: SPARK-13354
 URL: https://issues.apache.org/jira/browse/SPARK-13354
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu



For a query

{code}

select * from a left outer join b on a.a = b.a where b.b > 10

{code}

The condition `b.b > 10` will filter out every row whose b part is 
empty.

In this case, we should use an inner join and push the filter down into b.
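
Illustrative only (sqlContext and tables a, b are assumed to exist): because 
`b.b > 10` can never be true for the null-extended rows an outer join adds, the 
two queries below return the same result, and the second form lets the filter be 
evaluated on b before the join.

{code}
val outer = sqlContext.sql(
  "SELECT * FROM a LEFT OUTER JOIN b ON a.a = b.a WHERE b.b > 10")

// Equivalent rewrite: an inner join with the filter pushed into b.
val rewritten = sqlContext.sql(
  "SELECT * FROM a JOIN (SELECT * FROM b WHERE b.b > 10) b ON a.a = b.a")
{code}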



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13353) Use UnsafeRowSerializer to collect DataFrame

2016-02-16 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13353:
--

 Summary: Use UnsafeRowSerializer to collect DataFrame
 Key: SPARK-13353
 URL: https://issues.apache.org/jira/browse/SPARK-13353
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu


UnsafeRowSerializer should be more efficient than JavaSerializer or 
KryoSerializer for DataFrames.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13352) BlockFetch does not scale well on large block

2016-02-16 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13352:
--

 Summary: BlockFetch does not scale well on large block
 Key: SPARK-13352
 URL: https://issues.apache.org/jira/browse/SPARK-13352
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Davies Liu


BlockManager.getRemoteBytes() performs poorly on large blocks:

{code}
  test("block manager") {
val N = 500 << 20
val bm = sc.env.blockManager
val blockId = TaskResultBlockId(0)
val buffer = ByteBuffer.allocate(N)
buffer.limit(N)
bm.putBytes(blockId, buffer, StorageLevel.MEMORY_AND_DISK_SER)
val result = bm.getRemoteBytes(blockId)
assert(result.isDefined)
assert(result.get.limit() === (N))
  }
{code}

Here are the runtimes for different block sizes:
{code}
50M    3 seconds
100M   7 seconds
250M   33 seconds
500M   2 min
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13351) Column pruning fails on expand

2016-02-16 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13351:
--

 Summary: Column pruning fails on expand
 Key: SPARK-13351
 URL: https://issues.apache.org/jira/browse/SPARK-13351
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu


The optimizer can't prune the columns in Expand.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13348) Avoid duplicated broadcasts

2016-02-16 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13348:
--

 Summary: Avoid duplicated broadcasts
 Key: SPARK-13348
 URL: https://issues.apache.org/jira/browse/SPARK-13348
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu


A broadcasted table could be used multiple times in a query; we should cache 
it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13347) Reuse the shuffle for duplicated exchange

2016-02-16 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13347:
--

 Summary: Reuse the shuffle for duplicated exchange
 Key: SPARK-13347
 URL: https://issues.apache.org/jira/browse/SPARK-13347
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu


In TPCDS query 47, the same exchange is used three times; we should re-use the 
ShuffleRowRDD to skip the duplicated stages.

{code}

 with v1 as(
 select i_category, i_brand,
s_store_name, s_company_name,
d_year, d_moy,
sum(ss_sales_price) sum_sales,
avg(sum(ss_sales_price)) over
  (partition by i_category, i_brand,
 s_store_name, s_company_name, d_year)
  avg_monthly_sales,
rank() over
  (partition by i_category, i_brand,
 s_store_name, s_company_name
   order by d_year, d_moy) rn
 from item, store_sales, date_dim, store
 where ss_item_sk = i_item_sk and
   ss_sold_date_sk = d_date_sk and
   ss_store_sk = s_store_sk and
   (
 d_year = 1999 or
 ( d_year = 1999-1 and d_moy =12) or
 ( d_year = 1999+1 and d_moy =1)
   )
 group by i_category, i_brand,
  s_store_name, s_company_name,
  d_year, d_moy),
 v2 as(
 select v1.i_category, v1.i_brand, v1.s_store_name, v1.s_company_name, 
v1.d_year,
 v1.d_moy, v1.avg_monthly_sales ,v1.sum_sales, 
v1_lag.sum_sales psum,
 v1_lead.sum_sales nsum
 from v1, v1 v1_lag, v1 v1_lead
 where v1.i_category = v1_lag.i_category and
   v1.i_category = v1_lead.i_category and
   v1.i_brand = v1_lag.i_brand and
   v1.i_brand = v1_lead.i_brand and
   v1.s_store_name = v1_lag.s_store_name and
   v1.s_store_name = v1_lead.s_store_name and
   v1.s_company_name = v1_lag.s_company_name and
   v1.s_company_name = v1_lead.s_company_name and
   v1.rn = v1_lag.rn + 1 and
   v1.rn = v1_lead.rn - 1)
 select * from v2
 where  d_year = 1999 and
avg_monthly_sales > 0 and
case when avg_monthly_sales > 0 then abs(sum_sales - avg_monthly_sales) 
/ avg_monthly_sales else null end > 0.1
 order by sum_sales - avg_monthly_sales, 3
 limit 100
{code}

Since a SparkPlan is just a tree (not a DAG), we can only do this in 
SparkPlan.execute() or in a final rule.

We also need a way to compare two SparkPlans and decide whether they produce the 
same result (they may have different exprIds, so we should compare them after 
binding).

A quick experiment showed that we could get a 2X improvement on this query.
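
A toy sketch of that comparison (standalone Scala, not Catalyst's actual 
sameResult): canonicalize the attribute ids before comparing, so plans that 
differ only in exprIds are treated as producing the same result.

{code}
case class Attr(name: String, id: Long)
case class ToyPlan(outputs: Seq[Attr], table: String)

object SameResultSketch {
  /** Renumber attribute ids positionally so cosmetic differences disappear. */
  def canonicalize(plan: ToyPlan): ToyPlan =
    plan.copy(outputs = plan.outputs.zipWithIndex.map { case (a, i) => a.copy(id = i) })

  def sameResult(a: ToyPlan, b: ToyPlan): Boolean = canonicalize(a) == canonicalize(b)

  def main(args: Array[String]): Unit = {
    val p1 = ToyPlan(Seq(Attr("i_brand", 101L), Attr("sum_sales", 102L)), "v1")
    val p2 = ToyPlan(Seq(Attr("i_brand", 207L), Attr("sum_sales", 208L)), "v1")
    println(p1 == p2)            // false: the exprIds differ
    println(sameResult(p1, p2))  // true: same result, so the shuffle can be reused
  }
}
{code}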



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13329) Considering output for statistics of logical plan

2016-02-15 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13329:
--

 Summary: Considering output for statistics of logical plan
 Key: SPARK-13329
 URL: https://issues.apache.org/jira/browse/SPARK-13329
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu


The current implementation of statistics for UnaryNode does not consider the 
output (for example, Project); we should consider it to make a better estimate.
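
An illustrative sketch of the idea (toy types, not the actual Statistics code): 
scale the child's size estimate by the ratio of the output row width to the 
child row width, instead of passing it through unchanged.

{code}
case class AttrStat(name: String, sizeInBytes: Long)

def estimateSizeInBytes(childSizeInBytes: BigInt,
                        childOutput: Seq[AttrStat],
                        output: Seq[AttrStat]): BigInt = {
  val childRowWidth  = childOutput.map(_.sizeInBytes).sum max 1L
  val outputRowWidth = output.map(_.sizeInBytes).sum max 1L
  // Never return 0, so downstream decisions (e.g. broadcast thresholds) stay sane.
  (childSizeInBytes * outputRowWidth / childRowWidth) max 1
}
{code}

For a Project that keeps only a few narrow columns, this shrinks the estimate 
accordingly instead of repeating the child's full size.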



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13323) Type cast support in type inference during merging types.

2016-02-15 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15147591#comment-15147591
 ] 

Davies Liu commented on SPARK-13323:


HiveTypeCoercion is pretty complicated; we may not want to duplicate that in 
Python.

What's the problem right now? Or is this just because of the TODO?

> Type cast support in type inference during merging types.
> -
>
> Key: SPARK-13323
> URL: https://issues.apache.org/jira/browse/SPARK-13323
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> As described in {{types.py}}, there is a todo {{TODO: type cast (such as int 
> -> long)}}.
> Currently, PySpark infers types but does not try to find compatible types 
> when the given types are different during merging schemas.
> I think this can be done by mirroring 
> {{HiveTypeCoercion.findTightestCommonTypeOfTwo}} for numbers, and when either 
> of them is compared to {{StringType}}, just convert them into a string.
> It looks the possible leaf data types are below:
> {code}
> # Mapping Python types to Spark SQL DataType
> _type_mappings = {
> type(None): NullType,
> bool: BooleanType,
> int: LongType,
> float: DoubleType,
> str: StringType,
> bytearray: BinaryType,
> decimal.Decimal: DecimalType,
> datetime.date: DateType,
> datetime.datetime: TimestampType,
> datetime.time: TimestampType,
> }
> {code}
> and they are converted pretty well to string as below:
> {code}
> >>> print str(None)
> None
> >>> print str(True)
> True
> >>> print str(float(0.1))
> 0.1
> >>> str(bytearray([255]))
> '\xff'
> >>> str(decimal.Decimal())
> '0'
> >>> str(datetime.date(1,1,1))
> '0001-01-01'
> >>> str(datetime.datetime(1,1,1))
> '0001-01-01 00:00:00'
> >>> str(datetime.time(1,1,1))
> '01:01:01'
> {code}
> First, I tried to find a relevant issue for this but couldn't. Please mark 
> this as a duplicate if one already exists.
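
A Scala sketch of that "tightest common type" merging logic over toy type tags 
(not Spark's DataType classes and not the PySpark code itself), just to make the 
rule concrete:

{code}
sealed trait ToyType
case object NullT   extends ToyType
case object IntT    extends ToyType
case object LongT   extends ToyType
case object DoubleT extends ToyType
case object StringT extends ToyType

object MergeTypes {
  private val numericPrecedence = IndexedSeq[ToyType](IntT, LongT, DoubleT)

  def findTightestCommonType(a: ToyType, b: ToyType): ToyType = (a, b) match {
    case (x, y) if x == y            => x
    case (NullT, other)              => other
    case (other, NullT)              => other
    case (StringT, _) | (_, StringT) => StringT
    case _ =>
      // Both numeric: widen to the later of the two in the precedence order.
      numericPrecedence(numericPrecedence.indexOf(a) max numericPrecedence.indexOf(b))
  }
}
{code}

For example, merging IntT with DoubleT gives DoubleT, and merging anything with 
StringT falls back to StringT, matching the string conversions listed above.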



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13311) prettyString of IN is not good

2016-02-13 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13311:
--

 Summary: prettyString of IN is not good
 Key: SPARK-13311
 URL: https://issues.apache.org/jira/browse/SPARK-13311
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Davies Liu


In(i_class,[Ljava.lang.Object;@1a575883))



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12705) Sorting column can't be resolved if it's not in projection

2016-02-12 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-12705.

Resolution: Fixed

Issue resolved by pull request 11153
[https://github.com/apache/spark/pull/11153]

> Sorting column can't be resolved if it's not in projection
> --
>
> Key: SPARK-12705
> URL: https://issues.apache.org/jira/browse/SPARK-12705
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> The following query can't be resolved:
> {code}
> scala> sqlContext.sql("select sum(a) over ()  from (select 1 as a, 2 as b) t 
> order by b").explain()
> org.apache.spark.sql.AnalysisException: cannot resolve 'b' given input 
> columns: [_c0]; line 1 pos 63
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:336)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:336)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:335)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:282)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:109)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:119)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:123)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13304) Broadcast join with two ints could be very slow

2016-02-12 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13304:
--

 Summary: Broadcast join with two ints could be very slow
 Key: SPARK-13304
 URL: https://issues.apache.org/jira/browse/SPARK-13304
 Project: Spark
  Issue Type: Bug
Reporter: Davies Liu


If the two join columns have the same value, their hash code will be (a ^ b), 
which is 0, so the HashMap will be very, very slow.
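
A small illustration of the degenerate case, together with one conventional way 
to mix the two values (31-based combining); this only shows the effect, not 
necessarily the fix that was applied:

{code}
// When both key columns hold the same value, a ^ b is always 0: every row lands
// in the same hash bucket and lookups degrade to a linear scan.
def xorHash(a: Int, b: Int): Int = a ^ b

// Conventional 31-based combining keeps equal pairs spread across buckets.
def mixedHash(a: Int, b: Int): Int = 31 * a + b

val pairs = Seq((1, 1), (2, 2), (3, 3), (42, 42))
assert(pairs.map { case (a, b) => xorHash(a, b) }.toSet == Set(0))    // all collide on 0
assert(pairs.map { case (a, b) => mixedHash(a, b) }.toSet.size == 4)  // all distinct here
{code}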



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13306) Uncorrelated scalar subquery

2016-02-12 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13306:
--

 Summary: Uncorrelated scalar subquery
 Key: SPARK-13306
 URL: https://issues.apache.org/jira/browse/SPARK-13306
 Project: Spark
  Issue Type: New Feature
Reporter: Davies Liu
Assignee: Davies Liu


A scalar subquery is a subquery that generates only a single row and a single 
column, and can be used as part of an expression.

An uncorrelated scalar subquery is one that does not reference the outer query.
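
A hypothetical usage example (table and column names are made up): the subquery 
returns one row and one column and does not reference the outer query, so it can 
be evaluated once and used as a constant.

{code}
// Assumes a SQLContext named sqlContext is in scope, as in the other examples.
sqlContext.sql(
  """SELECT name, salary
    |FROM employees
    |WHERE salary > (SELECT avg(salary) FROM employees)
  """.stripMargin).show()
{code}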



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12962) PySpark support covar_samp and covar_pop

2016-02-12 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-12962.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10876
[https://github.com/apache/spark/pull/10876]

> PySpark support covar_samp and covar_pop
> 
>
> Key: SPARK-12962
> URL: https://issues.apache.org/jira/browse/SPARK-12962
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Reporter: Yanbo Liang
>Priority: Minor
> Fix For: 2.0.0
>
>
> PySpark support covar_samp and covar_pop



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12544) Support window functions in SQLContext

2016-02-12 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-12544.

   Resolution: Fixed
 Assignee: Herman van Hovell
Fix Version/s: 2.0.0

Since we updated the parser, window functions work in SQLContext.

> Support window functions in SQLContext
> --
>
> Key: SPARK-12544
> URL: https://issues.apache.org/jira/browse/SPARK-12544
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Herman van Hovell
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13304) Broadcast join with two ints could be very slow

2016-02-12 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-13304:
---
Component/s: SQL

> Broadcast join with two ints could be very slow
> ---
>
> Key: SPARK-13304
> URL: https://issues.apache.org/jira/browse/SPARK-13304
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>
> If the two join columns have the same value, their hash code will be (a ^ b), 
> which is 0, so the HashMap will be very, very slow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13306) Uncorrelated scalar subquery

2016-02-12 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-13306:
---
Component/s: SQL

> Uncorrelated scalar subquery
> 
>
> Key: SPARK-13306
> URL: https://issues.apache.org/jira/browse/SPARK-13306
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> A scalar subquery is a subquery that generates only a single row and a single 
> column, and can be used as part of an expression.
> An uncorrelated scalar subquery is one that does not reference the outer query.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12544) Support window functions in SQLContext

2016-02-12 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145442#comment-15145442
 ] 

Davies Liu commented on SPARK-12544:


[~hvanhovell] Do window functions still require HiveContext? Or should we 
update the docs/comments for window functions?

> Support window functions in SQLContext
> --
>
> Key: SPARK-12544
> URL: https://issues.apache.org/jira/browse/SPARK-12544
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Herman van Hovell
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12544) Support window functions in SQLContext

2016-02-12 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145581#comment-15145581
 ] 

Davies Liu commented on SPARK-12544:


We are retiring HiveContext in 2.0; we may update the docs together.

> Support window functions in SQLContext
> --
>
> Key: SPARK-12544
> URL: https://issues.apache.org/jira/browse/SPARK-12544
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Herman van Hovell
>  Labels: releasenotes
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12915) SQL metrics for generated operators

2016-02-11 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-12915:
--

Assignee: Davies Liu

> SQL metrics for generated operators
> ---
>
> Key: SPARK-12915
> URL: https://issues.apache.org/jira/browse/SPARK-12915
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> The metrics should be very efficient. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12949) Support common expression elimination

2016-02-11 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15143787#comment-15143787
 ] 

Davies Liu commented on SPARK-12949:


After some prototyping, enabling common expression elimination gave a 10+% 
improvement on stddev, but a 50% regression on Kurtosis. I haven't figured out 
why; maybe the JIT can already eliminate the common expressions (given that 
Kurtosis is only 20% slower than stddev)? If so, we may not want to do this.

> Support common expression elimination
> -
>
> Key: SPARK-12949
> URL: https://issues.apache.org/jira/browse/SPARK-12949
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13293) Generate code for Expand

2016-02-11 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13293:
--

 Summary: Generate code for Expand
 Key: SPARK-13293
 URL: https://issues.apache.org/jira/browse/SPARK-13293
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-12705) Sorting column can't be resolved if it's not in projection

2016-02-10 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reopened SPARK-12705:

  Assignee: Davies Liu  (was: Xiao Li)

Q98 still can't be analyzed; I will send a PR to fix that.

> Sorting column can't be resolved if it's not in projection
> --
>
> Key: SPARK-12705
> URL: https://issues.apache.org/jira/browse/SPARK-12705
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> The following query can't be resolved:
> {code}
> scala> sqlContext.sql("select sum(a) over ()  from (select 1 as a, 2 as b) t 
> order by b").explain()
> org.apache.spark.sql.AnalysisException: cannot resolve 'b' given input 
> columns: [_c0]; line 1 pos 63
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:336)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:336)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:335)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:282)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:109)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:119)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:123)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13174) Add API and options for csv data sources

2016-02-10 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15141438#comment-15141438
 ] 

Davies Liu commented on SPARK-13174:


[~GayathriMurali] Yes, there is a way, but it's not as convenient as the other 
built-in data sources (like parquet, json, jdbc).

> Add API and options for csv data sources
> 
>
> Key: SPARK-13174
> URL: https://issues.apache.org/jira/browse/SPARK-13174
> Project: Spark
>  Issue Type: New Feature
>  Components: Input/Output
>Reporter: Davies Liu
>
> We should have an API to load the csv data source (with some options as 
> arguments), similar to json() and jdbc().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12706) support grouping/grouping_id function together group set

2016-02-10 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-12706.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10677
[https://github.com/apache/spark/pull/10677]

> support grouping/grouping_id function together group set
> 
>
> Key: SPARK-12706
> URL: https://issues.apache.org/jira/browse/SPARK-12706
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup#EnhancedAggregation,Cube,GroupingandRollup-Grouping__IDfunction
> http://etutorials.org/SQL/Mastering+Oracle+SQL/Chapter+13.+Advanced+Group+Operations/13.3+The+GROUPING_ID+and+GROUP_ID+Functions/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13234) Remove duplicated SQL metrics

2016-02-10 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13234.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11163
[https://github.com/apache/spark/pull/11163]

> Remove duplicated SQL metrics
> -
>
> Key: SPARK-13234
> URL: https://issues.apache.org/jira/browse/SPARK-13234
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
> Fix For: 2.0.0
>
>
> For lots of SQL operators, we have metrics for both input and output, but the 
> number of input rows is exactly the number of output rows of the child, so we 
> could keep metrics only for output rows.
> After improving performance with whole-stage codegen, the overhead of SQL 
> metrics is not trivial anymore, so we should avoid it when it's not necessary.
> Some operators do not have SQL metrics at all; we should add them.
> For operators that have the same number of rows for input and output (for 
> example, Projection), we may not need the input metric.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13249) Filter null keys for inner join

2016-02-09 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13249:
--

 Summary: Filter null keys for inner join
 Key: SPARK-13249
 URL: https://issues.apache.org/jira/browse/SPARK-13249
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu


For an inner join, keys containing null never match, so we could insert a 
Filter before the inner join (which could be pushed down); then we don't need to 
check the nullability of keys while joining.
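
A toy illustration on plain Scala collections (not the actual optimizer rule) 
showing that pre-filtering null keys preserves the inner-join result:

{code}
case class Row(key: Option[Int], value: String)

// Inner join on the key; a null (None) key can never match anything.
def innerJoin(left: Seq[Row], right: Seq[Row]): Seq[(Row, Row)] =
  for {
    l <- left
    r <- right
    if l.key.isDefined && l.key == r.key
  } yield (l, r)

val left  = Seq(Row(Some(1), "a"), Row(None, "b"))
val right = Seq(Row(Some(1), "x"), Row(None, "y"))

// Inserting the Filter (dropping null keys) before the join gives the same result,
// and the join itself no longer needs any null check.
val preFiltered = innerJoin(left.filter(_.key.isDefined), right.filter(_.key.isDefined))
assert(innerJoin(left, right) == preFiltered)
{code}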



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-13173) Fail to load CSV file with NPE

2016-02-09 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu closed SPARK-13173.
--
Resolution: Cannot Reproduce

> Fail to load CSV file with NPE
> --
>
> Key: SPARK-13173
> URL: https://issues.apache.org/jira/browse/SPARK-13173
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, SQL
>Reporter: Davies Liu
>
> {code}
> id|end_date|start_date|location
> 1|2015-10-14 00:00:00|2015-09-14 00:00:00|CA-SF
> 2|2015-10-15 01:00:20|2015-08-14 00:00:00|CA-SD
> 3|2015-10-16 02:30:00|2015-01-14 00:00:00|NY-NY
> 4|2015-10-17 03:00:20|2015-02-14 00:00:00|NY-NY
> 5|2015-10-18 04:30:00|2014-04-14 00:00:00|CA-SD
> {code}
> {code}
> adult_df = sqlContext.read.\
> format("org.apache.spark.sql.execution.datasources.csv").\
> option("header", "false").option("delimiter", "|").\
> option("inferSchema", "true").load("/tmp/dataframe_sample.csv")
> {code}
> {code}
> Py4JJavaError: An error occurred while calling o239.load.
> : java.lang.NullPointerException
>   at 
> scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:114)
>   at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:114)
>   at 
> scala.collection.IndexedSeqOptimized$class.zipWithIndex(IndexedSeqOptimized.scala:93)
>   at 
> scala.collection.mutable.ArrayOps$ofRef.zipWithIndex(ArrayOps.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVRelation.inferSchema(CSVRelation.scala:137)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVRelation.dataSchema$lzycompute(CSVRelation.scala:50)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVRelation.dataSchema(CSVRelation.scala:48)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:666)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:665)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:39)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:115)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:136)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
>   at py4j.Gateway.invoke(Gateway.java:290)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:209)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13173) Fail to load CSV file with NPE

2016-02-09 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-13173:
---
Description: 
{code}
id|end_date|start_date|location
1|2015-10-14 00:00:00|2015-09-14 00:00:00|CA-SF
2|2015-10-15 01:00:20|2015-08-14 00:00:00|CA-SD
3|2015-10-16 02:30:00|2015-01-14 00:00:00|NY-NY
4|2015-10-17 03:00:20|2015-02-14 00:00:00|NY-NY
5|2015-10-18 04:30:00|2014-04-14 00:00:00|CA-SD
{code}

{code}
adult_df = sqlContext.read.\
format("org.apache.spark.sql.execution.datasources.csv").\
option("header", "true").option("delimiter", "|").\
option("inferSchema", "true").load("/tmp/dataframe_sample.csv")
{code}

{code}
Py4JJavaError: An error occurred while calling o239.load.
: java.lang.NullPointerException
at 
scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:114)
at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:114)
at 
scala.collection.IndexedSeqOptimized$class.zipWithIndex(IndexedSeqOptimized.scala:93)
at 
scala.collection.mutable.ArrayOps$ofRef.zipWithIndex(ArrayOps.scala:108)
at 
org.apache.spark.sql.execution.datasources.csv.CSVRelation.inferSchema(CSVRelation.scala:137)
at 
org.apache.spark.sql.execution.datasources.csv.CSVRelation.dataSchema$lzycompute(CSVRelation.scala:50)
at 
org.apache.spark.sql.execution.datasources.csv.CSVRelation.dataSchema(CSVRelation.scala:48)
at 
org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:666)
at 
org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:665)
at 
org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:39)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:115)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:136)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:290)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
{code}

  was:
{code}
id|end_date|start_date|location
1|2015-10-14 00:00:00|2015-09-14 00:00:00|CA-SF
2|2015-10-15 01:00:20|2015-08-14 00:00:00|CA-SD
3|2015-10-16 02:30:00|2015-01-14 00:00:00|NY-NY
4|2015-10-17 03:00:20|2015-02-14 00:00:00|NY-NY
5|2015-10-18 04:30:00|2014-04-14 00:00:00|CA-SD
{code}

{code}
adult_df = sqlContext.read.\
format("org.apache.spark.sql.execution.datasources.csv").\
option("header", "false").option("delimiter", "|").\
option("inferSchema", "true").load("/tmp/dataframe_sample.csv")
{code}

{code}
Py4JJavaError: An error occurred while calling o239.load.
: java.lang.NullPointerException
at 
scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:114)
at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:114)
at 
scala.collection.IndexedSeqOptimized$class.zipWithIndex(IndexedSeqOptimized.scala:93)
at 
scala.collection.mutable.ArrayOps$ofRef.zipWithIndex(ArrayOps.scala:108)
at 
org.apache.spark.sql.execution.datasources.csv.CSVRelation.inferSchema(CSVRelation.scala:137)
at 
org.apache.spark.sql.execution.datasources.csv.CSVRelation.dataSchema$lzycompute(CSVRelation.scala:50)
at 
org.apache.spark.sql.execution.datasources.csv.CSVRelation.dataSchema(CSVRelation.scala:48)
at 
org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:666)
at 
org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:665)
at 
org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:39)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:115)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:136)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:290)
 

[jira] [Resolved] (SPARK-12950) Improve performance of BytesToBytesMap

2016-02-09 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-12950.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11010
[https://github.com/apache/spark/pull/11010]

> Improve performance of BytesToBytesMap
> --
>
> Key: SPARK-12950
> URL: https://issues.apache.org/jira/browse/SPARK-12950
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
> Fix For: 2.0.0
>
>
> When benchmarking generated aggregates with grouping keys, profiling showed 
> that the lookup in BytesToBytesMap took about 90% of the CPU time, so we 
> should optimize it.
> After profiling with jvisualvm, here are the things that take most of the 
> time:
> 1. decode address from Long to baseObject and offset
> 2. calculate hash code
> 3. compare the bytes (equality check)
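
For item 1 above, a simplified sketch of what "decode address from Long" 
involves (the bit split here is illustrative, not necessarily the exact 
TaskMemoryManager layout):

{code}
object AddressSketch {
  private val OffsetBits = 51                      // illustrative split of the 64 bits
  private val OffsetMask = (1L << OffsetBits) - 1

  def encode(pageNumber: Int, offsetInPage: Long): Long =
    (pageNumber.toLong << OffsetBits) | (offsetInPage & OffsetMask)

  def decodePage(address: Long): Int    = (address >>> OffsetBits).toInt
  def decodeOffset(address: Long): Long = address & OffsetMask
}

// Every map probe pays for this unpacking, plus hashing and the byte-level
// equality check mentioned in items 2 and 3.
val addr = AddressSketch.encode(pageNumber = 3, offsetInPage = 1024L)
assert(AddressSketch.decodePage(addr) == 3 && AddressSketch.decodeOffset(addr) == 1024L)
{code}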



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12992) Vectorize parquet decoding using ColumnarBatch

2016-02-08 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-12992.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11055
[https://github.com/apache/spark/pull/11055]

> Vectorize parquet decoding using ColumnarBatch
> --
>
> Key: SPARK-12992
> URL: https://issues.apache.org/jira/browse/SPARK-12992
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Nong Li
>Assignee: Apache Spark
> Fix For: 2.0.0
>
>
> Parquet files benefit from vectorized decoding. ColumnarBatches have been 
> designed to support this. This means that a single encoded parquet column is 
> decoded to a single ColumnVector. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8964) Use Exchange in limit operations (per partition limit -> exchange to one partition -> per partition limit)

2016-02-08 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8964.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 7334
[https://github.com/apache/spark/pull/7334]

> Use Exchange in limit operations (per partition limit -> exchange to one 
> partition -> per partition limit)
> --
>
> Key: SPARK-8964
> URL: https://issues.apache.org/jira/browse/SPARK-8964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Josh Rosen
> Fix For: 2.0.0
>
>
> Spark SQL's physical Limit operator currently performs its own shuffle rather 
> than using Exchange to perform the shuffling.  This is less efficient since 
> this non-exchange shuffle path won't be able to benefit from SQL-specific 
> shuffling optimizations, such as SQLSerializer2.  It also involves additional 
> unnecessary row copying.
> Instead, I think that we should rewrite Limit to expand into three physical 
> operators:
> PerPartitionLimit -> Exchange to one partition -> PerPartitionLimit
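
A minimal RDD-level sketch of the proposed shape (the real change uses physical 
operators, not these RDD calls):

{code}
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Per-partition limit, then shuffle the survivors to a single partition,
// then apply the per-partition limit once more on that partition.
def limitViaExchange[T: ClassTag](rdd: RDD[T], n: Int): RDD[T] =
  rdd.mapPartitions(_.take(n))
     .coalesce(1, shuffle = true)
     .mapPartitions(_.take(n))
{code}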



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12840) Support passing arbitrary objects (not just expressions) into code generated classes

2016-02-08 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-12840:
---
Component/s: SQL

> Support passing arbitrary objects (not just expressions) into code generated 
> classes
> 
>
> Key: SPARK-12840
> URL: https://issues.apache.org/jira/browse/SPARK-12840
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> As of now, our code generator only allows passing Expression objects into the 
> generated class as arguments. In order to support whole-stage codegen (e.g. 
> for broadcast joins), the generated classes need to accept other types of 
> objects such as hash tables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13215) Remove fallback in codegen

2016-02-08 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-13215:
---
Component/s: SQL

> Remove fallback in codegen
> --
>
> Key: SPARK-13215
> URL: https://issues.apache.org/jira/browse/SPARK-13215
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> In newMutableProjection, it will fall back to InterpretedMutableProjection if 
> compilation fails.
> Since we removed the configuration for codegen, we rely heavily on codegen 
> (and TungstenAggregate requires the generated MutableProjection to update 
> UnsafeRow), so we should remove the fallback, which could be confusing for 
> users; see the discussion in SPARK-13116.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13234) Remove duplicated SQL metrics

2016-02-08 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13234:
--

 Summary: Remove duplicated SQL metrics
 Key: SPARK-13234
 URL: https://issues.apache.org/jira/browse/SPARK-13234
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu


For lots of SQL operators, we have metrics for both input and output, but the 
number of input rows is exactly the number of output rows of the child, so we 
could keep metrics only for output rows.

After improving performance with whole-stage codegen, the overhead of SQL 
metrics is not trivial anymore, so we should avoid it when it's not necessary.

Some operators do not have SQL metrics at all; we should add them.

For operators that have the same number of rows for input and output (for 
example, Projection), we may not need the input metric.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13213) BroadcastNestedLoopJoin is very slow

2016-02-08 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15137551#comment-15137551
 ] 

Davies Liu commented on SPARK-13213:


[~sowen] Thanks very much for updating these. I have been trying to remember to 
add that, but may still miss it sometimes. Can we mark it as required (or 
remember the last action as the default value)?

> BroadcastNestedLoopJoin is very slow
> 
>
> Key: SPARK-13213
> URL: https://issues.apache.org/jira/browse/SPARK-13213
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>
> Since we have improved the performance of CartesianProduct, which should be 
> faster and more robust than BroadcastNestedLoopJoin, we should do 
> CartesianProduct instead of BroadcastNestedLoopJoin, especially when the 
> broadcasted table is not that small.
> Today we hit a query that took a very long time and still had not finished; 
> once we decreased the threshold for broadcast (disabling 
> BroadcastNestedLoopJoin), it finished in seconds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12585) The numFields of UnsafeRow should not changed by pointTo()

2016-02-08 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-12585:
---
Component/s: SQL

> The numFields of UnsafeRow should not changed by pointTo()
> --
>
> Key: SPARK-12585
> URL: https://issues.apache.org/jira/browse/SPARK-12585
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
> Fix For: 2.0.0
>
>
> Right now, numFields is passed in by pointTo() and then bitSetWidthInBytes is 
> calculated, making pointTo() a little heavy.
> It should be part of the constructor of UnsafeRow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13095) improve performance of hash join with dimension table

2016-02-08 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13095.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11065
[https://github.com/apache/spark/pull/11065]

> improve performance of hash join with dimension table
> -
>
> Key: SPARK-13095
> URL: https://issues.apache.org/jira/browse/SPARK-13095
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> The join key is usually an integer or long (a unique primary key), so we 
> could have a special HashRelation for it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13237) Generate broadcast outer join

2016-02-08 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13237:
--

 Summary: Generate broadcast outer join
 Key: SPARK-13237
 URL: https://issues.apache.org/jira/browse/SPARK-13237
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13213) BroadcastNestedLoopJoin is very slow

2016-02-05 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13213:
--

 Summary: BroadcastNestedLoopJoin is very slow
 Key: SPARK-13213
 URL: https://issues.apache.org/jira/browse/SPARK-13213
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu


Since we have improved the performance of CartesianProduct, which should be 
faster and more robust than BroadcastNestedLoopJoin, we should do 
CartesianProduct instead of BroadcastNestedLoopJoin, especially when the 
broadcasted table is not that small.

Today we hit a query that took a very long time and still had not finished; once 
we decreased the threshold for broadcast (disabling BroadcastNestedLoopJoin), it 
finished in seconds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13215) Remove fallback in codegen

2016-02-05 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13215.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11097
[https://github.com/apache/spark/pull/11097]

> Remove fallback in codegen
> --
>
> Key: SPARK-13215
> URL: https://issues.apache.org/jira/browse/SPARK-13215
> Project: Spark
>  Issue Type: Improvement
>Reporter: Davies Liu
> Fix For: 2.0.0
>
>
> In newMutableProjection, it will fall back to InterpretedMutableProjection if 
> compilation fails.
> Since we removed the configuration for codegen, we rely heavily on codegen 
> (and TungstenAggregate requires the generated MutableProjection to update 
> UnsafeRow), so we should remove the fallback, which could be confusing for 
> users; see the discussion in SPARK-13116.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13217) UI of visualization of stage

2016-02-05 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13217:
--

 Summary: UI of visualization of stage 
 Key: SPARK-13217
 URL: https://issues.apache.org/jira/browse/SPARK-13217
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.6.0
Reporter: Davies Liu


http://localhost:4041/stages/stage/?id=36=0=true

{code}
HTTP ERROR 500

Problem accessing /stages/stage/. Reason:

Server Error
Caused by:

java.lang.IncompatibleClassChangeError: vtable stub
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:102)
at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:79)
at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:79)
at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:79)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:735)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
at 
org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501)
at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at 
org.eclipse.jetty.server.handler.GzipHandler.handle(GzipHandler.java:264)
at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:370)
at 
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
at 
org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
at 
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at 
org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
at 
org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
at 
org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:745)
Powered by Jetty://



















{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-13217) UI of visualization of stage

2016-02-05 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu closed SPARK-13217.
--
Resolution: Not A Problem

It works after restarting the driver.

> UI of visualization of stage 
> -
>
> Key: SPARK-13217
> URL: https://issues.apache.org/jira/browse/SPARK-13217
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Davies Liu
>
> http://localhost:4041/stages/stage/?id=36=0=true
> {code}
> HTTP ERROR 500
> Problem accessing /stages/stage/. Reason:
> Server Error
> Caused by:
> java.lang.IncompatibleClassChangeError: vtable stub
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:102)
>   at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:79)
>   at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:79)
>   at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:79)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:735)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
>   at 
> org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
>   at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501)
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
>   at 
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
>   at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
>   at 
> org.eclipse.jetty.server.handler.GzipHandler.handle(GzipHandler.java:264)
>   at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
>   at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
>   at org.eclipse.jetty.server.Server.handle(Server.java:370)
>   at 
> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
>   at 
> org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
>   at 
> org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
>   at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644)
>   at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
>   at 
> org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
>   at 
> org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
>   at 
> org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
>   at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
>   at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
>   at java.lang.Thread.run(Thread.java:745)
> Powered by Jetty://
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13116) TungstenAggregate though it is supposedly capable of all processing unsafe & safe rows, fails if the input is safe rows

2016-02-05 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15134979#comment-15134979
 ] 

Davies Liu commented on SPARK-13116:


I tried the test you attached; it works well on master.

> TungstenAggregate though it is supposedly capable of all processing unsafe & 
> safe rows, fails if the input is safe rows
> ---
>
> Key: SPARK-13116
> URL: https://issues.apache.org/jira/browse/SPARK-13116
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Asif Hussain Shahid
> Attachments: SPARK_13116_Test.scala
>
>
> TungstenAggregate though it is supposedly capable of all processing unsafe & 
> safe rows, fails if the input is safe rows.
> If the input to TungstenAggregateIterator is a SafeRow, while the target is 
> an UnsafeRow ,  the current code will try to set the fields in the UnsafeRow 
> using the update method in UnSafeRow. 
> This method is called via TunsgtenAggregateIterator on the 
> InterpretedMutableProjection. The target row in the 
> InterpretedMutableProjection is an UnsafeRow, while the current row is a 
> SafeRow.
> In the InterpretedMutableProjection's apply method, it invokes
>  mutableRow(i) = exprArray(i).eval(input)
> Now for UnsafeRow, the update method throws UnsupportedOperationException.
> The proposed fix I did for our forked branch , on the class 
> InterpretedProjection is:
> +  private var targetUnsafe = false
>  +  type UnsafeSetter = (UnsafeRow,  Any ) => Unit
>  +  private var setters : Array[UnsafeSetter] = _
> private[this] val exprArray = expressions.toArray
> private[this] var mutableRow: MutableRow = new 
> GenericMutableRow(exprArray.length)
> def currentValue: InternalRow = mutableRow
>   
>  +
> override def target(row: MutableRow): MutableProjection = {
>   mutableRow = row
>  +targetUnsafe = row match {
>  +  case _:UnsafeRow =>{
>  +if(setters == null) {
>  +  setters = Array.ofDim[UnsafeSetter](exprArray.length)
>  +  for(i <- 0 until exprArray.length) {
>  +setters(i) = exprArray(i).dataType match {
>  +  case IntegerType => (target: UnsafeRow,  value: Any ) =>
>  +target.setInt(i,value.asInstanceOf[Int])
>  +  case LongType => (target: UnsafeRow,  value: Any ) =>
>  +target.setLong(i,value.asInstanceOf[Long])
>  +  case DoubleType => (target: UnsafeRow,  value: Any ) =>
>  +target.setDouble(i,value.asInstanceOf[Double])
>  +  case FloatType => (target: UnsafeRow, value: Any ) =>
>  +target.setFloat(i,value.asInstanceOf[Float])
>  +
>  +  case NullType => (target: UnsafeRow,  value: Any ) =>
>  +target.setNullAt(i)
>  +
>  +  case BooleanType => (target: UnsafeRow,  value: Any ) =>
>  +target.setBoolean(i,value.asInstanceOf[Boolean])
>  +
>  +  case ByteType => (target: UnsafeRow,  value: Any ) =>
>  +target.setByte(i,value.asInstanceOf[Byte])
>  +  case ShortType => (target: UnsafeRow, value: Any ) =>
>  +target.setShort(i,value.asInstanceOf[Short])
>  +
>  +}
>  +  }
>  +}
>  +true
>  +  }
>  +  case _ => false
>  +}
>  +
>   this
> }
>   
> override def apply(input: InternalRow): InternalRow = {
>   var i = 0
>   while (i < exprArray.length) {
>  -  mutableRow(i) = exprArray(i).eval(input)
>  +  if(targetUnsafe) {
>  +setters(i)(mutableRow.asInstanceOf[UnsafeRow], 
> exprArray(i).eval(input))
>  +  }else {
>  +mutableRow(i) = exprArray(i).eval(input)
>  +  }
> i += 1
>   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13215) Remove fallback in codegen

2016-02-05 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13215:
--

 Summary: Remove fallback in codegen
 Key: SPARK-13215
 URL: https://issues.apache.org/jira/browse/SPARK-13215
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu


In newMutableProjection, it will fall back to InterpretedMutableProjection if 
compilation fails.

Since we removed the configuration for codegen, we rely heavily on codegen (and 
TungstenAggregate requires the generated MutableProjection to update UnsafeRow), 
so we should remove the fallback, which could be confusing for users; see the 
discussion in SPARK-13116.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13116) TungstenAggregate though it is supposedly capable of all processing unsafe & safe rows, fails if the input is safe rows

2016-02-05 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15134959#comment-15134959
 ] 

Davies Liu commented on SPARK-13116:


Which UnaryExpression generates incorrect code? I think we may just remove the 
fallback.

> TungstenAggregate though it is supposedly capable of all processing unsafe & 
> safe rows, fails if the input is safe rows
> ---
>
> Key: SPARK-13116
> URL: https://issues.apache.org/jira/browse/SPARK-13116
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Asif Hussain Shahid
> Attachments: SPARK_13116_Test.scala
>
>
> TungstenAggregate though it is supposedly capable of all processing unsafe & 
> safe rows, fails if the input is safe rows.
> If the input to TungstenAggregateIterator is a SafeRow, while the target is 
> an UnsafeRow, the current code will try to set the fields in the UnsafeRow 
> using the update method in UnsafeRow. 
> This method is called via TungstenAggregateIterator on the 
> InterpretedMutableProjection. The target row in the 
> InterpretedMutableProjection is an UnsafeRow, while the current row is a 
> SafeRow.
> In the InterpretedMutableProjection's apply method, it invokes
>  mutableRow(i) = exprArray(i).eval(input)
> Now for UnsafeRow, the update method throws UnsupportedOperationException.
> The proposed fix I did for our forked branch , on the class 
> InterpretedProjection is:
> +  private var targetUnsafe = false
>  +  type UnsafeSetter = (UnsafeRow,  Any ) => Unit
>  +  private var setters : Array[UnsafeSetter] = _
> private[this] val exprArray = expressions.toArray
> private[this] var mutableRow: MutableRow = new 
> GenericMutableRow(exprArray.length)
> def currentValue: InternalRow = mutableRow
>   
>  +
> override def target(row: MutableRow): MutableProjection = {
>   mutableRow = row
>  +targetUnsafe = row match {
>  +  case _:UnsafeRow =>{
>  +if(setters == null) {
>  +  setters = Array.ofDim[UnsafeSetter](exprArray.length)
>  +  for(i <- 0 until exprArray.length) {
>  +setters(i) = exprArray(i).dataType match {
>  +  case IntegerType => (target: UnsafeRow,  value: Any ) =>
>  +target.setInt(i,value.asInstanceOf[Int])
>  +  case LongType => (target: UnsafeRow,  value: Any ) =>
>  +target.setLong(i,value.asInstanceOf[Long])
>  +  case DoubleType => (target: UnsafeRow,  value: Any ) =>
>  +target.setDouble(i,value.asInstanceOf[Double])
>  +  case FloatType => (target: UnsafeRow, value: Any ) =>
>  +target.setFloat(i,value.asInstanceOf[Float])
>  +
>  +  case NullType => (target: UnsafeRow,  value: Any ) =>
>  +target.setNullAt(i)
>  +
>  +  case BooleanType => (target: UnsafeRow,  value: Any ) =>
>  +target.setBoolean(i,value.asInstanceOf[Boolean])
>  +
>  +  case ByteType => (target: UnsafeRow,  value: Any ) =>
>  +target.setByte(i,value.asInstanceOf[Byte])
>  +  case ShortType => (target: UnsafeRow, value: Any ) =>
>  +target.setShort(i,value.asInstanceOf[Short])
>  +
>  +}
>  +  }
>  +}
>  +true
>  +  }
>  +  case _ => false
>  +}
>  +
>   this
> }
>   
> override def apply(input: InternalRow): InternalRow = {
>   var i = 0
>   while (i < exprArray.length) {
>  -  mutableRow(i) = exprArray(i).eval(input)
>  +  if(targetUnsafe) {
>  +setters(i)(mutableRow.asInstanceOf[UnsafeRow], 
> exprArray(i).eval(input))
>  +  }else {
>  +mutableRow(i) = exprArray(i).eval(input)
>  +  }
> i += 1
>   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13210) NPE in Sort

2016-02-05 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-13210:
--

Assignee: Davies Liu

> NPE in Sort
> ---
>
> Key: SPARK-13210
> URL: https://issues.apache.org/jira/browse/SPARK-13210
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
>
> When run TPCDS query Q78 with scale 10:
> {code}
> 16/02/04 22:39:09 ERROR Executor: Managed memory leak detected; size = 
> 268435456 bytes, TID = 143
> 16/02/04 22:39:09 ERROR Executor: Exception in task 0.0 in stage 47.0 (TID 
> 143)
> java.lang.NullPointerException
>   at 
> org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:333)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:60)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:39)
>   at 
> org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270)
>   at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142)
>   at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:239)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator(UnsafeExternalSorter.java:415)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:116)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168)
>   at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:87)
>   at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:60)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:77)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:45)
>   at org.apache.spark.scheduler.Task.run(Task.scala:81)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13210) NPE in Sort

2016-02-04 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13210:
--

 Summary: NPE in Sort
 Key: SPARK-13210
 URL: https://issues.apache.org/jira/browse/SPARK-13210
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Davies Liu
Priority: Critical


When run TPCDS query Q78 with scale 10:

{code}
16/02/04 22:39:09 ERROR Executor: Managed memory leak detected; size = 
268435456 bytes, TID = 143
16/02/04 22:39:09 ERROR Executor: Exception in task 0.0 in stage 47.0 (TID 143)
java.lang.NullPointerException
at 
org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:333)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:60)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:39)
at 
org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270)
at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142)
at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:239)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator(UnsafeExternalSorter.java:415)
at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:116)
at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168)
at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:87)
at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:60)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at 
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:77)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:45)
at org.apache.spark.scheduler.Task.run(Task.scala:81)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13157) ADD JAR command cannot handle path with @ character

2016-02-03 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129994#comment-15129994
 ] 

Davies Liu commented on SPARK-13157:


This was introduced by https://github.com/apache/spark/pull/10905/files

cc [~hvanhovell]

> ADD JAR command cannot handle path with @ character
> ---
>
> Key: SPARK-13157
> URL: https://issues.apache.org/jira/browse/SPARK-13157
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Priority: Blocker
>
> To reproduce this issue locally, copy {{TestUDTF.jar}} under 
> {{$SPARK_HOME/sql/hive/src/test/resources/TestUDTF.jar}} to 
> {{/tmp/a@b/TestUDTF.jar}}. Then start the Thrift server and run the following 
> commands using Beeline:
> {noformat}
> > add jar file:///tmp/a@b/TestUDTF.jar;
> ...
> > CREATE TEMPORARY FUNCTION udtf_count2 AS 
> > 'org.apache.spark.sql.hive.execution.GenericUDTFCount2';
> Error: org.apache.spark.sql.execution.QueryExecutionException: FAILED: 
> Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.FunctionTask (state=,code=0)
> {noformat}
> Please refer to [this PR comment 
> thread|https://github.com/apache/spark/pull/11040] for more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13157) ADD JAR command cannot handle path with @ character

2016-02-03 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129995#comment-15129995
 ] 

Davies Liu commented on SPARK-13157:


This can be reproduced by:

{code}

  test("path with @") {
val plan = parser.parsePlan("ADD JAR /a/b@1/c.jar")
assert(plan.isInstanceOf[AddJar])
assert(plan.asInstanceOf[AddJar].path === "/a/b@1/c.jar")
  }

{code}

> ADD JAR command cannot handle path with @ character
> ---
>
> Key: SPARK-13157
> URL: https://issues.apache.org/jira/browse/SPARK-13157
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Priority: Blocker
>
> To reproduce this issue locally, copy {{TestUDTF.jar}} under 
> {{$SPARK_HOME/sql/hive/src/test/resources/TestUDTF.jar}} to 
> {{/tmp/a@b/TestUDTF.jar}}. Then start the Thrift server and run the following 
> commands using Beeline:
> {noformat}
> > add jar file:///tmp/a@b/TestUDTF.jar;
> ...
> > CREATE TEMPORARY FUNCTION udtf_count2 AS 
> > 'org.apache.spark.sql.hive.execution.GenericUDTFCount2';
> Error: org.apache.spark.sql.execution.QueryExecutionException: FAILED: 
> Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.FunctionTask (state=,code=0)
> {noformat}
> Please refer to [this PR comment 
> thread|https://github.com/apache/spark/pull/11040] for more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13046) Partitioning looks broken in 1.6

2016-02-03 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131170#comment-15131170
 ] 

Davies Liu commented on SPARK-13046:


I tried Spark 1.6 and master with a directory like this
{code}
test_data
test_data/._common_metadata.crc
test_data/._metadata.crc
test_data/._SUCCESS.crc
test_data/_common_metadata
test_data/_metadata
test_data/_SUCCESS
test_data/id=0
test_data/id=0/id2=0
test_data/id=0/id2=0/.part-r-0-2be05edc-caf1-4d13-8fc9-db6f1b8f2d6d.gz.parquet.crc
test_data/id=0/id2=0/part-r-0-2be05edc-caf1-4d13-8fc9-db6f1b8f2d6d.gz.parquet
test_data/id=1
test_data/id=1/id2=1
test_data/id=1/id2=1/.part-r-0-2be05edc-caf1-4d13-8fc9-db6f1b8f2d6d.gz.parquet.crc
test_data/id=1/id2=1/part-r-0-2be05edc-caf1-4d13-8fc9-db6f1b8f2d6d.gz.parquet
test_data/id=10
test_data/id=10/id2=10
test_data/id=10/id2=10/.part-r-0-2be05edc-caf1-4d13-8fc9-db6f1b8f2d6d.gz.parquet.crc
test_data/id=10/id2=10/part-r-0-2be05edc-caf1-4d13-8fc9-db6f1b8f2d6d.gz.parquet
test_data/id=11
{code}

It works well. Are there any special files in s3://bucket/some_path?
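
For the record, the layout above was produced with something like the following (a minimal sketch; the column names and output path are illustrative):

{code}
// Write a two-level partitioned parquet dataset like the listing above
// (illustrative column names and path).
val df = sqlContext.range(0, 20).selectExpr("id", "id as id2", "id as value")
df.write.partitionBy("id", "id2").parquet("test_data")

// Reading the parent directory back should discover both partition columns.
sqlContext.read.parquet("test_data").printSchema()
{code}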

> Partitioning looks broken in 1.6
> 
>
> Key: SPARK-13046
> URL: https://issues.apache.org/jira/browse/SPARK-13046
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Julien Baley
>
> Hello,
> I have a list of files in s3:
> {code}
> s3://bucket/some_path/date_received=2016-01-13/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some
>  parquet files}
> s3://bucket/some_path/date_received=2016-01-14/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some
>  parquet files}
> s3://bucket/some_path/date_received=2016-01-15/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some
>  parquet files}
> {code}
> Until 1.5.2, it all worked well and passing s3://bucket/some_path/ (the same 
> for the three lines) would correctly identify 2 pairs of key/value, one 
> `date_received` and one `fingerprint`.
> From 1.6.0, I get the following exception:
> {code}
> assertion failed: Conflicting directory structures detected. Suspicious paths
> s3://bucket/some_path/date_received=2016-01-13
> s3://bucket/some_path/date_received=2016-01-14
> s3://bucket/some_path/date_received=2016-01-15
> {code}
> That is to say, the partitioning code now fails to identify 
> date_received=2016-01-13 as a key/value pair.
> I can see that there has been some activity on 
> spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala
>  recently, so that seems related (especially the commits 
> https://github.com/apache/spark/commit/7b5d9051cf91c099458d092a6705545899134b3b
>   and 
> https://github.com/apache/spark/commit/de289bf279e14e47859b5fbcd70e97b9d0759f14
>  ).
> If I read correctly the tests added in those commits:
> -they don't seem to actually test the return value, only that it doesn't crash
> -they only test cases where the s3 path contain 1 key/value pair (which 
> otherwise would catch the bug)
> This is problematic for us as we're trying to migrate all of our spark 
> services to 1.6.0 and this bug is a real blocker. I know it's possible to 
> force a 'union', but I'd rather not do that if the bug can be fixed.
> Any question, please shoot.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13131) Use best time and average time in micro benchmark

2016-02-03 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-13131:
---
Summary: Use best  time and average time in micro benchmark  (was: Use 
median time in benchmark)

> Use best  time and average time in micro benchmark
> --
>
> Key: SPARK-13131
> URL: https://issues.apache.org/jira/browse/SPARK-13131
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Median time should be more stable than average time in benchmark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13131) Use best time and average time in micro benchmark

2016-02-03 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-13131:
---
Description: Best time should be more stable than average time in 
benchmark, together with average time, they could show more information.  (was: 
Median time should be more stable than average time in benchmark.)

> Use best  time and average time in micro benchmark
> --
>
> Key: SPARK-13131
> URL: https://issues.apache.org/jira/browse/SPARK-13131
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Best time should be more stable than average time in benchmark, together with 
> average time, they could show more information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13116) TungstenAggregate though it is supposedly capable of all processing unsafe & safe rows, fails if the input is safe rows

2016-02-03 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131138#comment-15131138
 ] 

Davies Liu commented on SPARK-13116:


Could you provide a test to reproduce this issue?

> TungstenAggregate though it is supposedly capable of all processing unsafe & 
> safe rows, fails if the input is safe rows
> ---
>
> Key: SPARK-13116
> URL: https://issues.apache.org/jira/browse/SPARK-13116
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Asif Hussain Shahid
>
> TungstenAggregate though it is supposedly capable of all processing unsafe & 
> safe rows, fails if the input is safe rows.
> If the input to TungstenAggregateIterator is a SafeRow, while the target is 
> an UnsafeRow, the current code will try to set the fields in the UnsafeRow 
> using the update method in UnsafeRow. 
> This method is called via TungstenAggregateIterator on the 
> InterpretedMutableProjection. The target row in the 
> InterpretedMutableProjection is an UnsafeRow, while the current row is a 
> SafeRow.
> In the InterpretedMutableProjection's apply method, it invokes
>  mutableRow(i) = exprArray(i).eval(input)
> Now for UnsafeRow, the update method throws UnsupportedOperationException.
> The proposed fix I did for our forked branch , on the class 
> InterpretedProjection is:
> +  private var targetUnsafe = false
>  +  type UnsafeSetter = (UnsafeRow,  Any ) => Unit
>  +  private var setters : Array[UnsafeSetter] = _
> private[this] val exprArray = expressions.toArray
> private[this] var mutableRow: MutableRow = new 
> GenericMutableRow(exprArray.length)
> def currentValue: InternalRow = mutableRow
>   
>  +
> override def target(row: MutableRow): MutableProjection = {
>   mutableRow = row
>  +targetUnsafe = row match {
>  +  case _:UnsafeRow =>{
>  +if(setters == null) {
>  +  setters = Array.ofDim[UnsafeSetter](exprArray.length)
>  +  for(i <- 0 until exprArray.length) {
>  +setters(i) = exprArray(i).dataType match {
>  +  case IntegerType => (target: UnsafeRow,  value: Any ) =>
>  +target.setInt(i,value.asInstanceOf[Int])
>  +  case LongType => (target: UnsafeRow,  value: Any ) =>
>  +target.setLong(i,value.asInstanceOf[Long])
>  +  case DoubleType => (target: UnsafeRow,  value: Any ) =>
>  +target.setDouble(i,value.asInstanceOf[Double])
>  +  case FloatType => (target: UnsafeRow, value: Any ) =>
>  +target.setFloat(i,value.asInstanceOf[Float])
>  +
>  +  case NullType => (target: UnsafeRow,  value: Any ) =>
>  +target.setNullAt(i)
>  +
>  +  case BooleanType => (target: UnsafeRow,  value: Any ) =>
>  +target.setBoolean(i,value.asInstanceOf[Boolean])
>  +
>  +  case ByteType => (target: UnsafeRow,  value: Any ) =>
>  +target.setByte(i,value.asInstanceOf[Byte])
>  +  case ShortType => (target: UnsafeRow, value: Any ) =>
>  +target.setShort(i,value.asInstanceOf[Short])
>  +
>  +}
>  +  }
>  +}
>  +true
>  +  }
>  +  case _ => false
>  +}
>  +
>   this
> }
>   
> override def apply(input: InternalRow): InternalRow = {
>   var i = 0
>   while (i < exprArray.length) {
>  -  mutableRow(i) = exprArray(i).eval(input)
>  +  if(targetUnsafe) {
>  +setters(i)(mutableRow.asInstanceOf[UnsafeRow], 
> exprArray(i).eval(input))
>  +  }else {
>  +mutableRow(i) = exprArray(i).eval(input)
>  +  }
> i += 1
>   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13131) Use median time in benchmark

2016-02-03 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131205#comment-15131205
 ] 

Davies Liu commented on SPARK-13131:


[~piccolbo] Thanks for your comments; we also had a lot of offline discussion and debate here. I'd like to provide more information.

This is a micro benchmark that we use internally to measure performance improvements on CPU-bound workloads. Each benchmark has constant input (not random) and runs after the JVM has warmed up (the first run is dropped).

Because this micro benchmark is used as a tool in daily development, it should not take too long to get a rough number, and it should not have too much noise without many retries (otherwise daily development would be slowed down a lot).

After some testing, we saw that the best time is much more stable than the average time, and also reasonable. So we (Reynold, Nong and me) decided to show both the best time and the average time as the result of the benchmark.
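
For reference, this is roughly how best and average times can be reported over warmed-up iterations (an illustrative sketch, not the actual Benchmark utility in Spark):

{code}
// Run a case several times after a warm-up run and report both the best and
// the average wall-clock time in milliseconds.
def measure(iters: Int)(f: => Unit): (Double, Double) = {
  f  // warm-up run, dropped from the results
  val times = (1 to iters).map { _ =>
    val start = System.nanoTime()
    f
    (System.nanoTime() - start) / 1e6
  }
  (times.min, times.sum / times.length)  // (best, average)
}

val (best, avg) = measure(5) { (0 until 1000000).map(_ * 2).sum }
println(s"Best: $best ms, Avg: $avg ms")
{code}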

> Use median time in benchmark
> 
>
> Key: SPARK-13131
> URL: https://issues.apache.org/jira/browse/SPARK-13131
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Median time should be more stable than average time in benchmark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13157) ADD JAR command cannot handle path with @ character

2016-02-03 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13157.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11052
[https://github.com/apache/spark/pull/11052]

> ADD JAR command cannot handle path with @ character
> ---
>
> Key: SPARK-13157
> URL: https://issues.apache.org/jira/browse/SPARK-13157
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Priority: Blocker
> Fix For: 2.0.0
>
>
> To reproduce this issue locally, copy {{TestUDTF.jar}} under 
> {{$SPARK_HOME/sql/hive/src/test/resources/TestUDTF.jar}} to 
> {{/tmp/a@b/TestUDTF.jar}}. Then start the Thrift server and run the following 
> commands using Beeline:
> {noformat}
> > add jar file:///tmp/a@b/TestUDTF.jar;
> ...
> > CREATE TEMPORARY FUNCTION udtf_count2 AS 
> > 'org.apache.spark.sql.hive.execution.GenericUDTFCount2';
> Error: org.apache.spark.sql.execution.QueryExecutionException: FAILED: 
> Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.FunctionTask (state=,code=0)
> {noformat}
> Please refer to [this PR comment 
> thread|https://github.com/apache/spark/pull/11040] for more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13174) Add API and options for csv data sources

2016-02-03 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13174:
--

 Summary: Add API and options for csv data sources
 Key: SPARK-13174
 URL: https://issues.apache.org/jira/browse/SPARK-13174
 Project: Spark
  Issue Type: New Feature
Reporter: Davies Liu


We should have an API to load the csv data source (with some options as arguments), 
similar to json() and jdbc().
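
The API could look roughly like this (a hypothetical sketch of the proposed csv() method, mirroring json() and jdbc(); it does not exist yet):

{code}
// Hypothetical sketch of the proposed reader API:
val df = sqlContext.read
  .option("header", "true")
  .option("delimiter", "|")
  .option("inferSchema", "true")
  .csv("/path/to/data.csv")   // csv() is the method proposed by this issue
{code}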



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13173) Fail to load CSV file with NPE

2016-02-03 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13173:
--

 Summary: Fail to load CSV file with NPE
 Key: SPARK-13173
 URL: https://issues.apache.org/jira/browse/SPARK-13173
 Project: Spark
  Issue Type: Bug
Reporter: Davies Liu


{code}
id|end_date|start_date|location
1|2015-10-14 00:00:00|2015-09-14 00:00:00|CA-SF
2|2015-10-15 01:00:20|2015-08-14 00:00:00|CA-SD
3|2015-10-16 02:30:00|2015-01-14 00:00:00|NY-NY
4|2015-10-17 03:00:20|2015-02-14 00:00:00|NY-NY
5|2015-10-18 04:30:00|2014-04-14 00:00:00|CA-SD
{code}

{code}
adult_df = sqlContext.read.\
format("org.apache.spark.sql.execution.datasources.csv").\
option("header", "false").option("delimiter", "|").\
option("inferSchema", "true").load("/tmp/dataframe_sample.csv")
{code}

{code}
Py4JJavaError: An error occurred while calling o239.load.
: java.lang.NullPointerException
at 
scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:114)
at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:114)
at 
scala.collection.IndexedSeqOptimized$class.zipWithIndex(IndexedSeqOptimized.scala:93)
at 
scala.collection.mutable.ArrayOps$ofRef.zipWithIndex(ArrayOps.scala:108)
at 
org.apache.spark.sql.execution.datasources.csv.CSVRelation.inferSchema(CSVRelation.scala:137)
at 
org.apache.spark.sql.execution.datasources.csv.CSVRelation.dataSchema$lzycompute(CSVRelation.scala:50)
at 
org.apache.spark.sql.execution.datasources.csv.CSVRelation.dataSchema(CSVRelation.scala:48)
at 
org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:666)
at 
org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:665)
at 
org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:39)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:115)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:136)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:290)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
{code}
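
Until inference is fixed, one possible workaround is to supply the schema explicitly (an untested sketch; it assumes a user-specified schema bypasses the failing inferSchema call, and sets header to true so the first line is skipped):

{code}
import org.apache.spark.sql.types._

// Untested sketch: provide the schema up front so the csv data source does
// not need to infer it from the file.
val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("end_date", TimestampType),
  StructField("start_date", TimestampType),
  StructField("location", StringType)))

val df = sqlContext.read
  .format("org.apache.spark.sql.execution.datasources.csv")
  .schema(schema)
  .option("header", "true")
  .option("delimiter", "|")
  .load("/tmp/dataframe_sample.csv")
{code}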



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12992) Vectorize parquet decoding using ColumnarBatch

2016-02-03 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131336#comment-15131336
 ] 

Davies Liu commented on SPARK-12992:


[~nongli] We usually have one PR per JIRA; the JIRA will be closed as resolved by the merge tool automatically.

> Vectorize parquet decoding using ColumnarBatch
> --
>
> Key: SPARK-12992
> URL: https://issues.apache.org/jira/browse/SPARK-12992
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Nong Li
>Assignee: Apache Spark
> Fix For: 2.0.0
>
>
> Parquet files benefit from vectorized decoding. ColumnarBatches have been 
> designed to support this. This means that a single encoded parquet column is 
> decoded to a single ColumnVector. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13131) Use best time and average time in micro benchmark

2016-02-03 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131413#comment-15131413
 ] 

Davies Liu commented on SPARK-13131:


[~srowen] Fully agreed with you, that was my first thought. Some people are concerned that the best time loses some information: one algorithm or implementation may have larger variance than another, for example if it generates lots of garbage but GC is not triggered. So we also include the mean time for reference.

> Use best  time and average time in micro benchmark
> --
>
> Key: SPARK-13131
> URL: https://issues.apache.org/jira/browse/SPARK-13131
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Best time should be more stable than average time in benchmark, together with 
> average time, they could show more information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


