[jira] [Commented] (SPARK-3630) Identify cause of Kryo+Snappy PARSING_ERROR
[ https://issues.apache.org/jira/browse/SPARK-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430215#comment-15430215 ] DjvuLee commented on SPARK-3630: How much data did you test with? We encounter this error in our production; our data is about several TB. The Spark version is 1.6.1, and the snappy version is 1.1.2.4. > Identify cause of Kryo+Snappy PARSING_ERROR > --- > > Key: SPARK-3630 > URL: https://issues.apache.org/jira/browse/SPARK-3630 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 1.1.0, 1.2.0 >Reporter: Andrew Ash >Assignee: Josh Rosen > > A recent GraphX commit caused non-deterministic exceptions in unit tests so > it was reverted (see SPARK-3400). > Separately, [~aash] observed the same exception stacktrace in an > application-specific Kryo registrator: > {noformat} > com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to > uncompress the chunk: PARSING_ERROR(2) > com.esotericsoftware.kryo.io.Input.fill(Input.java:142) > com.esotericsoftware.kryo.io.Input.require(Input.java:169) > com.esotericsoftware.kryo.io.Input.readInt(Input.java:325) > com.esotericsoftware.kryo.io.Input.readFloat(Input.java:624) > com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:127) > > com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:117) > > com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109) > > com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18) > > com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > ... > {noformat} > This ticket is to identify the cause of the exception in the GraphX commit so > the faulty commit can be fixed and merged back into master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3630) Identify cause of Kryo+Snappy PARSING_ERROR
[ https://issues.apache.org/jira/browse/SPARK-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430215#comment-15430215 ] DjvuLee edited comment on SPARK-3630 at 8/22/16 7:10 AM: - Can I ask how much data you tested with? We encounter this error in our production; our data is about several TB. The Spark version is 1.6.1, and the snappy version is 1.1.2.4. When the data is small, we never encounter this error. was (Author: djvulee): How much data did you test with? We encounter this error in our production; our data is about several TB. The Spark version is 1.6.1, and the snappy version is 1.1.2.4. > Identify cause of Kryo+Snappy PARSING_ERROR > --- > > Key: SPARK-3630 > URL: https://issues.apache.org/jira/browse/SPARK-3630 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 1.1.0, 1.2.0 >Reporter: Andrew Ash >Assignee: Josh Rosen > > A recent GraphX commit caused non-deterministic exceptions in unit tests so > it was reverted (see SPARK-3400). > Separately, [~aash] observed the same exception stacktrace in an > application-specific Kryo registrator: > {noformat} > com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to > uncompress the chunk: PARSING_ERROR(2) > com.esotericsoftware.kryo.io.Input.fill(Input.java:142) > com.esotericsoftware.kryo.io.Input.require(Input.java:169) > com.esotericsoftware.kryo.io.Input.readInt(Input.java:325) > com.esotericsoftware.kryo.io.Input.readFloat(Input.java:624) > com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:127) > > com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:117) > > com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109) > > com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18) > > com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > ... > {noformat} > This ticket is to identify the cause of the exception in the GraphX commit so > the faulty commit can be fixed and merged back into master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-3630) Identify cause of Kryo+Snappy PARSING_ERROR
[ https://issues.apache.org/jira/browse/SPARK-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DjvuLee updated SPARK-3630: --- Comment: was deleted (was: How much data did you test with? We encounter this error in our production; our data is about several TB. The Spark version is 1.6.1, and the snappy version is 1.1.2.4.) > Identify cause of Kryo+Snappy PARSING_ERROR > --- > > Key: SPARK-3630 > URL: https://issues.apache.org/jira/browse/SPARK-3630 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 1.1.0, 1.2.0 >Reporter: Andrew Ash >Assignee: Josh Rosen > > A recent GraphX commit caused non-deterministic exceptions in unit tests so > it was reverted (see SPARK-3400). > Separately, [~aash] observed the same exception stacktrace in an > application-specific Kryo registrator: > {noformat} > com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to > uncompress the chunk: PARSING_ERROR(2) > com.esotericsoftware.kryo.io.Input.fill(Input.java:142) > com.esotericsoftware.kryo.io.Input.require(Input.java:169) > com.esotericsoftware.kryo.io.Input.readInt(Input.java:325) > com.esotericsoftware.kryo.io.Input.readFloat(Input.java:624) > com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:127) > > com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:117) > > com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109) > > com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18) > > com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > ... > {noformat} > This ticket is to identify the cause of the exception in the GraphX commit so > the faulty commit can be fixed and merged back into master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
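[Editor's note] For anyone hitting the same PARSING_ERROR, a quick diagnostic sketch (not a fix, and not the resolution of this ticket) is to take snappy out of the picture and see whether the failure follows the codec or the data. This assumes a job where you control the SparkConf:
{code}
// Diagnostic sketch: switch block compression from snappy to lz4.
// If the job still fails with corrupt chunks, the corruption is likely
// upstream of decompression rather than in snappy-java itself.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("parsing-error-triage")
  .set("spark.io.compression.codec", "lz4")
val sc = new SparkContext(conf)
{code}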
[jira] [Commented] (SPARK-5770) Use addJar() to upload a new jar file to executor, it can't be added to classloader
[ https://issues.apache.org/jira/browse/SPARK-5770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430243#comment-15430243 ] marymwu commented on SPARK-5770: Hey, we have run into the same issue too. We tried to fix it but failed. Can anybody help with this issue? Thanks so much! > Use addJar() to upload a new jar file to executor, it can't be added to > classloader > --- > > Key: SPARK-5770 > URL: https://issues.apache.org/jira/browse/SPARK-5770 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: meiyoula >Priority: Minor > > First use addJar() to upload a jar to the executor, then change the jar > content and upload it again. We can see that the local jar file has been > updated, but the classloader still loads the old one. The executor log has no > error or exception pointing to it. > I used spark-shell to test it, with "spark.files.overwrite" set to true. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
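[Editor's note] A common workaround sketch for this behavior, since executor classloaders will not re-define a class they have already loaded: ship updated code under a new, versioned file name instead of overwriting the old jar. The paths below are hypothetical.
{code}
// Workaround sketch, not the fix for this ticket: a new jar name gives the
// executor classloader a fresh entry instead of a stale cached class.
sc.addJar("/opt/jobs/myLib-1.0.0.jar")   // first deployment
// ... later, after rebuilding the jar with changed classes ...
sc.addJar("/opt/jobs/myLib-1.0.1.jar")   // new name => new classpath entry
{code}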
[jira] [Resolved] (SPARK-15285) Generated SpecificSafeProjection.apply method grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-15285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-15285. - Resolution: Fixed Fix Version/s: (was: 2.0.0) 2.1.0 2.0.1 > Generated SpecificSafeProjection.apply method grows beyond 64 KB > > > Key: SPARK-15285 > URL: https://issues.apache.org/jira/browse/SPARK-15285 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Konstantin Shaposhnikov >Assignee: Kazuaki Ishizaki > Fix For: 2.0.1, 2.1.0 > > > The following code snippet results in > {noformat} > org.codehaus.janino.JaninoRuntimeException: Code of method > "(Ljava/lang/Object;)Ljava/lang/Object;" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection" > grows beyond 64 KB > at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941) > {noformat} > {code} > case class S100(s1:String="1", s2:String="2", s3:String="3", s4:String="4", > s5:String="5", s6:String="6", s7:String="7", s8:String="8", s9:String="9", > s10:String="10", s11:String="11", s12:String="12", s13:String="13", > s14:String="14", s15:String="15", s16:String="16", s17:String="17", > s18:String="18", s19:String="19", s20:String="20", s21:String="21", > s22:String="22", s23:String="23", s24:String="24", s25:String="25", > s26:String="26", s27:String="27", s28:String="28", s29:String="29", > s30:String="30", s31:String="31", s32:String="32", s33:String="33", > s34:String="34", s35:String="35", s36:String="36", s37:String="37", > s38:String="38", s39:String="39", s40:String="40", s41:String="41", > s42:String="42", s43:String="43", s44:String="44", s45:String="45", > s46:String="46", s47:String="47", s48:String="48", s49:String="49", > s50:String="50", s51:String="51", s52:String="52", s53:String="53", > s54:String="54", s55:String="55", s56:String="56", s57:String="57", > s58:String="58", s59:String="59", s60:String="60", s61:String="61", > s62:String="62", s63:String="63", s64:String="64", s65:String="65", > s66:String="66", s67:String="67", s68:String="68", s69:String="69", > s70:String="70", s71:String="71", s72:String="72", s73:String="73", > s74:String="74", s75:String="75", s76:String="76", s77:String="77", > s78:String="78", s79:String="79", s80:String="80", s81:String="81", > s82:String="82", s83:String="83", s84:String="84", s85:String="85", > s86:String="86", s87:String="87", s88:String="88", s89:String="89", > s90:String="90", s91:String="91", s92:String="92", s93:String="93", > s94:String="94", s95:String="95", s96:String="96", s97:String="97", > s98:String="98", s99:String="99", s100:String="100") > case class S(s1: S100=S100(), s2: S100=S100(), s3: S100=S100(), s4: > S100=S100(), s5: S100=S100(), s6: S100=S100(), s7: S100=S100(), s8: > S100=S100(), s9: S100=S100(), s10: S100=S100()) > val ds = Seq(S(),S(),S()).toDS > ds.show() > {code} > I could reproduce this with Spark built from 1.6 branch and with > https://home.apache.org/~pwendell/spark-nightly/spark-master-bin/spark-2.0.0-SNAPSHOT-2016_05_11_01_03-8beae59-bin/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17168) CSV with header is incorrectly read if file is partitioned
[ https://issues.apache.org/jira/browse/SPARK-17168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430258#comment-15430258 ] Sean Owen commented on SPARK-17168: --- It's a tough call. I can imagine for example a process ingesting lines of a huge CSV file and outputting them after some generic transformation. One file, with one header, may become many files ... of which only the first has a header. It's unclear whether that or having headers in every file is 'normal'. I'm not sure it's easy to implement, but I could imagine skipping the first line of any file that matches the first line of the first file. > CSV with header is incorrectly read if file is partitioned > -- > > Key: SPARK-17168 > URL: https://issues.apache.org/jira/browse/SPARK-17168 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Mathieu D >Priority: Minor > > If a CSV file is stored in a partitioned fashion, DataFrameReader.csv > with the header option set to true skips the first line of *each partition* > instead of skipping only the first one. > ex: > {code} > // create a partitioned CSV file with header: > val rdd=sc.parallelize(Seq("hdr","1","2","3","4","5","6"), numSlices=2) > rdd.saveAsTextFile("foo") > {code} > Now, if we try to read it with DataFrameReader, the first row of the 2nd > partition is skipped. > {code} > val df = spark.read.option("header","true").csv("foo") > df.show > +---+ > |hdr| > +---+ > | 1| > | 2| > | 4| > | 5| > | 6| > +---+ > // one row is missing > {code} > I more or less understand that this is to be consistent with the save > operation of DataFrameWriter, which saves a header on each individual partition. > But this is very error-prone. In our case, we have large CSV files with > headers already stored in a partitioned way, so we will lose rows if we read > with header set to true. So we have to manually handle the headers. > I suggest a tri-valued option for header, with something like > "skipOnFirstPartition" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
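[Editor's note] A workaround sketch along the lines Sean describes, while the semantics are debated: read the files as plain text, drop every line equal to the header, and build the DataFrame yourself instead of relying on header inference. Note that this also drops any data row that happens to equal the header exactly; the single-column layout mirrors the "hdr" example above.
{code}
// Hedged workaround: tolerate a header in any, all, or only the first
// part-file by filtering header lines out before parsing.
import spark.implicits._

val lines = spark.sparkContext.textFile("foo")
val header = lines.first()                  // "hdr"
val df = lines.filter(_ != header).toDF(header)
df.show()
{code}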
[jira] [Commented] (SPARK-17090) Make tree aggregation level in linear/logistic regression configurable
[ https://issues.apache.org/jira/browse/SPARK-17090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430266#comment-15430266 ] Apache Spark commented on SPARK-17090: -- User 'hqzizania' has created a pull request for this issue: https://github.com/apache/spark/pull/14738 > Make tree aggregation level in linear/logistic regression configurable > -- > > Key: SPARK-17090 > URL: https://issues.apache.org/jira/browse/SPARK-17090 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson >Priority: Minor > Fix For: 2.1.0 > > > Linear/logistic regression use treeAggregate with the default aggregation depth > for collecting coefficient gradient updates to the driver. For high-dimensional > problems, this can cause OOM errors on the driver. We should make > it configurable, perhaps via an expert param, so that users can avoid this > problem if their data has many features. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
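[Editor's note] For context, a minimal sketch of the knob being exposed here: RDD.treeAggregate already accepts an aggregation depth, so the change is about plumbing a param through to it. The values below are illustrative.
{code}
// treeAggregate with depth > 2 adds intermediate combine levels, so fewer
// (and smaller) partial results are merged on the driver at once.
val grads = sc.parallelize(1 to 1000000).map(_.toDouble)
val total = grads.treeAggregate(0.0)(
  seqOp = (acc, g) => acc + g,   // fold one record into a partial sum
  combOp = (a, b) => a + b,      // merge two partial sums
  depth = 3)                     // one extra tree level before the driver
{code}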
[jira] [Resolved] (SPARK-17127) Include AArch64 in the check of cached unaligned-access capability
[ https://issues.apache.org/jira/browse/SPARK-17127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17127. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14700 [https://github.com/apache/spark/pull/14700] > Include AArch64 in the check of cached unaligned-access capability > -- > > Key: SPARK-17127 > URL: https://issues.apache.org/jira/browse/SPARK-17127 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 > Environment: AArch64 >Reporter: Richael Zhuang >Priority: Minor > Fix For: 2.1.0 > > > As of Spark 2.0.0, when MemoryMode.OFF_HEAP is set, Spark checks whether > the architecture supports unaligned access. If the check > doesn't pass, an exception is raised. > AArch64 also supports unaligned access, but currently only i386, x86, > amd64, and x86_64 are included. > I think we should include aarch64 when performing the check. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17127) Include AArch64 in the check of cached unaligned-access capability
[ https://issues.apache.org/jira/browse/SPARK-17127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-17127: -- Assignee: Richael Zhuang > Include AArch64 in the check of cached unaligned-access capability > -- > > Key: SPARK-17127 > URL: https://issues.apache.org/jira/browse/SPARK-17127 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 > Environment: AArch64 >Reporter: Richael Zhuang >Assignee: Richael Zhuang >Priority: Minor > Fix For: 2.1.0 > > > As of Spark 2.0.0, when MemoryMode.OFF_HEAP is set, Spark checks whether > the architecture supports unaligned access. If the check > doesn't pass, an exception is raised. > AArch64 also supports unaligned access, but currently only i386, x86, > amd64, and x86_64 are included. > I think we should include aarch64 when performing the check. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17086) QuantileDiscretizer throws InvalidArgumentException (parameter splits given invalid value) on valid data
[ https://issues.apache.org/jira/browse/SPARK-17086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430270#comment-15430270 ] Apache Spark commented on SPARK-17086: -- User 'VinceShieh' has created a pull request for this issue: https://github.com/apache/spark/pull/14747 > QuantileDiscretizer throws InvalidArgumentException (parameter splits given > invalid value) on valid data > > > Key: SPARK-17086 > URL: https://issues.apache.org/jira/browse/SPARK-17086 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Barry Becker > > I discovered this bug when working with a build from the master branch (which > I believe is 2.1.0). This used to work fine when running Spark 1.6.2. > I have a dataframe with an "intData" column that has values like > {code} > 1 3 2 1 1 2 3 2 2 2 1 3 > {code} > I have a stage in my pipeline that uses the QuantileDiscretizer to produce > equal-weight splits like this > {code} > new QuantileDiscretizer() > .setInputCol("intData") > .setOutputCol("intData_bin") > .setNumBuckets(10) > .fit(df) > {code} > But when that gets run it (incorrectly) throws this error: > {code} > parameter splits given invalid value [-Infinity, 1.0, 1.0, 2.0, 2.0, 3.0, > 3.0, Infinity] > {code} > I don't think that there should be duplicate splits generated, should there be? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
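[Editor's note] Until the linked PR lands, a workaround sketch assuming Spark 2.0-style APIs and a numeric input column: compute the quantiles directly, collapse the duplicate split points that heavily tied data produces, and hand the distinct splits to Bucketizer.
{code}
// Hedged workaround: build the splits manually so ties cannot produce the
// "parameter splits given invalid value" failure seen above.
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.functions.col

val dfd = df.withColumn("intData", col("intData").cast("double"))
val probes = (0 to 10).map(_ / 10.0).toArray
val quantiles = dfd.stat.approxQuantile("intData", probes, 0.0)
val splits = (Double.NegativeInfinity +: quantiles :+ Double.PositiveInfinity)
  .distinct.sorted
val binned = new Bucketizer()
  .setInputCol("intData")
  .setOutputCol("intData_bin")
  .setSplits(splits)
  .transform(dfd)
{code}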
[jira] [Assigned] (SPARK-17086) QuantileDiscretizer throws InvalidArgumentException (parameter splits given invalid value) on valid data
[ https://issues.apache.org/jira/browse/SPARK-17086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17086: Assignee: (was: Apache Spark) > QuantileDiscretizer throws InvalidArgumentException (parameter splits given > invalid value) on valid data > > > Key: SPARK-17086 > URL: https://issues.apache.org/jira/browse/SPARK-17086 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Barry Becker > > I discovered this bug when working with a build from the master branch (which > I believe is 2.1.0). This used to work fine when running Spark 1.6.2. > I have a dataframe with an "intData" column that has values like > {code} > 1 3 2 1 1 2 3 2 2 2 1 3 > {code} > I have a stage in my pipeline that uses the QuantileDiscretizer to produce > equal-weight splits like this > {code} > new QuantileDiscretizer() > .setInputCol("intData") > .setOutputCol("intData_bin") > .setNumBuckets(10) > .fit(df) > {code} > But when that gets run it (incorrectly) throws this error: > {code} > parameter splits given invalid value [-Infinity, 1.0, 1.0, 2.0, 2.0, 3.0, > 3.0, Infinity] > {code} > I don't think that there should be duplicate splits generated, should there be? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17086) QuantileDiscretizer throws InvalidArgumentException (parameter splits given invalid value) on valid data
[ https://issues.apache.org/jira/browse/SPARK-17086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17086: Assignee: Apache Spark > QuantileDiscretizer throws InvalidArgumentException (parameter splits given > invalid value) on valid data > > > Key: SPARK-17086 > URL: https://issues.apache.org/jira/browse/SPARK-17086 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Barry Becker >Assignee: Apache Spark > > I discovered this bug when working with a build from the master branch (which > I believe is 2.1.0). This used to work fine when running Spark 1.6.2. > I have a dataframe with an "intData" column that has values like > {code} > 1 3 2 1 1 2 3 2 2 2 1 3 > {code} > I have a stage in my pipeline that uses the QuantileDiscretizer to produce > equal-weight splits like this > {code} > new QuantileDiscretizer() > .setInputCol("intData") > .setOutputCol("intData_bin") > .setNumBuckets(10) > .fit(df) > {code} > But when that gets run it (incorrectly) throws this error: > {code} > parameter splits given invalid value [-Infinity, 1.0, 1.0, 2.0, 2.0, 3.0, > 3.0, Infinity] > {code} > I don't think that there should be duplicate splits generated, should there be? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15285) Generated SpecificSafeProjection.apply method grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-15285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430273#comment-15430273 ] Dongjoon Hyun commented on SPARK-15285: --- Thank you! > Generated SpecificSafeProjection.apply method grows beyond 64 KB > > > Key: SPARK-15285 > URL: https://issues.apache.org/jira/browse/SPARK-15285 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Konstantin Shaposhnikov >Assignee: Kazuaki Ishizaki > Fix For: 2.0.1, 2.1.0 > > > The following code snippet results in > {noformat} > org.codehaus.janino.JaninoRuntimeException: Code of method > "(Ljava/lang/Object;)Ljava/lang/Object;" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection" > grows beyond 64 KB > at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941) > {noformat} > {code} > case class S100(s1:String="1", s2:String="2", s3:String="3", s4:String="4", > s5:String="5", s6:String="6", s7:String="7", s8:String="8", s9:String="9", > s10:String="10", s11:String="11", s12:String="12", s13:String="13", > s14:String="14", s15:String="15", s16:String="16", s17:String="17", > s18:String="18", s19:String="19", s20:String="20", s21:String="21", > s22:String="22", s23:String="23", s24:String="24", s25:String="25", > s26:String="26", s27:String="27", s28:String="28", s29:String="29", > s30:String="30", s31:String="31", s32:String="32", s33:String="33", > s34:String="34", s35:String="35", s36:String="36", s37:String="37", > s38:String="38", s39:String="39", s40:String="40", s41:String="41", > s42:String="42", s43:String="43", s44:String="44", s45:String="45", > s46:String="46", s47:String="47", s48:String="48", s49:String="49", > s50:String="50", s51:String="51", s52:String="52", s53:String="53", > s54:String="54", s55:String="55", s56:String="56", s57:String="57", > s58:String="58", s59:String="59", s60:String="60", s61:String="61", > s62:String="62", s63:String="63", s64:String="64", s65:String="65", > s66:String="66", s67:String="67", s68:String="68", s69:String="69", > s70:String="70", s71:String="71", s72:String="72", s73:String="73", > s74:String="74", s75:String="75", s76:String="76", s77:String="77", > s78:String="78", s79:String="79", s80:String="80", s81:String="81", > s82:String="82", s83:String="83", s84:String="84", s85:String="85", > s86:String="86", s87:String="87", s88:String="88", s89:String="89", > s90:String="90", s91:String="91", s92:String="92", s93:String="93", > s94:String="94", s95:String="95", s96:String="96", s97:String="97", > s98:String="98", s99:String="99", s100:String="100") > case class S(s1: S100=S100(), s2: S100=S100(), s3: S100=S100(), s4: > S100=S100(), s5: S100=S100(), s6: S100=S100(), s7: S100=S100(), s8: > S100=S100(), s9: S100=S100(), s10: S100=S100()) > val ds = Seq(S(),S(),S()).toDS > ds.show() > {code} > I could reproduce this with Spark built from 1.6 branch and with > https://home.apache.org/~pwendell/spark-nightly/spark-master-bin/spark-2.0.0-SNAPSHOT-2016_05_11_01_03-8beae59-bin/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17172) pyspark hiveContext cannot create UDF: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
[ https://issues.apache.org/jira/browse/SPARK-17172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17172. --- Resolution: Duplicate That sounds like exactly the same issue. You're missing /tmp or can't see it because of a permissions issue. Both sound like an environment problem, because this dir exists and is world-writable after setting up any HDFS namenode. At the least, let's merge into your other issue. > pyspark hiveContext cannot create UDF: Py4JJavaError: An error occurred while > calling None.org.apache.spark.sql.hive.HiveContext. > -- > > Key: SPARK-17172 > URL: https://issues.apache.org/jira/browse/SPARK-17172 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.2 > Environment: spark version: 1.6.2 > python version: 3.4.2 (v3.4.2:ab2c023a9432, Oct 5 2014, 20:42:22) > [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] >Reporter: Andrew Davidson > Attachments: hiveUDFBug.html, hiveUDFBug.ipynb > > > from pyspark.sql import HiveContext > sqlContext = HiveContext(sc) > # Define udf > from pyspark.sql.types import StringType > from pyspark.sql.functions import udf > def scoreToCategory(score): > if score >= 80: return 'A' > elif score >= 60: return 'B' > elif score >= 35: return 'C' > else: return 'D' > > udfScoreToCategory=udf(scoreToCategory, StringType()) > throws an exception: > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext. > : java.lang.RuntimeException: java.lang.RuntimeException: Unable to > instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17143) pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp
[ https://issues.apache.org/jira/browse/SPARK-17143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430278#comment-15430278 ] Sean Owen commented on SPARK-17143: --- This sounds like an HDFS environment problem then. This dir would exist and be writable to all users when HDFS's file system is created. > pyspark unable to create UDF: java.lang.RuntimeException: > org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a > directory: /tmp tmp > --- > > Key: SPARK-17143 > URL: https://issues.apache.org/jira/browse/SPARK-17143 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 > Environment: spark version: 1.6.1 > python version: 3.4.3 (default, Apr 1 2015, 18:10:40) > [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] >Reporter: Andrew Davidson > Attachments: udfBug.html, udfBug.ipynb > > > For an unknown reason I cannot create a UDF when I run the attached notebook on > my cluster. I get the following error: > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext. > : java.lang.RuntimeException: > org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a > directory: /tmp tmp > The notebook runs fine on my Mac. > In general I am able to run non-UDF Spark code without any trouble. > I start the notebook server as the user "ec2-user" and use the master URL > spark://ec2-51-215-120-63.us-west-1.compute.amazonaws.com:6066 > I found the following message in the notebook server log file. I have the log > level set to warn: > 16/08/18 21:38:45 WARN ObjectStore: Version information not found in > metastore. hive.metastore.schema.verification is not enabled so recording the > schema version 1.2.0 > 16/08/18 21:38:45 WARN ObjectStore: Failed to get database default, returning > NoSuchObjectException > The cluster was originally created using > spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2 > #from pyspark.sql import SQLContext, HiveContext > #sqlContext = SQLContext(sc) > > #from pyspark.sql import DataFrame > #from pyspark.sql import functions > > from pyspark.sql.types import StringType > from pyspark.sql.functions import udf > > print("spark version: {}".format(sc.version)) > > import sys > print("python version: {}".format(sys.version)) > spark version: 1.6.1 > python version: 3.4.3 (default, Apr 1 2015, 18:10:40) > [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] > # functions.lower() raises > # py4j.Py4JException: Method lower([class java.lang.String]) does not exist > # workaround: define a UDF > toLowerUDFRetType = StringType() > #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType) > toLowerUDF = udf(lambda s : s.lower(), StringType()) > You must build Spark with Hive.
Export 'SPARK_HIVE=true' and run build/sbt > assembly > Py4JJavaErrorTraceback (most recent call last) > in () > 4 toLowerUDFRetType = StringType() > 5 #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType) > > 6 toLowerUDF = udf(lambda s : s.lower(), StringType()) > /root/spark/python/pyspark/sql/functions.py in udf(f, returnType) >1595 [Row(slen=5), Row(slen=3)] >1596 """ > -> 1597 return UserDefinedFunction(f, returnType) >1598 >1599 blacklist = ['map', 'since', 'ignore_unicode_prefix'] > /root/spark/python/pyspark/sql/functions.py in __init__(self, func, > returnType, name) >1556 self.returnType = returnType >1557 self._broadcast = None > -> 1558 self._judf = self._create_judf(name) >1559 >1560 def _create_judf(self, name): > /root/spark/python/pyspark/sql/functions.py in _create_judf(self, name) >1567 pickled_command, broadcast_vars, env, includes = > _prepare_for_python_RDD(sc, command, self) >1568 ctx = SQLContext.getOrCreate(sc) > -> 1569 jdt = ctx._ssql_ctx.parseDataType(self.returnType.json()) >1570 if name is None: >1571 name = f.__name__ if hasattr(f, '__name__') else > f.__class__.__name__ > /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self) > 681 try: > 682 if not hasattr(self, '_scala_HiveContext'): > --> 683 self._scala_HiveContext = self._get_hive_ctx() > 684 return self._scala_HiveContext > 685 except Py4JError as e: > /root/spark/python/pyspark/sql/context.py in _get_hive_ctx(self) > 690 > 691 def _get_hive_ctx(self): > --> 692 return self._jvm.HiveContext(self._jsc.sc()) > 693 > 694 def refreshTable(self, tableName): > /root/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in > __call__(self, *args) >1062 answer = self._gateway_client.sen
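[Editor's note] A sketch of the environment fix implied by Sean's comment, assuming you can run code with HDFS superuser rights: make sure /tmp exists in HDFS as a directory with the world-writable sticky-bit mode (1777) that a fresh namenode setup provides.
{code}
// Hedged repair sketch for the "Parent path is not a directory: /tmp tmp"
// failure: recreate /tmp in HDFS with mode 1777.
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.fs.permission.FsPermission

val fs = FileSystem.get(sc.hadoopConfiguration)
val tmp = new Path("/tmp")
if (!fs.exists(tmp)) fs.mkdirs(tmp)
fs.setPermission(tmp, new FsPermission(Integer.parseInt("1777", 8).toShort))
{code}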
[jira] [Resolved] (SPARK-17115) Improve the performance of UnsafeProjection for wide table
[ https://issues.apache.org/jira/browse/SPARK-17115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-17115. - Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Issue resolved by pull request 14692 [https://github.com/apache/spark/pull/14692] > Improve the performance of UnsafeProjection for wide table > -- > > Key: SPARK-17115 > URL: https://issues.apache.org/jira/browse/SPARK-17115 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.1, 2.1.0 > > > We increased the threshold for splitting the generated code for expressions to > 64K in 2.0 by accident (too optimistic), which could cause bad performance > (the huge method may not be JITed); we should decrease it to 16K. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17169) To use scala macros to update code when SharedParamsCodeGen.scala changed
[ https://issues.apache.org/jira/browse/SPARK-17169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430282#comment-15430282 ] Yanbo Liang commented on SPARK-17169: - Meanwhile, it would be better to do compile-time code-gen for the Python params as well, that is, to run {{python _shared_params_code_gen.py > shared.py}} automatically. > To use scala macros to update code when SharedParamsCodeGen.scala changed > - > > Key: SPARK-17169 > URL: https://issues.apache.org/jira/browse/SPARK-17169 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Qian Huang >Priority: Minor > > As noted in the file SharedParamsCodeGen.scala, we have to manually run > build/sbt "mllib/runMain org.apache.spark.ml.param.shared.SharedParamsCodeGen" > to regenerate and update it. > It would be better to do compile-time code-gen for this using Scala macros > rather than running the script as described above. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10925) Exception when joining DataFrames
[ https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430290#comment-15430290 ] Alexander Bij commented on SPARK-10925: --- We also encountered this issue with *Spark 1.6.1* (HDP 2.4.2.0). Our example: {code} // id, datum are String columns. val mt = hiveContext.sql("SELECT id, datum FROM my_test") // create aggregated dataframe: val my_max = mt.groupBy("id").agg(max("datum")).withColumnRenamed("max(datum)", "datum") // Fails (start with Aggregation-DataFrame) val my_max_mt = my_max.join(mt, my_max("datum") === mt("datum"), "inner") // Works (start with normal-DataFrame) val my_max_mt = mt.join(my_max, my_max("datum") === mt("datum"), "inner") // running these queries as SQL works both ways. {code} > Exception when joining DataFrames > - > > Key: SPARK-10925 > URL: https://issues.apache.org/jira/browse/SPARK-10925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 > Environment: Tested with Spark 1.5.0 and Spark 1.5.1 >Reporter: Alexis Seigneurin > Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase2.scala > > > I get an exception when joining a DataFrame with another DataFrame. The > second DataFrame was created by performing an aggregation on the first > DataFrame. > My complete workflow is: > # read the DataFrame > # apply a UDF on column "name" > # apply a UDF on column "surname" > # apply a UDF on column "birthDate" > # aggregate on "name" and re-join with the DF > # aggregate on "surname" and re-join with the DF > If I remove one step, the process completes normally. > Here is the exception: > {code} > Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved > attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in > operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS > birthDate_cleaned#8]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at >
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark
[jira] [Comment Edited] (SPARK-10925) Exception when joining DataFrames
[ https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430290#comment-15430290 ] Alexander Bij edited comment on SPARK-10925 at 8/22/16 8:29 AM: We also encountered this issue with *Spark 1.6.1* (HDP 2.4.2.0). Our example: {code} // id, datum are String columns. val mt = hiveContext.sql("SELECT id, datum FROM my_test") // create aggregated dataframe: val my_max = mt.groupBy("id").agg(max("datum")).withColumnRenamed("max(datum)", "datum") // Fails (start with Aggregation-DataFrame) val my_max_mt = my_max.join(mt, my_max("datum") === mt("datum"), "inner") // Works (start with normal-DataFrame) val my_max_mt = mt.join(my_max, my_max("datum") === mt("datum"), "inner") // running these queries as SQL works both ways. {code} {code} // Complaining about the datum#526 which is the 'mt("datum")' in the query: org.apache.spark.sql.AnalysisException: resolved attribute(s) datum#526 missing from id#525,datum#528,id#532,datum#533 in operator !Join Inner, Some((datum#528 = datum#526)); at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:183) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:121) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50) {code} was (Author: abij): We also encountered this issue with *Spark 1.6.1* (HDP 2.4.2.0). Our example: {code} // id, datum are String columns. val mt = hiveContext.sql("SELECT id, datum FROM my_test") // create aggregated dataframe: val my_max = mt.groupBy("id").agg(max("datum")).withColumnRenamed("max(datum)", "datum") // Fails (start with Aggregation-DataFrame) val my_max_mt = my_max.join(mt, my_max("datum") === mt("datum"), "inner") // Works (start with normal-DataFrame) val my_max_mt = mt.join(my_max, my_max("datum") === mt("datum"), "inner") // running these queries as SQL works both ways. {code} > Exception when joining DataFrames > - > > Key: SPARK-10925 > URL: https://issues.apache.org/jira/browse/SPARK-10925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 > Environment: Tested with Spark 1.5.0 and Spark 1.5.1 >Reporter: Alexis Seigneurin > Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase2.scala > > > I get an exception when joining a DataFrame with another DataFrame. The > second DataFrame was created by performing an aggregation on the first > DataFrame. > My complete workflow is: > # read the DataFrame > # apply a UDF on column "name" > # apply a UDF on column "surname" > # apply a UDF on column "birthDate" > # aggregate on "name" and re-join with the DF > # aggregate on "surname" and re-join with the DF > If I remove one step, the process completes normally.
> Here is the exception: > {code} > Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved > attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in > operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS > birthDate_cleaned#8]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apac
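[Editor's note] A workaround sketch for the resolution failure above, reusing the variable names from the 1.6.1 example: give one side of the self-derived join fresh column names so the analyzer sees distinct attribute lineages, then join on the renamed column. The "_r" suffix is illustrative.
{code}
// Hedged workaround: re-aliasing every column on one side avoids the
// ambiguous "datum" attribute between the two lineages of mt.
import org.apache.spark.sql.functions.col

val mtR = mt.select(mt.columns.map(c => col(c).as(c + "_r")): _*)
val joined = my_max.join(mtR, my_max("datum") === mtR("datum_r"), "inner")
{code}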
[jira] [Updated] (SPARK-17085) Documentation and actual code differs - Unsupported Operations
[ https://issues.apache.org/jira/browse/SPARK-17085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-17085: -- Assignee: Jagadeesan A S > Documentation and actual code differs - Unsupported Operations > -- > > Key: SPARK-17085 > URL: https://issues.apache.org/jira/browse/SPARK-17085 > Project: Spark > Issue Type: Documentation > Components: Streaming >Affects Versions: 2.0.0 >Reporter: Samritti >Assignee: Jagadeesan A S >Priority: Minor > > The Spark Structured Streaming doc at this link > https://spark.apache.org/docs/2.0.0/structured-streaming-programming-guide.html#unsupported-operations > mentions > >>>"Right outer join with a streaming Dataset on the right is not supported" > but the code here conveys a different/opposite error > https://github.com/apache/spark/blob/5545b791096756b07b3207fb3de13b68b9a37b00/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationChecker.scala#L114 > >>>"Right outer join with a streaming DataFrame/Dataset on the left is " + > "not supported" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17085) Documentation and actual code differs - Unsupported Operations
[ https://issues.apache.org/jira/browse/SPARK-17085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17085. --- Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Resolved by https://github.com/apache/spark/pull/14715 > Documentation and actual code differs - Unsupported Operations > -- > > Key: SPARK-17085 > URL: https://issues.apache.org/jira/browse/SPARK-17085 > Project: Spark > Issue Type: Documentation > Components: Streaming >Affects Versions: 2.0.0 >Reporter: Samritti >Assignee: Jagadeesan A S >Priority: Minor > Fix For: 2.0.1, 2.1.0 > > > The Spark Structured Streaming doc at this link > https://spark.apache.org/docs/2.0.0/structured-streaming-programming-guide.html#unsupported-operations > mentions > >>>"Right outer join with a streaming Dataset on the right is not supported" > but the code here conveys a different/opposite error > https://github.com/apache/spark/blob/5545b791096756b07b3207fb3de13b68b9a37b00/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationChecker.scala#L114 > >>>"Right outer join with a streaming DataFrame/Dataset on the left is " + > "not supported" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10925) Exception when joining DataFrames
[ https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430302#comment-15430302 ] Alexander Bij commented on SPARK-10925: --- This relates to SPARK-14948 (exception when joining the same DF). > Exception when joining DataFrames > - > > Key: SPARK-10925 > URL: https://issues.apache.org/jira/browse/SPARK-10925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 > Environment: Tested with Spark 1.5.0 and Spark 1.5.1 >Reporter: Alexis Seigneurin > Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase2.scala > > > I get an exception when joining a DataFrame with another DataFrame. The > second DataFrame was created by performing an aggregation on the first > DataFrame. > My complete workflow is: > # read the DataFrame > # apply a UDF on column "name" > # apply a UDF on column "surname" > # apply a UDF on column "birthDate" > # aggregate on "name" and re-join with the DF > # aggregate on "surname" and re-join with the DF > If I remove one step, the process completes normally. > Here is the exception: > {code} > Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved > attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in > operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS > birthDate_cleaned#8]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at >
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark.sql.DataFrame.join(DataFrame.scala:553) > at org.apache.spark.sql.DataFrame.join(DataFrame.scala:520) > at TestCase2$.main(TestCase2.scala:51) > at TestCase2.main(TestCase2.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at com.inte
[jira] [Updated] (SPARK-17179) Consider improving partition pruning in HiveMetastoreCatalog
[ https://issues.apache.org/jira/browse/SPARK-17179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated SPARK-17179: --- Affects Version/s: 2.0.0 Priority: Major (was: Critical) Description: Issue: - Create an external table with 1000s of partitions - Running a simple query with partition details ends up listing all files for caching in ListingFileCatalog. This turns out to be very slow with cloud-based FS access (e.g. S3). Even though ListingFileCatalog supports multi-threading, it ends up unnecessarily listing 1000+ files when the user is just interested in 1 partition. - This adds additional overhead in HiveMetastoreCatalog, as it queries all partitions in convertToLogicalRelation (metastoreRelation.getHiveQlPartitions()). Partition-related details are not passed in here, so it ends up overloading the Hive metastore. - Also, if any partition changes, the cache is dirtied and has to be re-populated. It would be nice to prune the partitions in the metastore layer itself, so that only a few partitions are looked up via the FileSystem and only a few items are cached. {noformat} "CREATE EXTERNAL TABLE `ca_par_ext`( `customer_id` bigint, `account_id` bigint) PARTITIONED BY ( `effective_date` date) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 's3a://bucket_details/ca_par'" explain select count(*) from ca_par_ext where effective_date between '2015-12-17' and '2015-12-18'; {noformat} > Consider improving partition pruning in HiveMetastoreCatalog > > > Key: SPARK-17179 > URL: https://issues.apache.org/jira/browse/SPARK-17179 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Rajesh Balamohan > > Issue: > - Create an external table with 1000s of partitions > - Running a simple query with partition details ends up listing all files for > caching in ListingFileCatalog. This turns out to be very slow with > cloud-based FS access (e.g. S3).
Even though ListingFileCatalog supports > multi-threading, it ends up unnecessarily listing 1000+ files when the user > is just interested in 1 partition. > - This adds additional overhead in HiveMetastoreCatalog, as it queries all > partitions in convertToLogicalRelation > (metastoreRelation.getHiveQlPartitions()). Partition-related details > are not passed in here, so it ends up overloading the Hive metastore. > - Also, if any partition changes, the cache is dirtied and has to be > re-populated. It would be nice to prune the partitions in the metastore layer > itself, so that only a few partitions are looked up via the FileSystem and only a few > items are cached. > {noformat} > "CREATE EXTERNAL TABLE `ca_par_ext`( > `customer_id` bigint, > `account_id` bigint) > PARTITIONED BY ( > `effective_date` date) > ROW FORMAT SERDE > 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' > STORED AS INPUTFORMAT > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' > LOCATION > 's3a://bucket_details/ca_par'" > explain select count(*) from ca_par_ext where effect
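For concreteness, the kind of metastore-side pruning being asked for can be sketched with Hive's own client API, which accepts a partition filter string. This is only an illustrative Scala sketch, not Spark's HiveMetastoreCatalog code; it assumes a reachable metastore and the `ca_par_ext` table above, and note the metastore imposes its own restrictions on which filter expressions it can evaluate (historically, mostly string-typed partition columns).
{code}
// Illustrative sketch only: ask the Hive metastore for just the matching
// partitions instead of fetching all of them and listing every file.
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient

import scala.collection.JavaConverters._

val client = new HiveMetaStoreClient(new HiveConf())

// Only the pruned partitions come back, so only their directories would
// need to be listed on S3, instead of 1000+.
val pruned = client.listPartitionsByFilter(
  "default", "ca_par_ext",
  "effective_date >= '2015-12-17' and effective_date <= '2015-12-18'",
  Short.MaxValue).asScala

pruned.foreach(p => println(p.getSd.getLocation))
client.close()
{code}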
[jira] [Commented] (SPARK-14948) Exception when joining DataFrames derived from the same DataFrame
[ https://issues.apache.org/jira/browse/SPARK-14948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430312#comment-15430312 ] Alexander Bij commented on SPARK-14948: --- We encountered the same issue with Spark 1.6.1. I have posted a simple Scala example in SPARK-10925 (a duplicate of this issue). > Exception when joining DataFrames derived from the same DataFrame > - > > Key: SPARK-14948 > URL: https://issues.apache.org/jira/browse/SPARK-14948 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Saurabh Santhosh > > h2. Spark Analyser is throwing the following exception in a specific > scenario: > h2. Exception: > org.apache.spark.sql.AnalysisException: resolved attribute(s) F1#3 missing > from asd#5,F2#4,F1#6,F2#7 in operator !Project [asd#5,F1#3]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) > h2. Code : > {code:title=SparkClient.java|borderStyle=solid} > StructField[] fields = new StructField[2]; > fields[0] = new StructField("F1", DataTypes.StringType, true, > Metadata.empty()); > fields[1] = new StructField("F2", DataTypes.StringType, true, > Metadata.empty()); > JavaRDD rdd = > > sparkClient.getJavaSparkContext().parallelize(Arrays.asList(RowFactory.create("a", > "b"))); > DataFrame df = sparkClient.getSparkHiveContext().createDataFrame(rdd, new > StructType(fields)); > sparkClient.getSparkHiveContext().registerDataFrameAsTable(df, "t1"); > DataFrame aliasedDf = sparkClient.getSparkHiveContext().sql("select F1 as > asd, F2 from t1"); > sparkClient.getSparkHiveContext().registerDataFrameAsTable(aliasedDf, > "t2"); > sparkClient.getSparkHiveContext().registerDataFrameAsTable(df, "t3"); > > DataFrame join = aliasedDf.join(df, > aliasedDf.col("F2").equalTo(df.col("F2")), "inner"); > DataFrame select = join.select(aliasedDf.col("asd"), df.col("F1")); > select.collect(); > {code} > h2. Observations : > * This issue is related to the Data Type of Fields of the initial > DataFrame. (If the Data Type is not String, it will work.) > * It works fine if the data frame is registered as a temporary table and an > SQL query (select a.asd,b.F1 from t2 a inner join t3 b on a.F2=b.F2) is written. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
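The second observation above already contains the workaround; for completeness, here is a minimal Scala sketch of it (assuming a {{df}} built as in the Java snippet, on Spark 1.6 where {{registerTempTable}} is the current API): route the join through temp tables and plain SQL instead of reusing Column objects from both derived DataFrames.
{code}
// Workaround sketch: the SQL route from observation 2 avoids the
// "resolved attribute(s) ... missing" analysis error.
df.registerTempTable("t3")
df.sqlContext.sql("select F1 as asd, F2 from t3").registerTempTable("t2")

val joined = df.sqlContext.sql(
  "select a.asd, b.F1 from t2 a inner join t3 b on a.F2 = b.F2")
joined.collect()
{code}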
[jira] [Commented] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually
[ https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430314#comment-15430314 ] Sean Owen commented on SPARK-15044: --- The use case here is that someone or something deleted some data from an immutable data set that shouldn't have been deleted? That doesn't sound like a good use case. I understand the argument for trying to proceed to return a meaningful result anyway. But I'm saying there isn't a meaningful result here because the data set conceptually contains data that can no longer be correctly queried. Compare also to the risks of silently (or, with a warning that scrolls by) succeeding in returning an incomplete result. > spark-sql will throw "input path does not exist" exception if it handles a > partition which exists in hive table, but the path is removed manually > - > > Key: SPARK-15044 > URL: https://issues.apache.org/jira/browse/SPARK-15044 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: huangyu > > spark-sql will throw "input path not exist" exception if it handles a > partition which exists in hive table, but the path is removed manually.The > situation is as follows: > 1) Create a table "test". "create table test (n string) partitioned by (p > string)" > 2) Load some data into partition(p='1') > 3)Remove the path related to partition(p='1') of table test manually. "hadoop > fs -rmr /warehouse//test/p=1" > 4)Run spark sql, spark-sql -e "select n from test where p='1';" > Then it throws exception: > {code} > org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: > ./test/p=1 > at > org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285) > at > org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228) > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) > at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > {code} > The bug is in Spark 1.6.1; if I use Spark 1.4.0, it is OK. > I think spark-sql should ignore the path, just like Hive does (and as earlier Spark > versions did), rather than throw an exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org F
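Until the behavior is changed, two user-side mitigations are possible, sketched below in Scala against the Spark 1.6 API. The config {{spark.sql.hive.verifyPartitionPath}} exists in 1.6 for skipping nonexistent partition paths, though whether it fully covers this case is untested here.
{code}
// Option 1: have Spark verify partition paths and skip the missing ones.
sqlContext.setConf("spark.sql.hive.verifyPartitionPath", "true")

// Option 2: realign the metastore with the filesystem by dropping the
// partition whose directory was removed manually.
sqlContext.sql("ALTER TABLE test DROP PARTITION (p='1')")

// After either step the query should no longer throw:
sqlContext.sql("select n from test where p='1'").show()
{code}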
[jira] [Commented] (SPARK-16781) java launched by PySpark as gateway may not be the same java used in the spark environment
[ https://issues.apache.org/jira/browse/SPARK-16781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430341#comment-15430341 ] Apache Spark commented on SPARK-16781: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/14748 > java launched by PySpark as gateway may not be the same java used in the > spark environment > -- > > Key: SPARK-16781 > URL: https://issues.apache.org/jira/browse/SPARK-16781 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.2 >Reporter: Michael Berman > > When launching spark on a system with multiple javas installed, there are a > few options for choosing which JRE to use, setting `JAVA_HOME` being the most > straightforward. > However, when pyspark's internal py4j launches its JavaGateway, it always > invokes `java` directly, without qualification. This means you get whatever > java's first on your path, which is not necessarily the same one in spark's > JAVA_HOME. > This could be seen as a py4j issue, but from their point of view, the fix is > easy: make sure the java you want is first on your path. I can't figure out a > way to make that reliably happen through the pyspark executor launch path, > and it seems like something that would ideally happen automatically. If I set > JAVA_HOME when launching spark, I would expect that to be the only java used > throughout the stack. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16781) java launched by PySpark as gateway may not be the same java used in the spark environment
[ https://issues.apache.org/jira/browse/SPARK-16781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16781: Assignee: Apache Spark > java launched by PySpark as gateway may not be the same java used in the > spark environment > -- > > Key: SPARK-16781 > URL: https://issues.apache.org/jira/browse/SPARK-16781 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.2 >Reporter: Michael Berman >Assignee: Apache Spark > > When launching spark on a system with multiple javas installed, there are a > few options for choosing which JRE to use, setting `JAVA_HOME` being the most > straightforward. > However, when pyspark's internal py4j launches its JavaGateway, it always > invokes `java` directly, without qualification. This means you get whatever > java's first on your path, which is not necessarily the same one in spark's > JAVA_HOME. > This could be seen as a py4j issue, but from their point of view, the fix is > easy: make sure the java you want is first on your path. I can't figure out a > way to make that reliably happen through the pyspark executor launch path, > and it seems like something that would ideally happen automatically. If I set > JAVA_HOME when launching spark, I would expect that to be the only java used > throughout the stack. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16781) java launched by PySpark as gateway may not be the same java used in the spark environment
[ https://issues.apache.org/jira/browse/SPARK-16781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16781: Assignee: (was: Apache Spark) > java launched by PySpark as gateway may not be the same java used in the > spark environment > -- > > Key: SPARK-16781 > URL: https://issues.apache.org/jira/browse/SPARK-16781 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.2 >Reporter: Michael Berman > > When launching spark on a system with multiple javas installed, there are a > few options for choosing which JRE to use, setting `JAVA_HOME` being the most > straightforward. > However, when pyspark's internal py4j launches its JavaGateway, it always > invokes `java` directly, without qualification. This means you get whatever > java's first on your path, which is not necessarily the same one in spark's > JAVA_HOME. > This could be seen as a py4j issue, but from their point of view, the fix is > easy: make sure the java you want is first on your path. I can't figure out a > way to make that reliably happen through the pyspark executor launch path, > and it seems like something that would ideally happen automatically. If I set > JAVA_HOME when launching spark, I would expect that to be the only java used > throughout the stack. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
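The direction of the fix for SPARK-16781 can be illustrated in a few lines: resolve the gateway's java binary from JAVA_HOME first and only fall back to the bare {{java}} on the PATH. This Scala sketch just mirrors that resolution order for illustration; the real change lives on the launcher/PySpark side.
{code}
// Sketch: prefer $JAVA_HOME/bin/java, fall back to `java` on PATH.
import java.io.File

def javaExecutable(): String =
  sys.env.get("JAVA_HOME")
    .map(home => new File(new File(home, "bin"), "java").getAbsolutePath)
    .filter(path => new File(path).isFile)
    .getOrElse("java")

println(s"Would launch the gateway with: ${javaExecutable()}")
{code}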
[jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430359#comment-15430359 ] Sean Owen commented on SPARK-17055: --- Yes, I understand how model fitting works. If a label is present in test but not train data, the model will never predict it. This doesn't require any code at all; it's always true. What would you be evaluating by holding out all data with a certain label from training? > add labelKFold to CrossValidator > > > Key: SPARK-17055 > URL: https://issues.apache.org/jira/browse/SPARK-17055 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Vincent >Priority: Minor > > Current CrossValidator only supports k-fold, which randomly divides all the > samples in k groups of samples. But in cases when data is gathered from > different subjects and we want to avoid over-fitting, we want to hold out > samples with certain labels from training data and put them into validation > fold, i.e. we want to ensure that the same label is not in both testing and > training sets. > Mainstream packages like Sklearn already supports such cross validation > method. > (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430381#comment-15430381 ] Vincent commented on SPARK-17055: - well, a better model will have a better cv performance on data with unseen labels, so the final selected model will have a relatively better capability on predicting samples with unseen categories/labels in real case. > add labelKFold to CrossValidator > > > Key: SPARK-17055 > URL: https://issues.apache.org/jira/browse/SPARK-17055 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Vincent >Priority: Minor > > Current CrossValidator only supports k-fold, which randomly divides all the > samples in k groups of samples. But in cases when data is gathered from > different subjects and we want to avoid over-fitting, we want to hold out > samples with certain labels from training data and put them into validation > fold, i.e. we want to ensure that the same label is not in both testing and > training sets. > Mainstream packages like Sklearn already supports such cross validation > method. > (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17055) add labelKFold to CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430381#comment-15430381 ] Vincent edited comment on SPARK-17055 at 8/22/16 9:14 AM: -- well, a better model will have a better cv performance on validation data with unseen labels, so the final selected model will have a relatively better capability on predicting samples with unseen categories/labels in real case. was (Author: vincexie): well, a better model will have a better cv performance on data with unseen labels, so the final selected model will have a relatively better capability on predicting samples with unseen categories/labels in real case. > add labelKFold to CrossValidator > > > Key: SPARK-17055 > URL: https://issues.apache.org/jira/browse/SPARK-17055 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Vincent >Priority: Minor > > Current CrossValidator only supports k-fold, which randomly divides all the > samples in k groups of samples. But in cases when data is gathered from > different subjects and we want to avoid over-fitting, we want to hold out > samples with certain labels from training data and put them into validation > fold, i.e. we want to ensure that the same label is not in both testing and > training sets. > Mainstream packages like Sklearn already supports such cross validation > method. > (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-10110) StringIndexer lacks of parameter "handleInvalid".
[ https://issues.apache.org/jira/browse/SPARK-10110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath closed SPARK-10110. -- Resolution: Duplicate > StringIndexer lacks of parameter "handleInvalid". > - > > Key: SPARK-10110 > URL: https://issues.apache.org/jira/browse/SPARK-10110 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Kai Sasaki > Labels: ML > > Missing API for pyspark {{StringIndexer.handleInvalid}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430386#comment-15430386 ] Sean Owen commented on SPARK-17055: --- The model will always have 0% accuracy on CV / test data whose label was not in the training data. Can you give me an example that clarifies what you have in mind? I don't think this statement is true. > add labelKFold to CrossValidator > > > Key: SPARK-17055 > URL: https://issues.apache.org/jira/browse/SPARK-17055 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Vincent >Priority: Minor > > Current CrossValidator only supports k-fold, which randomly divides all the > samples in k groups of samples. But in cases when data is gathered from > different subjects and we want to avoid over-fitting, we want to hold out > samples with certain labels from training data and put them into validation > fold, i.e. we want to ensure that the same label is not in both testing and > training sets. > Mainstream packages like Sklearn already supports such cross validation > method. > (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
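To make the proposal being debated concrete, a label-aware split can be sketched outside of CrossValidator: hash every distinct label/group value to exactly one fold, so no group lands on both sides of a split. This is a hypothetical helper, not an MLlib API; it uses the Spark 2.0 SQL functions {{hash}} and {{pmod}}.
{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, hash, lit, pmod}

// Hypothetical labelKFold: each group value maps to one fold, so the
// validation side of split i contains only groups unseen in training.
def labelKFold(df: DataFrame, groupCol: String, k: Int): Seq[(DataFrame, DataFrame)] = {
  val withFold = df.withColumn("fold", pmod(hash(col(groupCol)), lit(k)))
  (0 until k).map { i =>
    (withFold.filter(col("fold") =!= i).drop("fold"),  // training folds
     withFold.filter(col("fold") === i).drop("fold"))  // validation fold
  }
}
{code}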
[jira] [Commented] (SPARK-17148) NodeManager exit because of exception “Executor is not registered”
[ https://issues.apache.org/jira/browse/SPARK-17148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430402#comment-15430402 ] Saisai Shao commented on SPARK-17148: - I manually verified this by explicitly throwing the RuntimeException in external shuffle service, from my test NM will not exit simply because of this exception. I guess the exit of NM may be due to other issues. > NodeManager exit because of exception “Executor is not registered” > -- > > Key: SPARK-17148 > URL: https://issues.apache.org/jira/browse/SPARK-17148 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.6.2 > Environment: hadoop 2.7.2 spark 1.6.2 >Reporter: cen yuhai > > java.lang.RuntimeException: Executor is not registered > (appId=application_1467288504738_1341061, execId=423) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:183) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.handleMessage(ExternalShuffleBlockHandler.java:85) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.receive(ExternalShuffleBlockHandler.java:72) > at > org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:149) > at > org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:102) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) > at > io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at 
java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16970) [spark2.0] spark2.0 doesn't catch the java exception thrown by reflect function in sql statement which causes the job abort
[ https://issues.apache.org/jira/browse/SPARK-16970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16970. --- Resolution: Not A Problem > [spark2.0] spark2.0 doesn't catch the java exception thrown by reflect > function in sql statement which causes the job abort > --- > > Key: SPARK-16970 > URL: https://issues.apache.org/jira/browse/SPARK-16970 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: marymwu > > [spark2.0] spark2.0 doesn't catch the java exception thrown by reflect > function in the sql statement which causes the job abort > steps: > 1. select reflect('java.net.URLDecoder','decode','%%E7','utf-8') test; > -->"%%" which causes the java exception > error: > 16/08/09 15:56:38 INFO DAGScheduler: Job 1 failed: run at > AccessController.java:-2, took 7.018147 s > 16/08/09 15:56:38 ERROR SparkExecuteStatementOperation: Error executing > query, currentState RUNNING, > org.apache.spark.SparkException: Job aborted due to stage failure: Task 162 > in stage 1.0 failed 8 times, most recent failure: Lost task 162.7 in stage > 1.0 (TID 207, slave7.lenovomm2.com): org.apache.spark.SparkException: Task > failed while writing rows. > at > org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.writeToFile(hiveWriterContainers.scala:330) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:131) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:131) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.reflect.InvocationTargetException > at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection.eval(CallMethodViaReflection.scala:87) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.writeToFile(hiveWriterContainers.scala:288) > ... 8 more > Caused by: java.lang.IllegalArgumentException: URLDecoder: Illegal hex > characters in escape (%) pattern - For input string: "%E" > at java.net.URLDecoder.decode(URLDecoder.java:192) > ... 19 more -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16367) Wheelhouse Support for PySpark
[ https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Semet updated SPARK-16367: -- Description: *Rationale* To deploy packages written in Scala, the recommendation is to build big fat jar files. This keeps all dependencies in one package, so the only "cost" is the copy time to deploy this file on every Spark node. On the other hand, Python deployment is more difficult once you want to use external packages, and you don't really want to mess with the IT to deploy the packages on the virtualenv of each node. This ticket proposes to allow users to deploy their job as "wheel" packages. The Python community strongly advocates this way of packaging and distributing Python applications as the standard way of deployment. In other words, this is the "Pythonic Way of Deployment". *Previous approaches* I based the current proposal on the two following bugs related to this point: - SPARK-6764 ("Wheel support for PySpark") - SPARK-13587 ("Support virtualenv in PySpark") The first part of my proposal is to merge the two, in order to support wheel installation and virtualenv creation. *Virtualenv, wheel support and "Uber Fat Wheelhouse" for PySpark* In Python, the packaging standard is now the "wheel" file format, which goes further than the good old ".egg" files. With a wheel file (".whl"), the package is already prepared for a given architecture. You can have several wheels for a given package version, each specific to an architecture or environment. For example, look at https://pypi.python.org/pypi/numpy for all the different versions of wheels available. The {{pip}} tool knows how to select the right wheel file matching the current system, and how to install the package at light speed (without compilation). Said otherwise, a package that requires compilation of a C module, for instance "numpy", does *not* compile anything when installing from a wheel file. {{pypi.python.org}} already provides wheels for major Python versions. If the wheel is not available, pip will compile it from source anyway. Mirroring of PyPI is possible through projects such as http://doc.devpi.net/latest/ (untested) or the PyPI mirror support in Artifactory (tested personally). {{pip}} also provides the ability to easily generate all the wheels of all packages used for a given project inside a "virtualenv". This is called a "wheelhouse". You can even skip the compilation entirely and retrieve the wheels directly from pypi.python.org. *Use Case 1: no internet connectivity* Here is my first proposal for a deployment workflow, for the case where the Spark cluster does not have any internet connectivity or access to a PyPI mirror. In this case the simplest way to deploy a project with several dependencies is to build and then send the complete "wheelhouse": - you are writing a PySpark script that grows in size and dependencies. Deploying on Spark, for example, requires building numpy or Theano and other dependencies - to use the "Big Fat Wheelhouse" support of PySpark, you need to turn your script into a standard Python package: -- write a {{requirements.txt}}. I recommend pinning all package versions.
You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the requirements.txt {code} astroid==1.4.6 # via pylint autopep8==1.2.4 click==6.6 # via pip-tools colorama==0.3.7 # via pylint enum34==1.1.6 # via hypothesis findspark==1.0.0 # via spark-testing-base first==2.0.1 # via pip-tools hypothesis==3.4.0 # via spark-testing-base lazy-object-proxy==1.2.2 # via astroid linecache2==1.0.0 # via traceback2 pbr==1.10.0 pep8==1.7.0 # via autopep8 pip-tools==1.6.5 py==1.4.31 # via pytest pyflakes==1.2.3 pylint==1.5.6 pytest==2.9.2 # via spark-testing-base six==1.10.0 # via astroid, pip-tools, pylint, unittest2 spark-testing-base==0.0.7.post2 traceback2==1.4.0 # via unittest2 unittest2==1.1.0 # via spark-testing-base wheel==0.29.0 wrapt==1.10.8 # via astroid {code} -- write a setup.py with some entry points or packages. Use [PBR|http://docs.openstack.org/developer/pbr/], it makes the job of maintaining a setup.py file really easy -- create a virtualenv if not already in one: {code} virtualenv env {code} -- work in your environment, define the requirements you need in {{requirements.txt}}, do all the {{pip install}} you need. - create the wheelhouse for your current project {code} pip install wheel pip wheel . --wheel-dir wheelhouse {code} This can take some time, but at the end you have all the .whl files required *for your current system* in a directory {{wheelhouse}}. - zip it into a {{wheelhouse.zip}}. Note that you can have your own package (for instance 'my_package') generated into a wheel and thus installed by {{pip}} automatically. Now comes the time to submit the project: {code} bin/spark-submit --mast
[jira] [Commented] (SPARK-17168) CSV with header is incorrectly read if file is partitioned
[ https://issues.apache.org/jira/browse/SPARK-17168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430444#comment-15430444 ] Takeshi Yamamuro commented on SPARK-17168: -- Seems it is reasonable that Spark writes a header only in a single file when saving DataFrame as csv, and it can correctly re-read them. I imagine we can implement this by remember which file has a header in metadata, a file extension, or something. On the other hand, ISTM it is kind of difficult that Spark decides which file has a header among many files users provide and Spark first reads. A `first` file is ambiguous in case of listing files in parallel https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala#L91 > CSV with header is incorrectly read if file is partitioned > -- > > Key: SPARK-17168 > URL: https://issues.apache.org/jira/browse/SPARK-17168 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Mathieu D >Priority: Minor > > If a CSV file is stored in a partitioned fashion, the DataframeReader.csv > with option header set to true skips the first line of *each partition* > instead of skipping only the first one. > ex: > {code} > // create a partitioned CSV file with header : > val rdd=sc.parallelize(Seq("hdr","1","2","3","4","5","6"), numSlices=2) > rdd.saveAsTextFile("foo") > {code} > Now, if we try to read it with DataframeReader, the first row of the 2nd > partition is skipped. > {code} > val df = spark.read.option("header","true").csv("foo") > df.show > +---+ > |hdr| > +---+ > | 1| > | 2| > | 4| > | 5| > | 6| > +---+ > // one row is missing > {code} > I more or less understand that this is to be consistent with the save > operation of dataframewriter which saves header on each individual partition. > But this is very error-prone. In our case, we have large CSV files with > headers already stored in a partitioned way, so we will lose rows if we read > with header set to true. So we have to manually handle the headers. > I suggest a tri-valued option for header, with something like > "skipOnFirstPartition" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
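Until something like the suggested tri-valued option exists, one defensive reading pattern is to skip the header option entirely and drop the header copies by value. A small Scala sketch against the {{foo}} example above (assumes a SparkSession named {{spark}} and that no data row can equal the header text):
{code}
import org.apache.spark.sql.functions.col

// Read with header disabled so every line, header copies included,
// becomes a row; then drop the rows that are just the repeated header.
val raw = spark.read.option("header", "false").csv("foo")
val data = raw.filter(col("_c0") =!= "hdr")
data.show()  // rows 1..6, none lost
{code}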
[jira] [Updated] (SPARK-15113) Add missing numFeatures & numClasses to wrapped JavaClassificationModel
[ https://issues.apache.org/jira/browse/SPARK-15113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-15113: --- Priority: Minor (was: Major) > Add missing numFeatures & numClasses to wrapped JavaClassificationModel > --- > > Key: SPARK-15113 > URL: https://issues.apache.org/jira/browse/SPARK-15113 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Assignee: holdenk >Priority: Minor > > As part of SPARK-14813 numFeatures and numClasses are missing in many models > in PySpark ML pipeline. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15113) Add missing numFeatures & numClasses to wrapped JavaClassificationModel
[ https://issues.apache.org/jira/browse/SPARK-15113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-15113: --- Assignee: holdenk > Add missing numFeatures & numClasses to wrapped JavaClassificationModel > --- > > Key: SPARK-15113 > URL: https://issues.apache.org/jira/browse/SPARK-15113 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Assignee: holdenk > > As part of SPARK-14813 numFeatures and numClasses are missing in many models > in PySpark ML pipeline. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17181) [Spark2.0 web ui] The status of certain jobs is still displayed as running even if all the stages of the job have already finished
marymwu created SPARK-17181: --- Summary: [Spark2.0 web ui] The status of certain jobs is still displayed as running even if all the stages of the job have already finished Key: SPARK-17181 URL: https://issues.apache.org/jira/browse/SPARK-17181 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.0.0 Reporter: marymwu Priority: Minor [Spark2.0 web ui] The status of certain jobs is still displayed as running even if all the stages of the job have already finished Note: not sure what kind of jobs will encounter this problem The following log shows that job 1000 has already finished, but on the Spark 2.0 web UI the status of job 1000 is still displayed as running; see the attached files 16/08/22 16:01:29 INFO DAGScheduler: dag send msg, result task done, job: 1000 16/08/22 16:01:29 INFO DAGScheduler: Job 1000 finished: run at AccessController.java:-2, took 4.664319 s -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15113) Add missing numFeatures & numClasses to wrapped JavaClassificationModel
[ https://issues.apache.org/jira/browse/SPARK-15113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-15113. Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 12889 [https://github.com/apache/spark/pull/12889] > Add missing numFeatures & numClasses to wrapped JavaClassificationModel > --- > > Key: SPARK-15113 > URL: https://issues.apache.org/jira/browse/SPARK-15113 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Assignee: holdenk >Priority: Minor > Fix For: 2.1.0 > > > As part of SPARK-14813 numFeatures and numClasses are missing in many models > in PySpark ML pipeline. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17182) CollectList and CollectSet should be marked as non-deterministic
Cheng Lian created SPARK-17182: -- Summary: CollectList and CollectSet should be marked as non-deterministic Key: SPARK-17182 URL: https://issues.apache.org/jira/browse/SPARK-17182 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Cheng Lian Assignee: Cheng Lian {{CollectList}} and {{CollectSet}} should be marked as non-deterministic since their results depend on the actual order of input rows. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
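The order dependence is easy to demonstrate: the same aggregation over differently ordered input produces differently ordered arrays. A quick Scala illustration (assumes a SparkSession named {{spark}}; the commented outputs are typical but not guaranteed, which is exactly the point):
{code}
import spark.implicits._
import org.apache.spark.sql.functions.collect_list

val df = Seq(("a", 1), ("a", 2), ("a", 3)).toDF("k", "v")

df.sortWithinPartitions($"v".asc)
  .groupBy($"k").agg(collect_list($"v")).show()   // typically [1, 2, 3]

df.sortWithinPartitions($"v".desc)
  .groupBy($"k").agg(collect_list($"v")).show()   // typically [3, 2, 1]
{code}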
[jira] [Commented] (SPARK-17182) CollectList and CollectSet should be marked as non-deterministic
[ https://issues.apache.org/jira/browse/SPARK-17182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430532#comment-15430532 ] Apache Spark commented on SPARK-17182: -- User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/14749 > CollectList and CollectSet should be marked as non-deterministic > > > Key: SPARK-17182 > URL: https://issues.apache.org/jira/browse/SPARK-17182 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > {{CollectList}} and {{CollectSet}} should be marked as non-deterministic > since their results depend on the actual order of input rows. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17182) CollectList and CollectSet should be marked as non-deterministic
[ https://issues.apache.org/jira/browse/SPARK-17182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17182: Assignee: Apache Spark (was: Cheng Lian) > CollectList and CollectSet should be marked as non-deterministic > > > Key: SPARK-17182 > URL: https://issues.apache.org/jira/browse/SPARK-17182 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Apache Spark > > {{CollectList}} and {{CollectSet}} should be marked as non-deterministic > since their results depend on the actual order of input rows. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17182) CollectList and CollectSet should be marked as non-deterministic
[ https://issues.apache.org/jira/browse/SPARK-17182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17182: Assignee: Cheng Lian (was: Apache Spark) > CollectList and CollectSet should be marked as non-deterministic > > > Key: SPARK-17182 > URL: https://issues.apache.org/jira/browse/SPARK-17182 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > {{CollectList}} and {{CollectSet}} should be marked as non-deterministic > since their results depend on the actual order of input rows. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11215) Add multiple columns support to StringIndexer
[ https://issues.apache.org/jira/browse/SPARK-11215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang reassigned SPARK-11215: --- Assignee: Yanbo Liang > Add multiple columns support to StringIndexer > - > > Key: SPARK-11215 > URL: https://issues.apache.org/jira/browse/SPARK-11215 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang > > Add multiple columns support to StringIndexer, then users can transform > multiple input columns to multiple output columns simultaneously. See > discussion SPARK-8418. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17181) [Spark2.0 web ui] The status of certain jobs is still displayed as running even if all the stages of the job have already finished
[ https://issues.apache.org/jira/browse/SPARK-17181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] marymwu updated SPARK-17181: Attachment: job1000-2.png job1000-1.png > [Spark2.0 web ui] The status of certain jobs is still displayed as running > even if all the stages of the job have already finished > --- > > Key: SPARK-17181 > URL: https://issues.apache.org/jira/browse/SPARK-17181 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0 >Reporter: marymwu >Priority: Minor > Attachments: job1000-1.png, job1000-2.png > > > [Spark2.0 web ui] The status of certain jobs is still displayed as running > even if all the stages of the job have already finished > Note: not sure what kind of jobs will encounter this problem > The following log shows that job 1000 has already finished, but on the Spark 2.0 > web UI the status of job 1000 is still displayed as running; see the attached > files > 16/08/22 16:01:29 INFO DAGScheduler: dag send msg, result task done, job: 1000 > 16/08/22 16:01:29 INFO DAGScheduler: Job 1000 finished: run at > AccessController.java:-2, took 4.664319 s -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7493) ALTER TABLE statement
[ https://issues.apache.org/jira/browse/SPARK-7493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430650#comment-15430650 ] Sergey Semichev commented on SPARK-7493: Good to know, thanks > ALTER TABLE statement > - > > Key: SPARK-7493 > URL: https://issues.apache.org/jira/browse/SPARK-7493 > Project: Spark > Issue Type: Bug > Components: SQL > Environment: Databricks cloud >Reporter: Sergey Semichev >Priority: Minor > > Full table name (database_name.table_name) cannot be used with "ALTER TABLE" > statement > It works with CREATE TABLE > "ALTER TABLE database_name.table_name ADD PARTITION (source_year='2014', > source_month='01')." > Error in SQL statement: java.lang.RuntimeException: > org.apache.spark.sql.AnalysisException: mismatched input 'ADD' expecting > KW_EXCHANGE near 'test_table' in alter exchange partition; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-7493) ALTER TABLE statement
[ https://issues.apache.org/jira/browse/SPARK-7493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Semichev closed SPARK-7493. -- Resolution: Fixed > ALTER TABLE statement > - > > Key: SPARK-7493 > URL: https://issues.apache.org/jira/browse/SPARK-7493 > Project: Spark > Issue Type: Bug > Components: SQL > Environment: Databricks cloud >Reporter: Sergey Semichev >Priority: Minor > > Full table name (database_name.table_name) cannot be used with "ALTER TABLE" > statement > It works with CREATE TABLE > "ALTER TABLE database_name.table_name ADD PARTITION (source_year='2014', > source_month='01')." > Error in SQL statement: java.lang.RuntimeException: > org.apache.spark.sql.AnalysisException: mismatched input 'ADD' expecting > KW_EXCHANGE near 'test_table' in alter exchange partition; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17183) put hive serde table schema to table properties like data source table
Wenchen Fan created SPARK-17183: --- Summary: put hive serde table schema to table properties like data source table Key: SPARK-17183 URL: https://issues.apache.org/jira/browse/SPARK-17183 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17183) put hive serde table schema to table properties like data source table
[ https://issues.apache.org/jira/browse/SPARK-17183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430718#comment-15430718 ] Apache Spark commented on SPARK-17183: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/14750 > put hive serde table schema to table properties like data source table > -- > > Key: SPARK-17183 > URL: https://issues.apache.org/jira/browse/SPARK-17183 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17183) put hive serde table schema to table properties like data source table
[ https://issues.apache.org/jira/browse/SPARK-17183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17183: Assignee: Wenchen Fan (was: Apache Spark) > put hive serde table schema to table properties like data source table > -- > > Key: SPARK-17183 > URL: https://issues.apache.org/jira/browse/SPARK-17183 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17183) put hive serde table schema to table properties like data source table
[ https://issues.apache.org/jira/browse/SPARK-17183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17183: Assignee: Apache Spark (was: Wenchen Fan) > put hive serde table schema to table properties like data source table > -- > > Key: SPARK-17183 > URL: https://issues.apache.org/jira/browse/SPARK-17183 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16914) NodeManager crash when spark are registering executor infomartion into leveldb
[ https://issues.apache.org/jira/browse/SPARK-16914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430725#comment-15430725 ] Thomas Graves commented on SPARK-16914: --- that is considered a fatal issue for the nodemanager and its expected to fail. hardware goes bad sometimes so you either make sure these paths are durable or the nodemanager is going to crash. not sure what other options you have here > NodeManager crash when spark are registering executor infomartion into leveldb > -- > > Key: SPARK-16914 > URL: https://issues.apache.org/jira/browse/SPARK-16914 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.6.2 >Reporter: cen yuhai > > {noformat} > Stack: [0x7fb5b53de000,0x7fb5b54df000], sp=0x7fb5b54dcba8, free > space=1018k > Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native > code) > C [libc.so.6+0x896b1] memcpy+0x11 > Java frames: (J=compiled Java code, j=interpreted, Vv=VM code) > j > org.fusesource.leveldbjni.internal.NativeDB$DBJNI.Put(JLorg/fusesource/leveldbjni/internal/NativeWriteOptions;Lorg/fusesource/leveldbjni/internal/NativeSlice;Lorg/fusesource/leveldbjni/internal/NativeSlice;)J+0 > j > org.fusesource.leveldbjni.internal.NativeDB.put(Lorg/fusesource/leveldbjni/internal/NativeWriteOptions;Lorg/fusesource/leveldbjni/internal/NativeSlice;Lorg/fusesource/leveldbjni/internal/NativeSlice;)V+11 > j > org.fusesource.leveldbjni.internal.NativeDB.put(Lorg/fusesource/leveldbjni/internal/NativeWriteOptions;Lorg/fusesource/leveldbjni/internal/NativeBuffer;Lorg/fusesource/leveldbjni/internal/NativeBuffer;)V+18 > j > org.fusesource.leveldbjni.internal.NativeDB.put(Lorg/fusesource/leveldbjni/internal/NativeWriteOptions;[B[B)V+36 > j > org.fusesource.leveldbjni.internal.JniDB.put([B[BLorg/iq80/leveldb/WriteOptions;)Lorg/iq80/leveldb/Snapshot;+28 > j org.fusesource.leveldbjni.internal.JniDB.put([B[B)V+10 > j > org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.registerExecutor(Ljava/lang/String;Ljava/lang/String;Lorg/apache/spark/network/shuffle/protocol/ExecutorShuffleInfo;)V+61 > J 8429 C2 > org.apache.spark.network.server.TransportRequestHandler.handle(Lorg/apache/spark/network/protocol/RequestMessage;)V > (100 bytes) @ 0x7fb5f27ff6cc [0x7fb5f27fdde0+0x18ec] > J 8371 C2 > org.apache.spark.network.server.TransportChannelHandler.channelRead0(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V > (10 bytes) @ 0x7fb5f242df20 [0x7fb5f242de80+0xa0] > J 6853 C2 > io.netty.channel.SimpleChannelInboundHandler.channelRead(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V > (74 bytes) @ 0x7fb5f215587c [0x7fb5f21557e0+0x9c] > J 5872 C2 > io.netty.handler.timeout.IdleStateHandler.channelRead(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V > (42 bytes) @ 0x7fb5f2183268 [0x7fb5f2183100+0x168] > J 5849 C2 > io.netty.handler.codec.MessageToMessageDecoder.channelRead(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V > (158 bytes) @ 0x7fb5f2191524 [0x7fb5f218f5a0+0x1f84] > J 5941 C2 > org.apache.spark.network.util.TransportFrameDecoder.channelRead(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V > (170 bytes) @ 0x7fb5f220a230 [0x7fb5f2209fc0+0x270] > J 7747 C2 io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read()V > (363 bytes) @ 0x7fb5f264465c [0x7fb5f2644140+0x51c] > J 8008% C2 io.netty.channel.nio.NioEventLoop.run()V (162 bytes) @ > 0x7fb5f26f6764 [0x7fb5f26f63c0+0x3a4] > j 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run()V+13 > j java.lang.Thread.run()V+11 > v ~StubRoutines::call_stub > {noformat} > The target code in spark is in ExternalShuffleBlockResolver > {code} > /** Registers a new Executor with all the configuration we need to find its > shuffle files. */ > public void registerExecutor( > String appId, > String execId, > ExecutorShuffleInfo executorInfo) { > AppExecId fullId = new AppExecId(appId, execId); > logger.info("Registered executor {} with {}", fullId, executorInfo); > try { > if (db != null) { > byte[] key = dbAppExecKey(fullId); > byte[] value = > mapper.writeValueAsString(executorInfo).getBytes(Charsets.UTF_8); > db.put(key, value); > } > } catch (Exception e) { > logger.error("Error saving registered executors", e); > } > executors.put(fullId, executorInfo); > } > {code} > There is a problem with disk1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-
[jira] [Created] (SPARK-17184) Replace ByteBuf with InputStream
Guoqiang Li created SPARK-17184: --- Summary: Replace ByteBuf with InputStream Key: SPARK-17184 URL: https://issues.apache.org/jira/browse/SPARK-17184 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Guoqiang Li The size of a ByteBuf cannot be greater than 2G; it should be replaced by an InputStream. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
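The intended direction can be sketched generically: consume a large payload as a stream of bounded chunks rather than as one buffer whose addressable size is capped at Integer.MAX_VALUE bytes. Plain-JDK Scala sketch; the file path is made up for illustration and this is not the proposed Spark change itself.
{code}
import java.io.{BufferedInputStream, FileInputStream, InputStream}

// Hand a >2G payload to a consumer in bounded chunks instead of one buffer.
def forEachChunk(in: InputStream, chunkSize: Int)(f: (Array[Byte], Int) => Unit): Unit = {
  val buf = new Array[Byte](chunkSize)
  var n = in.read(buf)
  while (n != -1) {
    f(buf, n)          // each call sees at most chunkSize valid bytes
    n = in.read(buf)
  }
}

val in = new BufferedInputStream(new FileInputStream("/tmp/huge.bin"))  // hypothetical path
try forEachChunk(in, 4 * 1024 * 1024) { (chunk, len) => () /* e.g. write to a channel */ }
finally in.close()
{code}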
[jira] [Created] (SPARK-17185) Unify naming of API for RDD and Dataset
Xiang Gao created SPARK-17185: - Summary: Unify naming of API for RDD and Dataset Key: SPARK-17185 URL: https://issues.apache.org/jira/browse/SPARK-17185 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Reporter: Xiang Gao In RDD, groupByKey is used to generate a key-list pair and aggregateByKey is used to do aggregation. In Dataset, aggregation is done by groupBy and groupByKey, and no API for key-list pairs is provided. The same name "groupBy" is designed to do different things, and this might be confusing. Besides, it would be more convenient to provide an API to generate key-list pairs for Dataset. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
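The asymmetry in one place (Scala; assumes {{spark}}, {{sc}} and {{import spark.implicits._}}): RDD gives key-list pairs directly via {{groupByKey}}, while the Dataset route goes through {{groupByKey}} plus {{mapGroups}}.
{code}
import spark.implicits._

val pairs = Seq(("a", 1), ("a", 2), ("b", 3))

// RDD: groupByKey yields (key, values) directly.
val byKeyRdd = sc.parallelize(pairs).groupByKey()        // (a, [1, 2]), (b, [3])

// Dataset: no direct equivalent; build the list through mapGroups.
val byKeyDs = pairs.toDS()
  .groupByKey(_._1)
  .mapGroups { case (k, vs) => (k, vs.map(_._2).toList) }
{code}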
[jira] [Updated] (SPARK-17185) Unify naming of API for RDD and Dataset
[ https://issues.apache.org/jira/browse/SPARK-17185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-17185: -- Priority: Minor (was: Major) I know what you mean, but is there a way to do this without changing the API? I'm not sure there is. I am not sure it's worth changing the API for this. > Unify naming of API for RDD and Dataset > --- > > Key: SPARK-17185 > URL: https://issues.apache.org/jira/browse/SPARK-17185 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Reporter: Xiang Gao >Priority: Minor > > In RDD, groupByKey is used to generate a key-list pair and aggregateByKey is > used to do aggregation. > In Dataset, aggregation is done by groupBy and groupByKey, and no API for > key-list pairs is provided. > The same name "groupBy" is designed to do different things, and this might be > confusing. Besides, it would be more convenient to provide an API to generate > key-list pairs for Dataset. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17184) Replace ByteBuf with InputStream
[ https://issues.apache.org/jira/browse/SPARK-17184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430753#comment-15430753 ] Sean Owen commented on SPARK-17184: --- Before opening JIRAs, could you please respond to the request for a brief design doc? This is a significant change, so we'd want to get some consensus on the approach before you go do the work, and also before recording JIRA tasks that seem to indicate a particular direction. I'm not sure that's agreed yet. > Replace ByteBuf with InputStream > > > Key: SPARK-17184 > URL: https://issues.apache.org/jira/browse/SPARK-17184 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Guoqiang Li > > The size of a ByteBuf cannot be greater than 2G, so it should be replaced by > an InputStream. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17186) remove catalog table type INDEX
Wenchen Fan created SPARK-17186: --- Summary: remove catalog table type INDEX Key: SPARK-17186 URL: https://issues.apache.org/jira/browse/SPARK-17186 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17184) Replace ByteBuf with InputStream
[ https://issues.apache.org/jira/browse/SPARK-17184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17184: Assignee: Apache Spark > Replace ByteBuf with InputStream > > > Key: SPARK-17184 > URL: https://issues.apache.org/jira/browse/SPARK-17184 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Guoqiang Li >Assignee: Apache Spark > > The size of a ByteBuf cannot be greater than 2G, so it should be replaced by > an InputStream. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17184) Replace ByteBuf with InputStream
[ https://issues.apache.org/jira/browse/SPARK-17184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430773#comment-15430773 ] Apache Spark commented on SPARK-17184: -- User 'witgo' has created a pull request for this issue: https://github.com/apache/spark/pull/14751 > Replace ByteBuf with InputStream > > > Key: SPARK-17184 > URL: https://issues.apache.org/jira/browse/SPARK-17184 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Guoqiang Li > > The size of a ByteBuf cannot be greater than 2G, so it should be replaced by > an InputStream. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17184) Replace ByteBuf with InputStream
[ https://issues.apache.org/jira/browse/SPARK-17184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17184: Assignee: (was: Apache Spark) > Replace ByteBuf with InputStream > > > Key: SPARK-17184 > URL: https://issues.apache.org/jira/browse/SPARK-17184 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Guoqiang Li > > The size of a ByteBuf cannot be greater than 2G, so it should be replaced by > an InputStream. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17178) Allow to set sparkr shell command through --conf
[ https://issues.apache.org/jira/browse/SPARK-17178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17178: Assignee: (was: Apache Spark) > Allow to set sparkr shell command through --conf > > > Key: SPARK-17178 > URL: https://issues.apache.org/jira/browse/SPARK-17178 > Project: Spark > Issue Type: Improvement > Components: Spark Submit, SparkR >Affects Versions: 2.0.0 >Reporter: Jeff Zhang > > For now the only way to set the sparkr shell command is through the environment > variable SPARKR_DRIVER_R, which is not so convenient. The configurations > spark.r.command and spark.r.driver.command are for specifying the R command for > running R scripts. So I think it is natural to provide a configuration for the > sparkr shell command as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17178) Allow to set sparkr shell command through --conf
[ https://issues.apache.org/jira/browse/SPARK-17178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17178: Assignee: Apache Spark > Allow to set sparkr shell command through --conf > > > Key: SPARK-17178 > URL: https://issues.apache.org/jira/browse/SPARK-17178 > Project: Spark > Issue Type: Improvement > Components: Spark Submit, SparkR >Affects Versions: 2.0.0 >Reporter: Jeff Zhang >Assignee: Apache Spark > > For now the only way to set the sparkr shell command is through the environment > variable SPARKR_DRIVER_R, which is not so convenient. The configurations > spark.r.command and spark.r.driver.command are for specifying the R command for > running R scripts. So I think it is natural to provide a configuration for the > sparkr shell command as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17178) Allow to set sparkr shell command through --conf
[ https://issues.apache.org/jira/browse/SPARK-17178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430787#comment-15430787 ] Apache Spark commented on SPARK-17178: -- User 'zjffdu' has created a pull request for this issue: https://github.com/apache/spark/pull/14744 > Allow to set sparkr shell command through --conf > > > Key: SPARK-17178 > URL: https://issues.apache.org/jira/browse/SPARK-17178 > Project: Spark > Issue Type: Improvement > Components: Spark Submit, SparkR >Affects Versions: 2.0.0 >Reporter: Jeff Zhang > > For now the only way to set the sparkr shell command is through the environment > variable SPARKR_DRIVER_R, which is not so convenient. The configurations > spark.r.command and spark.r.driver.command are for specifying the R command for > running R scripts. So I think it is natural to provide a configuration for the > sparkr shell command as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17186) remove catalog table type INDEX
[ https://issues.apache.org/jira/browse/SPARK-17186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17186: Assignee: Wenchen Fan (was: Apache Spark) > remove catalog table type INDEX > --- > > Key: SPARK-17186 > URL: https://issues.apache.org/jira/browse/SPARK-17186 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17186) remove catalog table type INDEX
[ https://issues.apache.org/jira/browse/SPARK-17186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430813#comment-15430813 ] Apache Spark commented on SPARK-17186: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/14752 > remove catalog table type INDEX > --- > > Key: SPARK-17186 > URL: https://issues.apache.org/jira/browse/SPARK-17186 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17186) remove catalog table type INDEX
[ https://issues.apache.org/jira/browse/SPARK-17186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17186: Assignee: Apache Spark (was: Wenchen Fan) > remove catalog table type INDEX > --- > > Key: SPARK-17186 > URL: https://issues.apache.org/jira/browse/SPARK-17186 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7493) ALTER TABLE statement
[ https://issues.apache.org/jira/browse/SPARK-7493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430821#comment-15430821 ] Dongjoon Hyun commented on SPARK-7493: -- Thank you! > ALTER TABLE statement > - > > Key: SPARK-7493 > URL: https://issues.apache.org/jira/browse/SPARK-7493 > Project: Spark > Issue Type: Bug > Components: SQL > Environment: Databricks cloud >Reporter: Sergey Semichev >Priority: Minor > > A full table name (database_name.table_name) cannot be used with the "ALTER TABLE" > statement, though it works with CREATE TABLE: > "ALTER TABLE database_name.table_name ADD PARTITION (source_year='2014', > source_month='01')." > Error in SQL statement: java.lang.RuntimeException: > org.apache.spark.sql.AnalysisException: mismatched input 'ADD' expecting > KW_EXCHANGE near 'test_table' in alter exchange partition; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17181) [Spark2.0 web ui] The status of certain jobs is still displayed as running even if all the stages of the job have already finished
[ https://issues.apache.org/jira/browse/SPARK-17181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430855#comment-15430855 ] Thomas Graves commented on SPARK-17181: --- Check your log files to see if you see something like: logError("Dropping SparkListenerEvent because no remaining room in event queue. " + "This likely means one of the SparkListeners is too slow and cannot keep up with " + "the rate at which tasks are being started by the scheduler.") If so, the event listener queue dropped some events, and the UI can get out of sync when that happens. > [Spark2.0 web ui] The status of certain jobs is still displayed as running > even if all the stages of the job have already finished > --- > > Key: SPARK-17181 > URL: https://issues.apache.org/jira/browse/SPARK-17181 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0 >Reporter: marymwu >Priority: Minor > Attachments: job1000-1.png, job1000-2.png > > > [Spark2.0 web ui] The status of certain jobs is still displayed as running > even if all the stages of the job have already finished > Note: not sure what kind of jobs will encounter this problem > The following log shows that job 1000 has already been done, but on the Spark 2.0 > web UI the status of job 1000 is still displayed as running; see the attached > files > 16/08/22 16:01:29 INFO DAGScheduler: dag send msg, result task done, job: 1000 > 16/08/22 16:01:29 INFO DAGScheduler: Job 1000 finished: run at > AccessController.java:-2, took 4.664319 s -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
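For intuition, the listener bus is a bounded queue drained by listener threads; here is a toy Scala model (not Spark's actual implementation) of why one slow listener forces events to be dropped:
{code}
import java.util.concurrent.ArrayBlockingQueue

// Bounded event queue: posts are non-blocking, so once a slow consumer
// lets the queue fill up, further events are simply dropped.
val queue = new ArrayBlockingQueue[String](10000)

def post(event: String): Unit = {
  if (!queue.offer(event)) {
    println("Dropping SparkListenerEvent because no remaining room in event queue.")
  }
}
{code}
When a job-end event is among those dropped, the web UI never learns that the job finished, which matches the symptom reported here.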
[jira] [Commented] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets
[ https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430872#comment-15430872 ] Cody Koeninger commented on SPARK-17147: My point is more that this probably isn't just two lines in CachedKafkaConsumer. There's other code, both within the spark streaming connector and in users of the connector, that assumes an offset range from..until has a number of messages equal to (until - from). I haven't seen what databricks is coming up with for the structured streaming connector, but I'd imagine that an assumption that offsets are contiguous would certainly simplify that implementation, and might actually be necessary depending on how recovery works. This might be as simple as your change plus logging a warning when a stream starts on a compacted topic, but we need to think through the issues here. > Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets > > > Key: SPARK-17147 > URL: https://issues.apache.org/jira/browse/SPARK-17147 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 2.0.0 >Reporter: Robert Conrad > > When Kafka does log compaction, offsets often end up with gaps, meaning the > next requested offset will frequently not be offset+1. The logic in > KafkaRDD & CachedKafkaConsumer has a baked-in assumption that the next offset > will always be just an increment of 1 above the previous offset. > I have worked around this problem by changing CachedKafkaConsumer to use the > returned record's offset, from: > {{nextOffset = offset + 1}} > to: > {{nextOffset = record.offset + 1}} > and changed KafkaRDD from: > {{requestOffset += 1}} > to: > {{requestOffset = r.offset() + 1}} > (I also had to change some assert logic in CachedKafkaConsumer). > There's a strong possibility that I have misconstrued how to use the > streaming kafka consumer, and I'm happy to close this out if that's the case. > If, however, it is supposed to support non-consecutive offsets (e.g. due to > log compaction) I am also happy to contribute a PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
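A sketch of the gap-tolerant iteration the reporter describes (hedged: {{fetch}} stands in for the cached consumer's fetch-at-offset call, and all names here are illustrative, not the connector's API):
{code}
import org.apache.kafka.clients.consumer.ConsumerRecord

// Offsets in [fromOffset, untilOffset) may have holes on a compacted topic,
// so advance from the offset the broker actually returned, not by +1 blindly.
def drain[K, V](
    fromOffset: Long,
    untilOffset: Long,
    fetch: Long => ConsumerRecord[K, V],
    process: ConsumerRecord[K, V] => Unit): Unit = {
  var nextOffset = fromOffset
  while (nextOffset < untilOffset) {
    val record = fetch(nextOffset)
    process(record)
    nextOffset = record.offset + 1 // tolerates gaps; nextOffset + 1 would not
  }
}
{code}
Note Cody's caveat still applies: any code that assumes the range contains exactly (until - from) messages stays wrong on compacted topics even with this loop.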
[jira] [Comment Edited] (SPARK-14948) Exception when joining DataFrames derived from the same DataFrame
[ https://issues.apache.org/jira/browse/SPARK-14948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430881#comment-15430881 ] Wenchen Fan edited comment on SPARK-14948 at 8/22/16 2:38 PM: -- Can you check it with 2.0? I converted your code snippet into a Scala version and tried it with the latest code; it works. was (Author: cloud_fan): Can you double-check it? I converted your code snippet into a Scala version and tried it with the latest code; it works. > Exception when joining DataFrames derived from the same DataFrame > - > > Key: SPARK-14948 > URL: https://issues.apache.org/jira/browse/SPARK-14948 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Saurabh Santhosh > > h2. The Spark Analyser is throwing the following exception in a specific scenario: > h2. Exception: > org.apache.spark.sql.AnalysisException: resolved attribute(s) F1#3 missing > from asd#5,F2#4,F1#6,F2#7 in operator !Project [asd#5,F1#3]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) > h2. Code:
> {code:title=SparkClient.java|borderStyle=solid}
> StructField[] fields = new StructField[2];
> fields[0] = new StructField("F1", DataTypes.StringType, true, Metadata.empty());
> fields[1] = new StructField("F2", DataTypes.StringType, true, Metadata.empty());
> JavaRDD<Row> rdd = sparkClient.getJavaSparkContext().parallelize(Arrays.asList(RowFactory.create("a", "b")));
> DataFrame df = sparkClient.getSparkHiveContext().createDataFrame(rdd, new StructType(fields));
> sparkClient.getSparkHiveContext().registerDataFrameAsTable(df, "t1");
> DataFrame aliasedDf = sparkClient.getSparkHiveContext().sql("select F1 as asd, F2 from t1");
> sparkClient.getSparkHiveContext().registerDataFrameAsTable(aliasedDf, "t2");
> sparkClient.getSparkHiveContext().registerDataFrameAsTable(df, "t3");
> DataFrame join = aliasedDf.join(df, aliasedDf.col("F2").equalTo(df.col("F2")), "inner");
> DataFrame select = join.select(aliasedDf.col("asd"), df.col("F1"));
> select.collect();
> {code}
> h2. Observations: > * This issue is related to the data type of the fields of the initial DataFrame. (If the data type is not String, it will work.) > * It works fine if the DataFrame is registered as a temporary table and a SQL query (select a.asd,b.F1 from t2 a inner join t3 b on a.F2=b.F2) is written. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14948) Exception when joining DataFrames derived from the same DataFrame
[ https://issues.apache.org/jira/browse/SPARK-14948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430881#comment-15430881 ] Wenchen Fan commented on SPARK-14948: - Can you double-check it? I converted your code snippet into a Scala version and tried it with the latest code; it works. > Exception when joining DataFrames derived from the same DataFrame > - > > Key: SPARK-14948 > URL: https://issues.apache.org/jira/browse/SPARK-14948 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Saurabh Santhosh > > h2. The Spark Analyser is throwing the following exception in a specific scenario: > h2. Exception: > org.apache.spark.sql.AnalysisException: resolved attribute(s) F1#3 missing > from asd#5,F2#4,F1#6,F2#7 in operator !Project [asd#5,F1#3]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) > h2. Code:
> {code:title=SparkClient.java|borderStyle=solid}
> StructField[] fields = new StructField[2];
> fields[0] = new StructField("F1", DataTypes.StringType, true, Metadata.empty());
> fields[1] = new StructField("F2", DataTypes.StringType, true, Metadata.empty());
> JavaRDD<Row> rdd = sparkClient.getJavaSparkContext().parallelize(Arrays.asList(RowFactory.create("a", "b")));
> DataFrame df = sparkClient.getSparkHiveContext().createDataFrame(rdd, new StructType(fields));
> sparkClient.getSparkHiveContext().registerDataFrameAsTable(df, "t1");
> DataFrame aliasedDf = sparkClient.getSparkHiveContext().sql("select F1 as asd, F2 from t1");
> sparkClient.getSparkHiveContext().registerDataFrameAsTable(aliasedDf, "t2");
> sparkClient.getSparkHiveContext().registerDataFrameAsTable(df, "t3");
> DataFrame join = aliasedDf.join(df, aliasedDf.col("F2").equalTo(df.col("F2")), "inner");
> DataFrame select = join.select(aliasedDf.col("asd"), df.col("F1"));
> select.collect();
> {code}
> h2. Observations: > * This issue is related to the data type of the fields of the initial DataFrame. (If the data type is not String, it will work.) > * It works fine if the DataFrame is registered as a temporary table and a SQL query (select a.asd,b.F1 from t2 a inner join t3 b on a.F2=b.F2) is written. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17185) Unify naming of API for RDD and Dataset
[ https://issues.apache.org/jira/browse/SPARK-17185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430897#comment-15430897 ] Xiang Gao commented on SPARK-17185: --- Changing the API is a bad idea and we should not do this. Maybe these changes might help (I'm not sure): * add {{aggregateByKey}} and {{aggregateBy}} to {{Dataset}}, which do exactly the same things as {{groupByKey}} and {{groupBy}} do now. * The return values of {{aggregateByKey}} and {{aggregateBy}} should be two new classes: {{KeyValueAggregatedDataset}} and {{RelationalAggregatedDataset}}, which are copies of {{KeyValueGroupedDataset}} and {{RelationalGroupedDataset}} as they are now. * add new methods to get a key-list pair to the classes {{KeyValueGroupedDataset}} and {{RelationalAggregatedDataset}}, and maybe deprecate the methods that do aggregation in these two classes > Unify naming of API for RDD and Dataset > --- > > Key: SPARK-17185 > URL: https://issues.apache.org/jira/browse/SPARK-17185 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Reporter: Xiang Gao >Priority: Minor > > In RDD, groupByKey is used to generate a key-list pair and aggregateByKey is > used to do aggregation. > In Dataset, aggregation is done by groupBy and groupByKey, and no API for > key-list pairs is provided. > The same name "groupBy" is designed to do different things and this might be > confusing. Besides, it would be more convenient to provide an API to generate > key-list pairs for Dataset. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14948) Exception when joining DataFrames derived from the same DataFrame
[ https://issues.apache.org/jira/browse/SPARK-14948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430907#comment-15430907 ] Wenchen Fan commented on SPARK-14948: - Actually, `registerDataFrameAsTable` registers the DataFrame as a temp table; see the documentation of this method:
{code}
/**
 * Registers the given [[DataFrame]] as a temporary table in the catalog. Temporary tables exist
 * only during the lifetime of this instance of SQLContext.
 */
private[sql] def registerDataFrameAsTable(df: DataFrame, tableName: String): Unit = {
  catalog.registerTable(TableIdentifier(tableName), df.logicalPlan)
}
{code}
> Exception when joining DataFrames derived from the same DataFrame > - > > Key: SPARK-14948 > URL: https://issues.apache.org/jira/browse/SPARK-14948 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Saurabh Santhosh > > h2. The Spark Analyser is throwing the following exception in a specific scenario: > h2. Exception: > org.apache.spark.sql.AnalysisException: resolved attribute(s) F1#3 missing > from asd#5,F2#4,F1#6,F2#7 in operator !Project [asd#5,F1#3]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) > h2. Code:
> {code:title=SparkClient.java|borderStyle=solid}
> StructField[] fields = new StructField[2];
> fields[0] = new StructField("F1", DataTypes.StringType, true, Metadata.empty());
> fields[1] = new StructField("F2", DataTypes.StringType, true, Metadata.empty());
> JavaRDD<Row> rdd = sparkClient.getJavaSparkContext().parallelize(Arrays.asList(RowFactory.create("a", "b")));
> DataFrame df = sparkClient.getSparkHiveContext().createDataFrame(rdd, new StructType(fields));
> sparkClient.getSparkHiveContext().registerDataFrameAsTable(df, "t1");
> DataFrame aliasedDf = sparkClient.getSparkHiveContext().sql("select F1 as asd, F2 from t1");
> sparkClient.getSparkHiveContext().registerDataFrameAsTable(aliasedDf, "t2");
> sparkClient.getSparkHiveContext().registerDataFrameAsTable(df, "t3");
> DataFrame join = aliasedDf.join(df, aliasedDf.col("F2").equalTo(df.col("F2")), "inner");
> DataFrame select = join.select(aliasedDf.col("asd"), df.col("F1"));
> select.collect();
> {code}
> h2. Observations: > * This issue is related to the data type of the fields of the initial DataFrame. (If the data type is not String, it will work.) > * It works fine if the DataFrame is registered as a temporary table and a SQL query (select a.asd,b.F1 from t2 a inner join t3 b on a.F2=b.F2) is written. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
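For reference, a possible Scala rendering of the reported snippet (a sketch assuming a Spark 2.0 {{SparkSession}} named {{spark}}; not necessarily the exact conversion Wenchen ran):
{code}
import spark.implicits._

val df = Seq(("a", "b")).toDF("F1", "F2")
df.createOrReplaceTempView("t1")

val aliasedDf = spark.sql("select F1 as asd, F2 from t1")
val join = aliasedDf.join(df, aliasedDf.col("F2") === df.col("F2"), "inner")
val select = join.select(aliasedDf.col("asd"), df.col("F1"))
select.collect() // succeeds on the latest code, per the comment above
{code}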
[jira] [Updated] (SPARK-17185) Unify naming of API for RDD and Dataset
[ https://issues.apache.org/jira/browse/SPARK-17185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiang Gao updated SPARK-17185: -- Description: In {{RDD}}, {{groupByKey}} is used to generate a key-list pair and {{aggregateByKey}} is used to do aggregation. In {{Dataset}}, aggregation is done by {{groupBy}} and {{groupByKey}}, and no API for key-list pairs is provided. The same name {{groupBy}} is designed to do different things and this might be confusing. Besides, it would be more convenient to provide an API to generate key-list pairs for {{Dataset}}. was: In RDD, groupByKey is used to generate a key-list pair and aggregateByKey is used to do aggregation. In Dataset, aggregation is done by groupBy and groupByKey, and no API for key-list pairs is provided. The same name "groupBy" is designed to do different things and this might be confusing. Besides, it would be more convenient to provide an API to generate key-list pairs for Dataset. > Unify naming of API for RDD and Dataset > --- > > Key: SPARK-17185 > URL: https://issues.apache.org/jira/browse/SPARK-17185 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Reporter: Xiang Gao >Priority: Minor > > In {{RDD}}, {{groupByKey}} is used to generate a key-list pair and > {{aggregateByKey}} is used to do aggregation. > In {{Dataset}}, aggregation is done by {{groupBy}} and {{groupByKey}}, and no > API for key-list pairs is provided. > The same name {{groupBy}} is designed to do different things and this might > be confusing. Besides, it would be more convenient to provide an API to > generate key-list pairs for {{Dataset}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17185) Unify naming of API for RDD and Dataset
[ https://issues.apache.org/jira/browse/SPARK-17185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430897#comment-15430897 ] Xiang Gao edited comment on SPARK-17185 at 8/22/16 2:59 PM: Changing the API is a bad idea and we should not do this. Maybe these changes might help (I'm not sure): * add {{aggregateByKey}} and {{aggregateBy}} to {{Dataset}}, which do exactly the same things as {{groupByKey}} and {{groupBy}} do now. * The return values of {{aggregateByKey}} and {{aggregateBy}} should be two new classes: {{KeyValueAggregatedDataset}} and {{RelationalAggregatedDataset}}, which are copies of {{KeyValueGroupedDataset}} and {{RelationalGroupedDataset}} as they are now. * add new methods to get a key-list pair to the classes {{KeyValueGroupedDataset}} and {{RelationalGroupedDataset}}, and maybe deprecate the methods that do aggregation in these two classes was (Author: zasdfgbnm): Changing the API is a bad idea and we should not do this. Maybe these changes might help (I'm not sure): * add {{aggregateByKey}} and {{aggregateBy}} to {{Dataset}}, which do exactly the same things as {{groupByKey}} and {{groupBy}} do now. * The return values of {{aggregateByKey}} and {{aggregateBy}} should be two new classes: {{KeyValueAggregatedDataset}} and {{RelationalAggregatedDataset}}, which are copies of {{KeyValueGroupedDataset}} and {{RelationalGroupedDataset}} as they are now. * add new methods to get a key-list pair to the classes {{KeyValueGroupedDataset}} and {{RelationalAggregatedDataset}}, and maybe deprecate the methods that do aggregation in these two classes > Unify naming of API for RDD and Dataset > --- > > Key: SPARK-17185 > URL: https://issues.apache.org/jira/browse/SPARK-17185 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Reporter: Xiang Gao >Priority: Minor > > In {{RDD}}, {{groupByKey}} is used to generate a key-list pair and > {{aggregateByKey}} is used to do aggregation. > In {{Dataset}}, aggregation is done by {{groupBy}} and {{groupByKey}}, and no > API for key-list pairs is provided. > The same name {{groupBy}} is designed to do different things and this might > be confusing. Besides, it would be more convenient to provide an API to > generate key-list pairs for {{Dataset}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17187) Support using arbitrary Java object as internal aggregation buffer object
Sean Zhong created SPARK-17187: -- Summary: Support using arbitrary Java object as internal aggregation buffer object Key: SPARK-17187 URL: https://issues.apache.org/jira/browse/SPARK-17187 Project: Spark Issue Type: New Feature Components: SQL Reporter: Sean Zhong *Background* For aggregation functions like sum and count, Spark-Sql internally uses an aggregation buffer to store the intermediate aggregation result for all aggregation functions. Each aggregation function will occupy a section of the aggregation buffer. *Problem* Currently, Spark-sql only allows a small set of Spark-Sql supported storage data types to be stored in the aggregation buffer, which is not very convenient or performant; there are several typical cases: 1. If the aggregation has a complex model like CountMinSketch, it is not very easy to convert the complex model so that it can be stored with the limited Spark-sql supported data types. 2. It is hard to reuse aggregation class definitions defined in existing libraries like algebird. 3. It may introduce heavy serialization/deserialization cost when converting a domain model to a Spark sql supported data type. For example, the current implementation of `TypedAggregateExpression` requires serialization/de-serialization for each call of update or merge. *Proposal* We propose: 1. Introduces a TypedImperativeAggregate which allows using an arbitrary java object as the aggregation buffer, with requirements like: a. It is flexible enough that the API allows using any java object as the aggregation buffer, so that it is easier to integrate with existing Monoid libraries like algebird. b. We don't need to call serialize/deserialize for each call of update/merge. Instead, only a few serialization/deserialization operations are needed. This is to guarantee the performance. 2. Refactors `TypedAggregateExpression` to use this new interface, to get higher performance. 3. Implements Appro-Percentile and other aggregation functions which have a complex aggregation object with this new interface. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
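A rough Scala sketch of the interface shape being proposed (method names and signatures here are assumptions for illustration, not the final API):
{code}
import org.apache.spark.sql.catalyst.InternalRow

// The aggregation buffer is an arbitrary object T; serialization happens only
// at exchange boundaries (shuffle/spill), not on every update or merge call.
abstract class TypedImperativeAggregate[T] {
  def createAggregationBuffer(): T
  def update(buffer: T, input: InternalRow): T
  def merge(buffer: T, other: T): T
  def eval(buffer: T): Any
  def serialize(buffer: T): Array[Byte]
  def deserialize(bytes: Array[Byte]): T
}
{code}
This shape would let a Monoid-style object from a library like algebird serve directly as the buffer, with serialize/deserialize invoked only a handful of times per partition.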
[jira] [Updated] (SPARK-17187) Support using arbitrary Java object as internal aggregation buffer object
[ https://issues.apache.org/jira/browse/SPARK-17187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated SPARK-17187: --- Description: *Background* For aggregation functions like sum and count, Spark-Sql internally uses an aggregation buffer to store the intermediate aggregation result for all aggregation functions. Each aggregation function will occupy a section of the aggregation buffer. *Problem* Currently, Spark-sql only allows a small set of Spark-Sql supported storage data types to be stored in the aggregation buffer, which is not very convenient or performant; there are several typical cases: 1. If the aggregation has a complex model like CountMinSketch, it is not very easy to convert the complex model so that it can be stored with the limited Spark-sql supported data types. 2. It is hard to reuse aggregation class definitions defined in existing libraries like algebird. 3. It may introduce heavy serialization/deserialization cost when converting a domain model to a Spark sql supported data type. For example, the current implementation of `TypedAggregateExpression` requires serialization/de-serialization for each call of update or merge. *Proposal* We propose: 1. Introduces a TypedImperativeAggregate which allows using an arbitrary java object as the aggregation buffer, with requirements like: - It is flexible enough that the API allows using any java object as the aggregation buffer, so that it is easier to integrate with existing Monoid libraries like algebird. - We don't need to call serialize/deserialize for each call of update/merge. Instead, only a few serialization/deserialization operations are needed. This is to guarantee the performance. 2. Refactors `TypedAggregateExpression` to use this new interface, to get higher performance. 3. Implements Appro-Percentile and other aggregation functions which have a complex aggregation object with this new interface. was: *Background* For aggregation functions like sum and count, Spark-Sql internally uses an aggregation buffer to store the intermediate aggregation result for all aggregation functions. Each aggregation function will occupy a section of the aggregation buffer. *Problem* Currently, Spark-sql only allows a small set of Spark-Sql supported storage data types to be stored in the aggregation buffer, which is not very convenient or performant; there are several typical cases: 1. If the aggregation has a complex model like CountMinSketch, it is not very easy to convert the complex model so that it can be stored with the limited Spark-sql supported data types. 2. It is hard to reuse aggregation class definitions defined in existing libraries like algebird. 3. It may introduce heavy serialization/deserialization cost when converting a domain model to a Spark sql supported data type. For example, the current implementation of `TypedAggregateExpression` requires serialization/de-serialization for each call of update or merge. *Proposal* We propose: 1. Introduces a TypedImperativeAggregate which allows using an arbitrary java object as the aggregation buffer, with requirements like: - It is flexible enough that the API allows using any java object as the aggregation buffer, so that it is easier to integrate with existing Monoid libraries like algebird. - We don't need to call serialize/deserialize for each call of update/merge. Instead, only a few serialization/deserialization operations are needed. This is to guarantee the performance. 2. Refactors `TypedAggregateExpression` to use this new interface, to get higher performance. 3. 
Implements Appro-Percentile and other aggregation functions which have a complex aggregation object with this new interface. > Support using arbitrary Java object as internal aggregation buffer object > - > > Key: SPARK-17187 > URL: https://issues.apache.org/jira/browse/SPARK-17187 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Sean Zhong > > *Background* > For aggregation functions like sum and count, Spark-Sql internally uses an > aggregation buffer to store the intermediate aggregation result for all > aggregation functions. Each aggregation function will occupy a section of > the aggregation buffer. > *Problem* > Currently, Spark-sql only allows a small set of Spark-Sql supported storage > data types to be stored in the aggregation buffer, which is not very convenient > or performant; there are several typical cases: > 1. If the aggregation has a complex model like CountMinSketch, it is not very easy > to convert the complex model so that it can be stored with the limited Spark-sql > supported data types. > 2. It is hard to reuse aggregation class definitions defined in existing > libraries like algebird. > 3. It may introduce heavy serialization/deserializat
[jira] [Updated] (SPARK-17187) Support using arbitrary Java object as internal aggregation buffer object
[ https://issues.apache.org/jira/browse/SPARK-17187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated SPARK-17187: --- Description: *Background* For aggregation functions like sum and count, Spark-Sql internally uses an aggregation buffer to store the intermediate aggregation result for all aggregation functions. Each aggregation function will occupy a section of the aggregation buffer. *Problem* Currently, Spark-sql only allows a small set of Spark-Sql supported storage data types to be stored in the aggregation buffer, which is not very convenient or performant; there are several typical cases: 1. If the aggregation has a complex model like CountMinSketch, it is not very easy to convert the complex model so that it can be stored with the limited Spark-sql supported data types. 2. It is hard to reuse aggregation class definitions defined in existing libraries like algebird. 3. It may introduce heavy serialization/deserialization cost when converting a domain model to a Spark sql supported data type. For example, the current implementation of `TypedAggregateExpression` requires serialization/de-serialization for each call of update or merge. *Proposal* We propose: 1. Introduces a TypedImperativeAggregate which allows using an arbitrary java object as the aggregation buffer, with requirements like: - It is flexible enough that the API allows using any java object as the aggregation buffer, so that it is easier to integrate with existing Monoid libraries like algebird. - We don't need to call serialize/deserialize for each call of update/merge. Instead, only a few serialization/deserialization operations are needed. This is to guarantee the performance. 2. Refactors `TypedAggregateExpression` to use this new interface, to get higher performance. 3. Implements Appro-Percentile and other aggregation functions which have a complex aggregation object with this new interface. was: *Background* For aggregation functions like sum and count, Spark-Sql internally uses an aggregation buffer to store the intermediate aggregation result for all aggregation functions. Each aggregation function will occupy a section of the aggregation buffer. *Problem* Currently, Spark-sql only allows a small set of Spark-Sql supported storage data types to be stored in the aggregation buffer, which is not very convenient or performant; there are several typical cases: 1. If the aggregation has a complex model like CountMinSketch, it is not very easy to convert the complex model so that it can be stored with the limited Spark-sql supported data types. 2. It is hard to reuse aggregation class definitions defined in existing libraries like algebird. 3. It may introduce heavy serialization/deserialization cost when converting a domain model to a Spark sql supported data type. For example, the current implementation of `TypedAggregateExpression` requires serialization/de-serialization for each call of update or merge. *Proposal* We propose: 1. Introduces a TypedImperativeAggregate which allows using an arbitrary java object as the aggregation buffer, with requirements like: a. It is flexible enough that the API allows using any java object as the aggregation buffer, so that it is easier to integrate with existing Monoid libraries like algebird. b. We don't need to call serialize/deserialize for each call of update/merge. Instead, only a few serialization/deserialization operations are needed. This is to guarantee the performance. 2. Refactors `TypedAggregateExpression` to use this new interface, to get higher performance. 3. 
Implements Appro-Percentile and other aggregation functions which have a complex aggregation object with this new interface. > Support using arbitrary Java object as internal aggregation buffer object > - > > Key: SPARK-17187 > URL: https://issues.apache.org/jira/browse/SPARK-17187 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Sean Zhong > > *Background* > For aggregation functions like sum and count, Spark-Sql internally uses an > aggregation buffer to store the intermediate aggregation result for all > aggregation functions. Each aggregation function will occupy a section of > the aggregation buffer. > *Problem* > Currently, Spark-sql only allows a small set of Spark-Sql supported storage > data types to be stored in the aggregation buffer, which is not very convenient > or performant; there are several typical cases: > 1. If the aggregation has a complex model like CountMinSketch, it is not very easy > to convert the complex model so that it can be stored with the limited Spark-sql > supported data types. > 2. It is hard to reuse aggregation class definitions defined in existing > libraries like algebird. > 3. It may introduce heavy serialization/deserializat
[jira] [Commented] (SPARK-17172) pyspark hiveContext cannot create UDF: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
[ https://issues.apache.org/jira/browse/SPARK-17172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15431018#comment-15431018 ] Andrew Davidson commented on SPARK-17172: - Hi Sean, It should be very easy to use the attached notebook to reproduce the hive bug. I got the code example from a blog; the original code worked in spark 1.5.x. I also attached an html version of the notebook so you can see the entire stack trace without having to start jupyter. Thanks, Andy > pyspark hiveContext cannot create UDF: Py4JJavaError: An error occurred while > calling None.org.apache.spark.sql.hive.HiveContext. > -- > > Key: SPARK-17172 > URL: https://issues.apache.org/jira/browse/SPARK-17172 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.2 > Environment: spark version: 1.6.2 > python version: 3.4.2 (v3.4.2:ab2c023a9432, Oct 5 2014, 20:42:22) > [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] >Reporter: Andrew Davidson > Attachments: hiveUDFBug.html, hiveUDFBug.ipynb > >
> from pyspark.sql import HiveContext
> sqlContext = HiveContext(sc)
> # Define udf
> from pyspark.sql.functions import udf
> from pyspark.sql.types import StringType
> def scoreToCategory(score):
>     if score >= 80: return 'A'
>     elif score >= 60: return 'B'
>     elif score >= 35: return 'C'
>     else: return 'D'
>
> udfScoreToCategory=udf(scoreToCategory, StringType())
> throws exception > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext. > : java.lang.RuntimeException: java.lang.RuntimeException: Unable to > instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17172) pyspark hiveContext cannot create UDF: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
[ https://issues.apache.org/jira/browse/SPARK-17172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15431030#comment-15431030 ] Sean Owen commented on SPARK-17172: --- It shows this error: "You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly". I think that's the cause. Did you build with Hive support? (Despite the message, I think the more direct way to do it is -Phive.) > pyspark hiveContext cannot create UDF: Py4JJavaError: An error occurred while > calling None.org.apache.spark.sql.hive.HiveContext. > -- > > Key: SPARK-17172 > URL: https://issues.apache.org/jira/browse/SPARK-17172 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.2 > Environment: spark version: 1.6.2 > python version: 3.4.2 (v3.4.2:ab2c023a9432, Oct 5 2014, 20:42:22) > [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] >Reporter: Andrew Davidson > Attachments: hiveUDFBug.html, hiveUDFBug.ipynb > >
> from pyspark.sql import HiveContext
> sqlContext = HiveContext(sc)
> # Define udf
> from pyspark.sql.functions import udf
> from pyspark.sql.types import StringType
> def scoreToCategory(score):
>     if score >= 80: return 'A'
>     elif score >= 60: return 'B'
>     elif score >= 35: return 'C'
>     else: return 'D'
>
> udfScoreToCategory=udf(scoreToCategory, StringType())
> throws exception > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext. > : java.lang.RuntimeException: java.lang.RuntimeException: Unable to > instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17187) Support using arbitrary Java object as internal aggregation buffer object
[ https://issues.apache.org/jira/browse/SPARK-17187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15431034#comment-15431034 ] Apache Spark commented on SPARK-17187: -- User 'clockfly' has created a pull request for this issue: https://github.com/apache/spark/pull/14753 > Support using arbitrary Java object as internal aggregation buffer object > - > > Key: SPARK-17187 > URL: https://issues.apache.org/jira/browse/SPARK-17187 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Sean Zhong > > *Background* > For aggregation functions like sum and count, Spark-Sql internally uses an > aggregation buffer to store the intermediate aggregation result for all > aggregation functions. Each aggregation function will occupy a section of > the aggregation buffer. > *Problem* > Currently, Spark-sql only allows a small set of Spark-Sql supported storage > data types to be stored in the aggregation buffer, which is not very convenient > or performant; there are several typical cases: > 1. If the aggregation has a complex model like CountMinSketch, it is not very easy > to convert the complex model so that it can be stored with the limited Spark-sql > supported data types. > 2. It is hard to reuse aggregation class definitions defined in existing > libraries like algebird. > 3. It may introduce heavy serialization/deserialization cost when converting > a domain model to a Spark sql supported data type. For example, the current > implementation of `TypedAggregateExpression` requires > serialization/de-serialization for each call of update or merge. > *Proposal* > We propose: > 1. Introduces a TypedImperativeAggregate which allows using an arbitrary java > object as the aggregation buffer, with requirements like: > - It is flexible enough that the API allows using any java object as the > aggregation buffer, so that it is easier to integrate with existing Monoid > libraries like algebird. > - We don't need to call serialize/deserialize for each call of > update/merge. Instead, only a few serialization/deserialization operations > are needed. This is to guarantee the performance. > 2. Refactors `TypedAggregateExpression` to use this new interface, to get > higher performance. > 3. Implements Appro-Percentile and other aggregation functions which have a > complex aggregation object with this new interface. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17187) Support using arbitrary Java object as internal aggregation buffer object
[ https://issues.apache.org/jira/browse/SPARK-17187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17187: Assignee: (was: Apache Spark) > Support using arbitrary Java object as internal aggregation buffer object > - > > Key: SPARK-17187 > URL: https://issues.apache.org/jira/browse/SPARK-17187 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Sean Zhong > > *Background* > For aggregation functions like sum and count, Spark-Sql internally uses an > aggregation buffer to store the intermediate aggregation result for all > aggregation functions. Each aggregation function will occupy a section of > the aggregation buffer. > *Problem* > Currently, Spark-sql only allows a small set of Spark-Sql supported storage > data types to be stored in the aggregation buffer, which is not very convenient > or performant; there are several typical cases: > 1. If the aggregation has a complex model like CountMinSketch, it is not very easy > to convert the complex model so that it can be stored with the limited Spark-sql > supported data types. > 2. It is hard to reuse aggregation class definitions defined in existing > libraries like algebird. > 3. It may introduce heavy serialization/deserialization cost when converting > a domain model to a Spark sql supported data type. For example, the current > implementation of `TypedAggregateExpression` requires > serialization/de-serialization for each call of update or merge. > *Proposal* > We propose: > 1. Introduces a TypedImperativeAggregate which allows using an arbitrary java > object as the aggregation buffer, with requirements like: > - It is flexible enough that the API allows using any java object as the > aggregation buffer, so that it is easier to integrate with existing Monoid > libraries like algebird. > - We don't need to call serialize/deserialize for each call of > update/merge. Instead, only a few serialization/deserialization operations > are needed. This is to guarantee the performance. > 2. Refactors `TypedAggregateExpression` to use this new interface, to get > higher performance. > 3. Implements Appro-Percentile and other aggregation functions which have a > complex aggregation object with this new interface. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17164) Query with colon in the table name fails to parse in 2.0
[ https://issues.apache.org/jira/browse/SPARK-17164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15431031#comment-15431031 ] Sital Kedia commented on SPARK-17164: - Thanks [~rxin], [~hvanhovell], that makes sense. The issue is on our side. For Spark 1.6 we were using our internal version of the Hive parser, which supports table names with colons. Since Spark 2.0 no longer uses the Hive parser, this broke on our side. We will use backticks for such tables to work around this issue. > Query with colon in the table name fails to parse in 2.0 > > > Key: SPARK-17164 > URL: https://issues.apache.org/jira/browse/SPARK-17164 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Sital Kedia > > Running a simple query with a colon in the table name fails to parse in 2.0
> {code}
> == SQL ==
> SELECT * FROM a:b
> ---^^^
> at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
> at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
> at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:46)
> at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
> at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:682)
> ... 48 elided
> {code}
> Please note that this is a regression from Spark 1.6 as the query runs fine > in 1.6. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-17164) Query with colon in the table name fails to parse in 2.0
[ https://issues.apache.org/jira/browse/SPARK-17164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sital Kedia closed SPARK-17164. --- Resolution: Won't Fix > Query with colon in the table name fails to parse in 2.0 > > > Key: SPARK-17164 > URL: https://issues.apache.org/jira/browse/SPARK-17164 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Sital Kedia > > Running a simple query with a colon in the table name fails to parse in 2.0
> {code}
> == SQL ==
> SELECT * FROM a:b
> ---^^^
> at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
> at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
> at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:46)
> at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
> at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:682)
> ... 48 elided
> {code}
> Please note that this is a regression from Spark 1.6 as the query runs fine > in 1.6. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17187) Support using arbitrary Java object as internal aggregation buffer object
[ https://issues.apache.org/jira/browse/SPARK-17187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17187: Assignee: Apache Spark > Support using arbitrary Java object as internal aggregation buffer object > - > > Key: SPARK-17187 > URL: https://issues.apache.org/jira/browse/SPARK-17187 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Sean Zhong >Assignee: Apache Spark > > *Background* > For aggregation functions like sum and count, Spark-Sql internally uses an > aggregation buffer to store the intermediate aggregation result for all > aggregation functions. Each aggregation function will occupy a section of > the aggregation buffer. > *Problem* > Currently, Spark-sql only allows a small set of Spark-Sql supported storage > data types to be stored in the aggregation buffer, which is not very convenient > or performant; there are several typical cases: > 1. If the aggregation has a complex model like CountMinSketch, it is not very easy > to convert the complex model so that it can be stored with the limited Spark-sql > supported data types. > 2. It is hard to reuse aggregation class definitions defined in existing > libraries like algebird. > 3. It may introduce heavy serialization/deserialization cost when converting > a domain model to a Spark sql supported data type. For example, the current > implementation of `TypedAggregateExpression` requires > serialization/de-serialization for each call of update or merge. > *Proposal* > We propose: > 1. Introduces a TypedImperativeAggregate which allows using an arbitrary java > object as the aggregation buffer, with requirements like: > - It is flexible enough that the API allows using any java object as the > aggregation buffer, so that it is easier to integrate with existing Monoid > libraries like algebird. > - We don't need to call serialize/deserialize for each call of > update/merge. Instead, only a few serialization/deserialization operations > are needed. This is to guarantee the performance. > 2. Refactors `TypedAggregateExpression` to use this new interface, to get > higher performance. > 3. Implements Appro-Percentile and other aggregation functions which have a > complex aggregation object with this new interface. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16593) Provide a pre-fetch mechanism to accelerate shuffle stage.
[ https://issues.apache.org/jira/browse/SPARK-16593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15431070#comment-15431070 ] Biao Ma commented on SPARK-16593: - I have made new commits. > Provide a pre-fetch mechanism to accelerate shuffle stage. > -- > > Key: SPARK-16593 > URL: https://issues.apache.org/jira/browse/SPARK-16593 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Biao Ma >Priority: Minor > Labels: features > > Currently, the `NettyBlockRpcServer` reads data through the BlockManager; when > a block is not cached in memory, the data has to be read from DISK first and > then loaded into MEM. I wonder if we could add a message that contains the same > blockIds as the openBlock message but is sent one loop ahead, so that the > `NettyBlockRpcServer` can have the blocks ready for transfer to the reduce side. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
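A rough Scala sketch of the proposed shape (all names here are hypothetical, not Spark's actual network protocol):
{code}
// Hypothetical message sent one fetch round ahead of the real OpenBlocks
// request, asking the server to warm the listed blocks (e.g. page them in
// from disk) before they are actually transferred.
case class PrefetchBlocks(appId: String, execId: String, blockIds: Seq[String])

class PrefetchHandler(warmBlock: String => Unit) {
  def handle(msg: PrefetchBlocks): Unit =
    msg.blockIds.foreach(warmBlock) // touch each block so it is memory-resident
}
{code}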
[jira] [Updated] (SPARK-17188) Moves QuantileSummaries to project catalyst from sql so that it can be used to implement percentile_approx
[ https://issues.apache.org/jira/browse/SPARK-17188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated SPARK-17188: --- Description: org.apache.spark.sql.execution.stat > Moves QuantileSummaries to project catalyst from sql so that it can be used > to implement percentile_approx > -- > > Key: SPARK-17188 > URL: https://issues.apache.org/jira/browse/SPARK-17188 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Sean Zhong > > org.apache.spark.sql.execution.stat -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17188) Moves QuantileSummaries to project catalyst from sql so that it can be used to implement percentile_approx
Sean Zhong created SPARK-17188: -- Summary: Moves QuantileSummaries to project catalyst from sql so that it can be used to implement percentile_approx Key: SPARK-17188 URL: https://issues.apache.org/jira/browse/SPARK-17188 Project: Spark Issue Type: Sub-task Reporter: Sean Zhong -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17188) Moves QuantileSummaries to project catalyst from sql so that it can be used to implement percentile_approx
[ https://issues.apache.org/jira/browse/SPARK-17188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated SPARK-17188: --- Description: QuantileSummaries is a useful utility class for statistics. It can be used by aggregation functions like percentile_approx. Currently, QuantileSummaries is located in the sql project, in package org.apache.spark.sql.execution.stat; we should probably move it to the catalyst project, package org.apache.spark.sql.util. was:org.apache.spark.sql.execution.stat > Moves QuantileSummaries to project catalyst from sql so that it can be used > to implement percentile_approx > -- > > Key: SPARK-17188 > URL: https://issues.apache.org/jira/browse/SPARK-17188 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Sean Zhong > > QuantileSummaries is a useful utility class for statistics. It can be used > by aggregation functions like percentile_approx. > Currently, QuantileSummaries is located in the sql project, in package > org.apache.spark.sql.execution.stat; we should probably move it to the > catalyst project, package org.apache.spark.sql.util. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
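To make the intended use concrete, here is a hedged Scala sketch of how a percentile_approx-style aggregate could drive a quantile summary once it lives in catalyst. The insert/compress/query trait below follows the common quantile-sketch pattern and is an assumption, not the verbatim QuantileSummaries API:

{code}
// Assumed interface in the Greenwald-Khanna style; method names are
// assumptions, not the exact QuantileSummaries signatures.
trait QuantileSketch {
  def insert(value: Double): QuantileSketch // record one observation
  def compress(): QuantileSketch            // merge samples to bound memory use
  def query(quantile: Double): Double       // approximate quantile for q in [0, 1]
}

// How a percentile_approx-style aggregate could consume a partition of values.
object ApproxPercentileSketch {
  def approxPercentile(values: Iterator[Double], q: Double, empty: QuantileSketch): Double = {
    var sketch = empty
    var count = 0L
    values.foreach { v =>
      sketch = sketch.insert(v)
      count += 1
      if (count % 10000 == 0) sketch = sketch.compress() // periodic compaction
    }
    sketch.compress().query(q)
  }
}
{code}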
[jira] [Commented] (SPARK-16283) Implement percentile_approx SQL function
[ https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15431136#comment-15431136 ] Sean Zhong commented on SPARK-16283: Created a sub-task to move QuantileSummaries to package org.apache.spark.sql.util of the catalyst project > Implement percentile_approx SQL function > > > Key: SPARK-16283 > URL: https://issues.apache.org/jira/browse/SPARK-16283 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16283) Implement percentile_approx SQL function
[ https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15431136#comment-15431136 ] Sean Zhong edited comment on SPARK-16283 at 8/22/16 4:35 PM: - Created a sub-task SPARK-17188 to move QuantileSummaries to package org.apache.spark.sql.util of the catalyst project was (Author: clockfly): Created a sub-task to move QuantileSummaries to package org.apache.spark.sql.util of the catalyst project > Implement percentile_approx SQL function > > > Key: SPARK-16283 > URL: https://issues.apache.org/jira/browse/SPARK-16283 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17110) Pyspark with locality ANY throws java.io.StreamCorruptedException
[ https://issues.apache.org/jira/browse/SPARK-17110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15431139#comment-15431139 ] Jonathan Alvarado commented on SPARK-17110: --- Is there a workaround for this issue? I'm affected by it. > Pyspark with locality ANY throws java.io.StreamCorruptedException > > > Key: SPARK-17110 > URL: https://issues.apache.org/jira/browse/SPARK-17110 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 > Environment: Cluster of 2 AWS r3.xlarge nodes launched via ec2 > scripts, Spark 2.0.0, hadoop: yarn, pyspark shell >Reporter: Tomer Kaftan >Priority: Critical > > In Pyspark 2.0.0, any task that accesses cached data non-locally throws a > StreamCorruptedException like the stacktrace below: > {noformat} > WARN TaskSetManager: Lost task 7.0 in stage 2.0 (TID 26, 172.31.26.184): > java.io.StreamCorruptedException: invalid stream header: 12010A80 > at > java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:807) > at java.io.ObjectInputStream.<init>(ObjectInputStream.java:302) > at > org.apache.spark.serializer.JavaDeserializationStream$$anon$1.<init>(JavaSerializer.scala:63) > at > org.apache.spark.serializer.JavaDeserializationStream.<init>(JavaSerializer.scala:63) > at > org.apache.spark.serializer.JavaSerializerInstance.deserializeStream(JavaSerializer.scala:122) > at > org.apache.spark.serializer.SerializerManager.dataDeserializeStream(SerializerManager.scala:146) > at > org.apache.spark.storage.BlockManager$$anonfun$getRemoteValues$1.apply(BlockManager.scala:524) > at > org.apache.spark.storage.BlockManager$$anonfun$getRemoteValues$1.apply(BlockManager.scala:522) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.storage.BlockManager.getRemoteValues(BlockManager.scala:522) > at org.apache.spark.storage.BlockManager.get(BlockManager.scala:609) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:661) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:281) > at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {noformat} > The simplest way I have found to reproduce this is by running the following > code in the pyspark shell, on a cluster of 2 nodes set to use only one worker > core each: > {code} > x = sc.parallelize([1, 1, 1, 1, 1, 1000, 1, 1, 1], numSlices=9).cache() > x.count() > import time > def waitMap(x): > time.sleep(x) > return x > x.map(waitMap).count() > {code} > Or by running the following via spark-submit: > {code} > from pyspark import SparkContext > sc = SparkContext() > x = sc.parallelize([1, 1, 1, 1, 1, 1000, 1, 1, 1], numSlices=9).cache() > x.count() > import time > def waitMap(x): > time.sleep(x) > return x > x.map(waitMap).count() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17188) Moves QuantileSummaries to project catalyst from sql so that it can be used to implement percentile_approx
[ https://issues.apache.org/jira/browse/SPARK-17188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15431179#comment-15431179 ] Apache Spark commented on SPARK-17188: -- User 'clockfly' has created a pull request for this issue: https://github.com/apache/spark/pull/14754 > Moves QuantileSummaries to project catalyst from sql so that it can be used > to implement percentile_approx > -- > > Key: SPARK-17188 > URL: https://issues.apache.org/jira/browse/SPARK-17188 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Sean Zhong > > QuantileSummaries is a useful utility class for statistics. It can be used > by aggregation functions like percentile_approx. > Currently, QuantileSummaries is located in the sql project, in package > org.apache.spark.sql.execution.stat; we should probably move it to the > catalyst project, package org.apache.spark.sql.util. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17188) Moves QuantileSummaries to project catalyst from sql so that it can be used to implement percentile_approx
[ https://issues.apache.org/jira/browse/SPARK-17188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17188: Assignee: (was: Apache Spark) > Moves QuantileSummaries to project catalyst from sql so that it can be used > to implement percentile_approx > -- > > Key: SPARK-17188 > URL: https://issues.apache.org/jira/browse/SPARK-17188 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Sean Zhong > > QuantileSummaries is a useful utility class for statistics. It can be used > by aggregation functions like percentile_approx. > Currently, QuantileSummaries is located in the sql project, in package > org.apache.spark.sql.execution.stat; we should probably move it to the > catalyst project, package org.apache.spark.sql.util. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17188) Moves QuantileSummaries to project catalyst from sql so that it can be used to implement percentile_approx
[ https://issues.apache.org/jira/browse/SPARK-17188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17188: Assignee: Apache Spark > Moves QuantileSummaries to project catalyst from sql so that it can be used > to implement percentile_approx > -- > > Key: SPARK-17188 > URL: https://issues.apache.org/jira/browse/SPARK-17188 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Sean Zhong >Assignee: Apache Spark > > QuantileSummaries is a useful utility class for statistics. It can be used > by aggregation functions like percentile_approx. > Currently, QuantileSummaries is located in the sql project, in package > org.apache.spark.sql.execution.stat; we should probably move it to the > catalyst project, package org.apache.spark.sql.util. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17189) [MINOR] Loosens the interface from UnsafeRow to InternalRow in AggregationIterator when no UnsafeRow-specific method is used
Sean Zhong created SPARK-17189: -- Summary: [MINOR] Loosens the interface from UnsafeRow to InternalRow in AggregationIterator when no UnsafeRow-specific method is used Key: SPARK-17189 URL: https://issues.apache.org/jira/browse/SPARK-17189 Project: Spark Issue Type: Bug Reporter: Sean Zhong Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
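The change presumably amounts to widening parameter types wherever only InternalRow-level methods are called. A schematic before/after in Scala, with invented method names (the actual AggregationIterator signatures may differ):

{code}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.UnsafeRow

object AggregationIteratorSketch {
  // Before: the parameter demands UnsafeRow even though the body only calls
  // methods defined on InternalRow.
  def processBufferBefore(buffer: UnsafeRow): Long = buffer.getLong(0)

  // After: UnsafeRow extends InternalRow, so the loosened signature accepts
  // both while removing an unnecessary coupling to the unsafe representation.
  def processBufferAfter(buffer: InternalRow): Long = buffer.getLong(0)
}
{code}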