[jira] [Commented] (SPARK-3630) Identify cause of Kryo+Snappy PARSING_ERROR
[ https://issues.apache.org/jira/browse/SPARK-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430215#comment-15430215 ] DjvuLee commented on SPARK-3630: How much data did you test with? We encounter this error in our production; our data is about several TB. The Spark version is 1.6.1, and the snappy version is 1.1.2.4. > Identify cause of Kryo+Snappy PARSING_ERROR > --- > > Key: SPARK-3630 > URL: https://issues.apache.org/jira/browse/SPARK-3630 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 1.1.0, 1.2.0 >Reporter: Andrew Ash >Assignee: Josh Rosen > > A recent GraphX commit caused non-deterministic exceptions in unit tests so > it was reverted (see SPARK-3400). > Separately, [~aash] observed the same exception stacktrace in an > application-specific Kryo registrator: > {noformat} > com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to > uncompress the chunk: PARSING_ERROR(2) > com.esotericsoftware.kryo.io.Input.fill(Input.java:142) > com.esotericsoftware.kryo.io.Input.require(Input.java:169) > com.esotericsoftware.kryo.io.Input.readInt(Input.java:325) > com.esotericsoftware.kryo.io.Input.readFloat(Input.java:624) > com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:127) > > com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:117) > > com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109) > > com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18) > > com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > ... > {noformat} > This ticket is to identify the cause of the exception in the GraphX commit so > the faulty commit can be fixed and merged back into master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3630) Identify cause of Kryo+Snappy PARSING_ERROR
[ https://issues.apache.org/jira/browse/SPARK-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430215#comment-15430215 ] DjvuLee edited comment on SPARK-3630 at 8/22/16 7:10 AM: - Can I ask how much data you tested with? We encounter this error in our production; our data is about several TB. The Spark version is 1.6.1, and the snappy version is 1.1.2.4. When the data is small, we never encounter this error. was (Author: djvulee): How much data did you test with? We encounter this error in our production; our data is about several TB. The Spark version is 1.6.1, and the snappy version is 1.1.2.4. > Identify cause of Kryo+Snappy PARSING_ERROR > --- > > Key: SPARK-3630 > URL: https://issues.apache.org/jira/browse/SPARK-3630 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 1.1.0, 1.2.0 >Reporter: Andrew Ash >Assignee: Josh Rosen > > A recent GraphX commit caused non-deterministic exceptions in unit tests so > it was reverted (see SPARK-3400). > Separately, [~aash] observed the same exception stacktrace in an > application-specific Kryo registrator: > {noformat} > com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to > uncompress the chunk: PARSING_ERROR(2) > com.esotericsoftware.kryo.io.Input.fill(Input.java:142) > com.esotericsoftware.kryo.io.Input.require(Input.java:169) > com.esotericsoftware.kryo.io.Input.readInt(Input.java:325) > com.esotericsoftware.kryo.io.Input.readFloat(Input.java:624) > com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:127) > > com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:117) > > com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109) > > com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18) > > com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > ... > {noformat} > This ticket is to identify the cause of the exception in the GraphX commit so > the faulty commit can be fixed and merged back into master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-3630) Identify cause of Kryo+Snappy PARSING_ERROR
[ https://issues.apache.org/jira/browse/SPARK-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DjvuLee updated SPARK-3630: --- Comment: was deleted (was: How much data did you test with? We encounter this error in our production; our data is about several TB. The Spark version is 1.6.1, and the snappy version is 1.1.2.4.) > Identify cause of Kryo+Snappy PARSING_ERROR > --- > > Key: SPARK-3630 > URL: https://issues.apache.org/jira/browse/SPARK-3630 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 1.1.0, 1.2.0 >Reporter: Andrew Ash >Assignee: Josh Rosen > > A recent GraphX commit caused non-deterministic exceptions in unit tests so > it was reverted (see SPARK-3400). > Separately, [~aash] observed the same exception stacktrace in an > application-specific Kryo registrator: > {noformat} > com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to > uncompress the chunk: PARSING_ERROR(2) > com.esotericsoftware.kryo.io.Input.fill(Input.java:142) > com.esotericsoftware.kryo.io.Input.require(Input.java:169) > com.esotericsoftware.kryo.io.Input.readInt(Input.java:325) > com.esotericsoftware.kryo.io.Input.readFloat(Input.java:624) > com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:127) > > com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:117) > > com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109) > > com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18) > > com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > ... > {noformat} > This ticket is to identify the cause of the exception in the GraphX commit so > the faulty commit can be fixed and merged back into master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
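[Editor's note] For anyone hitting the same PARSING_ERROR, a quick diagnostic sketch (not a fix, and not the resolution of this ticket) is to take snappy out of the picture and see whether the failure follows the codec or the data. This assumes a job where you control the SparkConf:
{code}
// Diagnostic sketch: switch block compression from snappy to lz4.
// If the job still fails with corrupt chunks, the corruption is likely
// upstream of decompression rather than in snappy-java itself.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("parsing-error-triage")
  .set("spark.io.compression.codec", "lz4")
val sc = new SparkContext(conf)
{code}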
[jira] [Commented] (SPARK-5770) Use addJar() to upload a new jar file to executor, it can't be added to classloader
[ https://issues.apache.org/jira/browse/SPARK-5770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430243#comment-15430243 ] marymwu commented on SPARK-5770: Hey, we have run into the same issue too. We tried to fix it but failed. Can anybody help with this issue? Thanks so much! > Use addJar() to upload a new jar file to executor, it can't be added to > classloader > --- > > Key: SPARK-5770 > URL: https://issues.apache.org/jira/browse/SPARK-5770 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: meiyoula >Priority: Minor > > First use addJar() to upload a jar to the executor, then change the jar > content and upload it again. We can see that the local jar file has been > updated, but the classloader still loads the old one. The executor log has no > error or exception pointing to it. > I used spark-shell to test it, with "spark.files.overwrite" set to true. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
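[Editor's note] A common workaround sketch for this behavior, since executor classloaders will not re-define a class they have already loaded: ship updated code under a new, versioned file name instead of overwriting the old jar. The paths below are hypothetical.
{code}
// Workaround sketch, not the fix for this ticket: a new jar name gives the
// executor classloader a fresh entry instead of a stale cached class.
sc.addJar("/opt/jobs/myLib-1.0.0.jar")   // first deployment
// ... later, after rebuilding the jar with changed classes ...
sc.addJar("/opt/jobs/myLib-1.0.1.jar")   // new name => new classpath entry
{code}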
[jira] [Resolved] (SPARK-15285) Generated SpecificSafeProjection.apply method grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-15285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-15285. - Resolution: Fixed Fix Version/s: (was: 2.0.0) 2.1.0 2.0.1 > Generated SpecificSafeProjection.apply method grows beyond 64 KB > > > Key: SPARK-15285 > URL: https://issues.apache.org/jira/browse/SPARK-15285 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Konstantin Shaposhnikov >Assignee: Kazuaki Ishizaki > Fix For: 2.0.1, 2.1.0 > > > The following code snippet results in > {noformat} > org.codehaus.janino.JaninoRuntimeException: Code of method > "(Ljava/lang/Object;)Ljava/lang/Object;" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection" > grows beyond 64 KB > at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941) > {noformat} > {code} > case class S100(s1:String="1", s2:String="2", s3:String="3", s4:String="4", > s5:String="5", s6:String="6", s7:String="7", s8:String="8", s9:String="9", > s10:String="10", s11:String="11", s12:String="12", s13:String="13", > s14:String="14", s15:String="15", s16:String="16", s17:String="17", > s18:String="18", s19:String="19", s20:String="20", s21:String="21", > s22:String="22", s23:String="23", s24:String="24", s25:String="25", > s26:String="26", s27:String="27", s28:String="28", s29:String="29", > s30:String="30", s31:String="31", s32:String="32", s33:String="33", > s34:String="34", s35:String="35", s36:String="36", s37:String="37", > s38:String="38", s39:String="39", s40:String="40", s41:String="41", > s42:String="42", s43:String="43", s44:String="44", s45:String="45", > s46:String="46", s47:String="47", s48:String="48", s49:String="49", > s50:String="50", s51:String="51", s52:String="52", s53:String="53", > s54:String="54", s55:String="55", s56:String="56", s57:String="57", > s58:String="58", s59:String="59", s60:String="60", s61:String="61", > s62:String="62", s63:String="63", s64:String="64", s65:String="65", > s66:String="66", s67:String="67", s68:String="68", s69:String="69", > s70:String="70", s71:String="71", s72:String="72", s73:String="73", > s74:String="74", s75:String="75", s76:String="76", s77:String="77", > s78:String="78", s79:String="79", s80:String="80", s81:String="81", > s82:String="82", s83:String="83", s84:String="84", s85:String="85", > s86:String="86", s87:String="87", s88:String="88", s89:String="89", > s90:String="90", s91:String="91", s92:String="92", s93:String="93", > s94:String="94", s95:String="95", s96:String="96", s97:String="97", > s98:String="98", s99:String="99", s100:String="100") > case class S(s1: S100=S100(), s2: S100=S100(), s3: S100=S100(), s4: > S100=S100(), s5: S100=S100(), s6: S100=S100(), s7: S100=S100(), s8: > S100=S100(), s9: S100=S100(), s10: S100=S100()) > val ds = Seq(S(),S(),S()).toDS > ds.show() > {code} > I could reproduce this with Spark built from 1.6 branch and with > https://home.apache.org/~pwendell/spark-nightly/spark-master-bin/spark-2.0.0-SNAPSHOT-2016_05_11_01_03-8beae59-bin/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17168) CSV with header is incorrectly read if file is partitioned
[ https://issues.apache.org/jira/browse/SPARK-17168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430258#comment-15430258 ] Sean Owen commented on SPARK-17168: --- It's a tough call. I can imagine for example a process ingesting lines of a huge CSV file and outputting them after some generic transformation. One file, with one header, may become many files ... of which only the first has a header. It's unclear whether that or having headers in every file is 'normal'. I'm not sure it's easy to implement, but I could imagine skipping the first line of any file that matches the first line of the first file. > CSV with header is incorrectly read if file is partitioned > -- > > Key: SPARK-17168 > URL: https://issues.apache.org/jira/browse/SPARK-17168 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Mathieu D >Priority: Minor > > If a CSV file is stored in a partitioned fashion, DataFrameReader.csv > with the header option set to true skips the first line of *each partition* > instead of skipping only the first one. > ex: > {code} > // create a partitioned CSV file with header: > val rdd=sc.parallelize(Seq("hdr","1","2","3","4","5","6"), numSlices=2) > rdd.saveAsTextFile("foo") > {code} > Now, if we try to read it with DataFrameReader, the first row of the 2nd > partition is skipped. > {code} > val df = spark.read.option("header","true").csv("foo") > df.show > +---+ > |hdr| > +---+ > | 1| > | 2| > | 4| > | 5| > | 6| > +---+ > // one row is missing > {code} > I more or less understand that this is to be consistent with the save > operation of DataFrameWriter, which saves a header on each individual partition. > But this is very error-prone. In our case, we have large CSV files with > headers already stored in a partitioned way, so we will lose rows if we read > with header set to true. So we have to manually handle the headers. > I suggest a tri-valued option for header, with something like > "skipOnFirstPartition" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
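[Editor's note] A workaround sketch along the lines Sean describes, while the semantics are debated: read the files as plain text, drop every line equal to the header, and build the DataFrame yourself instead of relying on header inference. Note that this also drops any data row that happens to equal the header exactly; the single-column layout mirrors the "hdr" example above.
{code}
// Hedged workaround: tolerate a header in any, all, or only the first
// part-file by filtering header lines out before parsing.
import spark.implicits._

val lines = spark.sparkContext.textFile("foo")
val header = lines.first()                  // "hdr"
val df = lines.filter(_ != header).toDF(header)
df.show()
{code}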
[jira] [Commented] (SPARK-17090) Make tree aggregation level in linear/logistic regression configurable
[ https://issues.apache.org/jira/browse/SPARK-17090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430266#comment-15430266 ] Apache Spark commented on SPARK-17090: -- User 'hqzizania' has created a pull request for this issue: https://github.com/apache/spark/pull/14738 > Make tree aggregation level in linear/logistic regression configurable > -- > > Key: SPARK-17090 > URL: https://issues.apache.org/jira/browse/SPARK-17090 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson >Priority: Minor > Fix For: 2.1.0 > > > Linear/logistic regression use treeAggregate with the default aggregation depth > for collecting coefficient gradient updates to the driver. For high-dimensional > problems, this can cause OOM errors on the driver. We should make > it configurable, perhaps via an expert param, so that users can avoid this > problem if their data has many features. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
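[Editor's note] For context, a minimal sketch of the knob being exposed here: RDD.treeAggregate already accepts an aggregation depth, so the change is about plumbing a param through to it. The values below are illustrative.
{code}
// treeAggregate with depth > 2 adds intermediate combine levels, so fewer
// (and smaller) partial results are merged on the driver at once.
val grads = sc.parallelize(1 to 1000000).map(_.toDouble)
val total = grads.treeAggregate(0.0)(
  seqOp = (acc, g) => acc + g,   // fold one record into a partial sum
  combOp = (a, b) => a + b,      // merge two partial sums
  depth = 3)                     // one extra tree level before the driver
{code}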
[jira] [Resolved] (SPARK-17127) Include AArch64 in the check of cached unaligned-access capability
[ https://issues.apache.org/jira/browse/SPARK-17127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17127. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14700 [https://github.com/apache/spark/pull/14700] > Include AArch64 in the check of cached unaligned-access capability > -- > > Key: SPARK-17127 > URL: https://issues.apache.org/jira/browse/SPARK-17127 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 > Environment: AArch64 >Reporter: Richael Zhuang >Priority: Minor > Fix For: 2.1.0 > > > As of Spark 2.0.0, when MemoryMode.OFF_HEAP is set, Spark checks whether > the architecture supports unaligned access. If the check > doesn't pass, an exception is raised. > AArch64 also supports unaligned access, but currently only i386, x86, > amd64, and x86_64 are included. > I think we should include aarch64 when performing the check. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17127) Include AArch64 in the check of cached unaligned-access capability
[ https://issues.apache.org/jira/browse/SPARK-17127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-17127: -- Assignee: Richael Zhuang > Include AArch64 in the check of cached unaligned-access capability > -- > > Key: SPARK-17127 > URL: https://issues.apache.org/jira/browse/SPARK-17127 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 > Environment: AArch64 >Reporter: Richael Zhuang >Assignee: Richael Zhuang >Priority: Minor > Fix For: 2.1.0 > > > As of Spark 2.0.0, when MemoryMode.OFF_HEAP is set, Spark checks whether > the architecture supports unaligned access. If the check > doesn't pass, an exception is raised. > AArch64 also supports unaligned access, but currently only i386, x86, > amd64, and x86_64 are included. > I think we should include aarch64 when performing the check. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17086) QuantileDiscretizer throws InvalidArgumentException (parameter splits given invalid value) on valid data
[ https://issues.apache.org/jira/browse/SPARK-17086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430270#comment-15430270 ] Apache Spark commented on SPARK-17086: -- User 'VinceShieh' has created a pull request for this issue: https://github.com/apache/spark/pull/14747 > QuantileDiscretizer throws InvalidArgumentException (parameter splits given > invalid value) on valid data > > > Key: SPARK-17086 > URL: https://issues.apache.org/jira/browse/SPARK-17086 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Barry Becker > > I discovered this bug when working with a build from the master branch (which > I believe is 2.1.0). This used to work fine when running Spark 1.6.2. > I have a dataframe with an "intData" column that has values like > {code} > 1 3 2 1 1 2 3 2 2 2 1 3 > {code} > I have a stage in my pipeline that uses the QuantileDiscretizer to produce > equal-weight splits like this > {code} > new QuantileDiscretizer() > .setInputCol("intData") > .setOutputCol("intData_bin") > .setNumBuckets(10) > .fit(df) > {code} > But when that gets run it (incorrectly) throws this error: > {code} > parameter splits given invalid value [-Infinity, 1.0, 1.0, 2.0, 2.0, 3.0, > 3.0, Infinity] > {code} > I don't think that there should be duplicate splits generated, should there be? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
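[Editor's note] Until the linked PR lands, a workaround sketch assuming Spark 2.0-style APIs and a numeric input column: compute the quantiles directly, collapse the duplicate split points that heavily tied data produces, and hand the distinct splits to Bucketizer.
{code}
// Hedged workaround: build the splits manually so ties cannot produce the
// "parameter splits given invalid value" failure seen above.
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.functions.col

val dfd = df.withColumn("intData", col("intData").cast("double"))
val probes = (0 to 10).map(_ / 10.0).toArray
val quantiles = dfd.stat.approxQuantile("intData", probes, 0.0)
val splits = (Double.NegativeInfinity +: quantiles :+ Double.PositiveInfinity)
  .distinct.sorted
val binned = new Bucketizer()
  .setInputCol("intData")
  .setOutputCol("intData_bin")
  .setSplits(splits)
  .transform(dfd)
{code}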
[jira] [Assigned] (SPARK-17086) QuantileDiscretizer throws InvalidArgumentException (parameter splits given invalid value) on valid data
[ https://issues.apache.org/jira/browse/SPARK-17086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17086: Assignee: (was: Apache Spark) > QuantileDiscretizer throws InvalidArgumentException (parameter splits given > invalid value) on valid data > > > Key: SPARK-17086 > URL: https://issues.apache.org/jira/browse/SPARK-17086 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Barry Becker > > I discovered this bug when working with a build from the master branch (which > I believe is 2.1.0). This used to work fine when running Spark 1.6.2. > I have a dataframe with an "intData" column that has values like > {code} > 1 3 2 1 1 2 3 2 2 2 1 3 > {code} > I have a stage in my pipeline that uses the QuantileDiscretizer to produce > equal-weight splits like this > {code} > new QuantileDiscretizer() > .setInputCol("intData") > .setOutputCol("intData_bin") > .setNumBuckets(10) > .fit(df) > {code} > But when that gets run it (incorrectly) throws this error: > {code} > parameter splits given invalid value [-Infinity, 1.0, 1.0, 2.0, 2.0, 3.0, > 3.0, Infinity] > {code} > I don't think that there should be duplicate splits generated, should there be? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17086) QuantileDiscretizer throws InvalidArgumentException (parameter splits given invalid value) on valid data
[ https://issues.apache.org/jira/browse/SPARK-17086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17086: Assignee: Apache Spark > QuantileDiscretizer throws InvalidArgumentException (parameter splits given > invalid value) on valid data > > > Key: SPARK-17086 > URL: https://issues.apache.org/jira/browse/SPARK-17086 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Barry Becker >Assignee: Apache Spark > > I discovered this bug when working with a build from the master branch (which > I believe is 2.1.0). This used to work fine when running Spark 1.6.2. > I have a dataframe with an "intData" column that has values like > {code} > 1 3 2 1 1 2 3 2 2 2 1 3 > {code} > I have a stage in my pipeline that uses the QuantileDiscretizer to produce > equal-weight splits like this > {code} > new QuantileDiscretizer() > .setInputCol("intData") > .setOutputCol("intData_bin") > .setNumBuckets(10) > .fit(df) > {code} > But when that gets run it (incorrectly) throws this error: > {code} > parameter splits given invalid value [-Infinity, 1.0, 1.0, 2.0, 2.0, 3.0, > 3.0, Infinity] > {code} > I don't think that there should be duplicate splits generated, should there be? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15285) Generated SpecificSafeProjection.apply method grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-15285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430273#comment-15430273 ] Dongjoon Hyun commented on SPARK-15285: --- Thank you! > Generated SpecificSafeProjection.apply method grows beyond 64 KB > > > Key: SPARK-15285 > URL: https://issues.apache.org/jira/browse/SPARK-15285 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Konstantin Shaposhnikov >Assignee: Kazuaki Ishizaki > Fix For: 2.0.1, 2.1.0 > > > The following code snippet results in > {noformat} > org.codehaus.janino.JaninoRuntimeException: Code of method > "(Ljava/lang/Object;)Ljava/lang/Object;" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection" > grows beyond 64 KB > at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941) > {noformat} > {code} > case class S100(s1:String="1", s2:String="2", s3:String="3", s4:String="4", > s5:String="5", s6:String="6", s7:String="7", s8:String="8", s9:String="9", > s10:String="10", s11:String="11", s12:String="12", s13:String="13", > s14:String="14", s15:String="15", s16:String="16", s17:String="17", > s18:String="18", s19:String="19", s20:String="20", s21:String="21", > s22:String="22", s23:String="23", s24:String="24", s25:String="25", > s26:String="26", s27:String="27", s28:String="28", s29:String="29", > s30:String="30", s31:String="31", s32:String="32", s33:String="33", > s34:String="34", s35:String="35", s36:String="36", s37:String="37", > s38:String="38", s39:String="39", s40:String="40", s41:String="41", > s42:String="42", s43:String="43", s44:String="44", s45:String="45", > s46:String="46", s47:String="47", s48:String="48", s49:String="49", > s50:String="50", s51:String="51", s52:String="52", s53:String="53", > s54:String="54", s55:String="55", s56:String="56", s57:String="57", > s58:String="58", s59:String="59", s60:String="60", s61:String="61", > s62:String="62", s63:String="63", s64:String="64", s65:String="65", > s66:String="66", s67:String="67", s68:String="68", s69:String="69", > s70:String="70", s71:String="71", s72:String="72", s73:String="73", > s74:String="74", s75:String="75", s76:String="76", s77:String="77", > s78:String="78", s79:String="79", s80:String="80", s81:String="81", > s82:String="82", s83:String="83", s84:String="84", s85:String="85", > s86:String="86", s87:String="87", s88:String="88", s89:String="89", > s90:String="90", s91:String="91", s92:String="92", s93:String="93", > s94:String="94", s95:String="95", s96:String="96", s97:String="97", > s98:String="98", s99:String="99", s100:String="100") > case class S(s1: S100=S100(), s2: S100=S100(), s3: S100=S100(), s4: > S100=S100(), s5: S100=S100(), s6: S100=S100(), s7: S100=S100(), s8: > S100=S100(), s9: S100=S100(), s10: S100=S100()) > val ds = Seq(S(),S(),S()).toDS > ds.show() > {code} > I could reproduce this with Spark built from 1.6 branch and with > https://home.apache.org/~pwendell/spark-nightly/spark-master-bin/spark-2.0.0-SNAPSHOT-2016_05_11_01_03-8beae59-bin/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17172) pyspark hiveContext cannot create UDF: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
[ https://issues.apache.org/jira/browse/SPARK-17172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17172. --- Resolution: Duplicate That sounds like exactly the same issue. You're missing /tmp or can't see it because of a permissions issue. Both sound like an environment problem, because this dir exists and is world-writable after setting up any HDFS namenode. At the least, let's merge into your other issue. > pyspark hiveContext cannot create UDF: Py4JJavaError: An error occurred while > calling None.org.apache.spark.sql.hive.HiveContext. > -- > > Key: SPARK-17172 > URL: https://issues.apache.org/jira/browse/SPARK-17172 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.2 > Environment: spark version: 1.6.2 > python version: 3.4.2 (v3.4.2:ab2c023a9432, Oct 5 2014, 20:42:22) > [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] >Reporter: Andrew Davidson > Attachments: hiveUDFBug.html, hiveUDFBug.ipynb > > > from pyspark.sql import HiveContext > sqlContext = HiveContext(sc) > # Define udf > from pyspark.sql.types import StringType > from pyspark.sql.functions import udf > def scoreToCategory(score): > if score >= 80: return 'A' > elif score >= 60: return 'B' > elif score >= 35: return 'C' > else: return 'D' > > udfScoreToCategory=udf(scoreToCategory, StringType()) > throws an exception: > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext. > : java.lang.RuntimeException: java.lang.RuntimeException: Unable to > instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17143) pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp
[ https://issues.apache.org/jira/browse/SPARK-17143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430278#comment-15430278 ] Sean Owen commented on SPARK-17143: --- This sounds like an HDFS environment problem then. This dir would exist and be writable to all users when HDFS's file system is created. > pyspark unable to create UDF: java.lang.RuntimeException: > org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a > directory: /tmp tmp > --- > > Key: SPARK-17143 > URL: https://issues.apache.org/jira/browse/SPARK-17143 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 > Environment: spark version: 1.6.1 > python version: 3.4.3 (default, Apr 1 2015, 18:10:40) > [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] >Reporter: Andrew Davidson > Attachments: udfBug.html, udfBug.ipynb > > > For an unknown reason I cannot create a UDF when I run the attached notebook on > my cluster. I get the following error: > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext. > : java.lang.RuntimeException: > org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a > directory: /tmp tmp > The notebook runs fine on my Mac. > In general I am able to run non-UDF Spark code without any trouble. > I start the notebook server as the user "ec2-user" and use the master URL > spark://ec2-51-215-120-63.us-west-1.compute.amazonaws.com:6066 > I found the following message in the notebook server log file. I have the log > level set to warn: > 16/08/18 21:38:45 WARN ObjectStore: Version information not found in > metastore. hive.metastore.schema.verification is not enabled so recording the > schema version 1.2.0 > 16/08/18 21:38:45 WARN ObjectStore: Failed to get database default, returning > NoSuchObjectException > The cluster was originally created using > spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2 > #from pyspark.sql import SQLContext, HiveContext > #sqlContext = SQLContext(sc) > > #from pyspark.sql import DataFrame > #from pyspark.sql import functions > > from pyspark.sql.types import StringType > from pyspark.sql.functions import udf > > print("spark version: {}".format(sc.version)) > > import sys > print("python version: {}".format(sys.version)) > spark version: 1.6.1 > python version: 3.4.3 (default, Apr 1 2015, 18:10:40) > [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] > # functions.lower() raises > # py4j.Py4JException: Method lower([class java.lang.String]) does not exist > # workaround: define a UDF > toLowerUDFRetType = StringType() > #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType) > toLowerUDF = udf(lambda s : s.lower(), StringType()) > You must build Spark with Hive.
Export 'SPARK_HIVE=true' and run build/sbt > assembly > Py4JJavaErrorTraceback (most recent call last) > in () > 4 toLowerUDFRetType = StringType() > 5 #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType) > > 6 toLowerUDF = udf(lambda s : s.lower(), StringType()) > /root/spark/python/pyspark/sql/functions.py in udf(f, returnType) >1595 [Row(slen=5), Row(slen=3)] >1596 """ > -> 1597 return UserDefinedFunction(f, returnType) >1598 >1599 blacklist = ['map', 'since', 'ignore_unicode_prefix'] > /root/spark/python/pyspark/sql/functions.py in __init__(self, func, > returnType, name) >1556 self.returnType = returnType >1557 self._broadcast = None > -> 1558 self._judf = self._create_judf(name) >1559 >1560 def _create_judf(self, name): > /root/spark/python/pyspark/sql/functions.py in _create_judf(self, name) >1567 pickled_command, broadcast_vars, env, includes = > _prepare_for_python_RDD(sc, command, self) >1568 ctx = SQLContext.getOrCreate(sc) > -> 1569 jdt = ctx._ssql_ctx.parseDataType(self.returnType.json()) >1570 if name is None: >1571 name = f.__name__ if hasattr(f, '__name__') else > f.__class__.__name__ > /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self) > 681 try: > 682 if not hasattr(self, '_scala_HiveContext'): > --> 683 self._scala_HiveContext = self._get_hive_ctx() > 684 return self._scala_HiveContext > 685 except Py4JError as e: > /root/spark/python/pyspark/sql/context.py in _get_hive_ctx(self) > 690 > 691 def _get_hive_ctx(self): > --> 692 return self._jvm.HiveContext(self._jsc.sc()) > 693 > 694 def refreshTable(self, tableName): > /root/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in > __call__(self, *args) >1062 answer = self._gateway_client.sen
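[Editor's note] A sketch of the environment fix implied by Sean's comment, assuming you can run code with HDFS superuser rights: make sure /tmp exists in HDFS as a directory with the world-writable sticky-bit mode (1777) that a fresh namenode setup provides.
{code}
// Hedged repair sketch for the "Parent path is not a directory: /tmp tmp"
// failure: recreate /tmp in HDFS with mode 1777.
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.fs.permission.FsPermission

val fs = FileSystem.get(sc.hadoopConfiguration)
val tmp = new Path("/tmp")
if (!fs.exists(tmp)) fs.mkdirs(tmp)
fs.setPermission(tmp, new FsPermission(Integer.parseInt("1777", 8).toShort))
{code}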
[jira] [Resolved] (SPARK-17115) Improve the performance of UnsafeProjection for wide table
[ https://issues.apache.org/jira/browse/SPARK-17115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-17115. - Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Issue resolved by pull request 14692 [https://github.com/apache/spark/pull/14692] > Improve the performance of UnsafeProjection for wide table > -- > > Key: SPARK-17115 > URL: https://issues.apache.org/jira/browse/SPARK-17115 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.1, 2.1.0 > > > We increased the threshold for splitting the generated code for expressions to > 64K in 2.0 by accident (too optimistic), which could cause bad performance > (the huge method may not be JITed); we should decrease it to 16K. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17169) To use scala macros to update code when SharedParamsCodeGen.scala changed
[ https://issues.apache.org/jira/browse/SPARK-17169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430282#comment-15430282 ] Yanbo Liang commented on SPARK-17169: - Meanwhile, it would be better to do compile-time code-gen for the Python params as well, that is, to run {{python _shared_params_code_gen.py > shared.py}} automatically. > To use scala macros to update code when SharedParamsCodeGen.scala changed > - > > Key: SPARK-17169 > URL: https://issues.apache.org/jira/browse/SPARK-17169 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Qian Huang >Priority: Minor > > As noted in the file SharedParamsCodeGen.scala, we have to manually run > build/sbt "mllib/runMain org.apache.spark.ml.param.shared.SharedParamsCodeGen" > to regenerate and update it. > It would be better to do compile-time code-gen for this using Scala macros > rather than running the script as described above. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10925) Exception when joining DataFrames
[ https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430290#comment-15430290 ] Alexander Bij commented on SPARK-10925: --- We also encountered this issue with *Spark 1.6.1* (HDP 2.4.2.0). Our example: {code} // id, datum are String columns. val mt = hiveContext.sql("SELECT id, datum FROM my_test") // create aggregated dataframe: val my_max = mt.groupBy("id").agg(max("datum")).withColumnRenamed("max(datum)", "datum") // Fails (start with Aggregation-DataFrame) val my_max_mt = my_max.join(mt, my_max("datum") === mt("datum"), "inner") // Works (start with normal-DataFrame) val my_max_mt = mt.join(my_max, my_max("datum") === mt("datum"), "inner") // running these queries as SQL works both ways. {code} > Exception when joining DataFrames > - > > Key: SPARK-10925 > URL: https://issues.apache.org/jira/browse/SPARK-10925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 > Environment: Tested with Spark 1.5.0 and Spark 1.5.1 >Reporter: Alexis Seigneurin > Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase2.scala > > > I get an exception when joining a DataFrame with another DataFrame. The > second DataFrame was created by performing an aggregation on the first > DataFrame. > My complete workflow is: > # read the DataFrame > # apply a UDF on column "name" > # apply a UDF on column "surname" > # apply a UDF on column "birthDate" > # aggregate on "name" and re-join with the DF > # aggregate on "surname" and re-join with the DF > If I remove one step, the process completes normally. > Here is the exception: > {code} > Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved > attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in > operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS > birthDate_cleaned#8]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at >
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark
[jira] [Comment Edited] (SPARK-10925) Exception when joining DataFrames
[ https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430290#comment-15430290 ] Alexander Bij edited comment on SPARK-10925 at 8/22/16 8:29 AM: We also encountered this issue with *Spark 1.6.1* (HDP 2.4.2.0). Our example: {code} // id, datum are String columns. val mt = hiveContext.sql("SELECT id, datum FROM my_test") // create aggregated dataframe: val my_max = mt.groupBy("id").agg(max("datum")).withColumnRenamed("max(datum)", "datum") // Fails (start with Aggregation-DataFrame) val my_max_mt = my_max.join(mt, my_max("datum") === mt("datum"), "inner") // Works (start with normal-DataFrame) val my_max_mt = mt.join(my_max, my_max("datum") === mt("datum"), "inner") // running these queries as SQL works both ways. {code} {code} // Complaining about the datum#526 which is the 'mt("datum")' in the query: org.apache.spark.sql.AnalysisException: resolved attribute(s) datum#526 missing from id#525,datum#528,id#532,datum#533 in operator !Join Inner, Some((datum#528 = datum#526)); at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:183) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:121) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50) {code} was (Author: abij): We also encountered this issue with *Spark 1.6.1* (HDP 2.4.2.0). Our example: {code} // id, datum are String columns. val mt = hiveContext.sql("SELECT id, datum FROM my_test") // create aggregated dataframe: val my_max = mt.groupBy("id").agg(max("datum")).withColumnRenamed("max(datum)", "datum") // Fails (start with Aggregation-DataFrame) val my_max_mt = my_max.join(mt, my_max("datum") === mt("datum"), "inner") // Works (start with normal-DataFrame) val my_max_mt = mt.join(my_max, my_max("datum") === mt("datum"), "inner") // running these queries as SQL works both ways. {code} > Exception when joining DataFrames > - > > Key: SPARK-10925 > URL: https://issues.apache.org/jira/browse/SPARK-10925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 > Environment: Tested with Spark 1.5.0 and Spark 1.5.1 >Reporter: Alexis Seigneurin > Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase2.scala > > > I get an exception when joining a DataFrame with another DataFrame. The > second DataFrame was created by performing an aggregation on the first > DataFrame. > My complete workflow is: > # read the DataFrame > # apply a UDF on column "name" > # apply a UDF on column "surname" > # apply a UDF on column "birthDate" > # aggregate on "name" and re-join with the DF > # aggregate on "surname" and re-join with the DF > If I remove one step, the process completes normally.
> Here is the exception: > {code} > Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved > attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in > operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS > birthDate_cleaned#8]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apac
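[Editor's note] A workaround sketch for the resolution failure above, reusing the variable names from the 1.6.1 example: give one side of the self-derived join fresh column names so the analyzer sees distinct attribute lineages, then join on the renamed column. The "_r" suffix is illustrative.
{code}
// Hedged workaround: re-aliasing every column on one side avoids the
// ambiguous "datum" attribute between the two lineages of mt.
import org.apache.spark.sql.functions.col

val mtR = mt.select(mt.columns.map(c => col(c).as(c + "_r")): _*)
val joined = my_max.join(mtR, my_max("datum") === mtR("datum_r"), "inner")
{code}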
[jira] [Updated] (SPARK-17085) Documentation and actual code differs - Unsupported Operations
[ https://issues.apache.org/jira/browse/SPARK-17085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-17085: -- Assignee: Jagadeesan A S > Documentation and actual code differs - Unsupported Operations > -- > > Key: SPARK-17085 > URL: https://issues.apache.org/jira/browse/SPARK-17085 > Project: Spark > Issue Type: Documentation > Components: Streaming >Affects Versions: 2.0.0 >Reporter: Samritti >Assignee: Jagadeesan A S >Priority: Minor > > The Spark Structured Streaming doc at this link > https://spark.apache.org/docs/2.0.0/structured-streaming-programming-guide.html#unsupported-operations > mentions > >>>"Right outer join with a streaming Dataset on the right is not supported" > but the code here conveys a different/opposite error > https://github.com/apache/spark/blob/5545b791096756b07b3207fb3de13b68b9a37b00/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationChecker.scala#L114 > >>>"Right outer join with a streaming DataFrame/Dataset on the left is " + > "not supported" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17085) Documentation and actual code differs - Unsupported Operations
[ https://issues.apache.org/jira/browse/SPARK-17085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17085. --- Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Resolved by https://github.com/apache/spark/pull/14715 > Documentation and actual code differs - Unsupported Operations > -- > > Key: SPARK-17085 > URL: https://issues.apache.org/jira/browse/SPARK-17085 > Project: Spark > Issue Type: Documentation > Components: Streaming >Affects Versions: 2.0.0 >Reporter: Samritti >Assignee: Jagadeesan A S >Priority: Minor > Fix For: 2.0.1, 2.1.0 > > > The Spark Structured Streaming doc at this link > https://spark.apache.org/docs/2.0.0/structured-streaming-programming-guide.html#unsupported-operations > mentions > >>>"Right outer join with a streaming Dataset on the right is not supported" > but the code here conveys a different/opposite error > https://github.com/apache/spark/blob/5545b791096756b07b3207fb3de13b68b9a37b00/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationChecker.scala#L114 > >>>"Right outer join with a streaming DataFrame/Dataset on the left is " + > "not supported" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10925) Exception when joining DataFrames
[ https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430302#comment-15430302 ] Alexander Bij commented on SPARK-10925: --- This relates to SPARK-14948 (exception when joining the same DF). > Exception when joining DataFrames > - > > Key: SPARK-10925 > URL: https://issues.apache.org/jira/browse/SPARK-10925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 > Environment: Tested with Spark 1.5.0 and Spark 1.5.1 >Reporter: Alexis Seigneurin > Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase2.scala > > > I get an exception when joining a DataFrame with another DataFrame. The > second DataFrame was created by performing an aggregation on the first > DataFrame. > My complete workflow is: > # read the DataFrame > # apply a UDF on column "name" > # apply a UDF on column "surname" > # apply a UDF on column "birthDate" > # aggregate on "name" and re-join with the DF > # aggregate on "surname" and re-join with the DF > If I remove one step, the process completes normally. > Here is the exception: > {code} > Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved > attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in > operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS > birthDate_cleaned#8]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at >
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark.sql.DataFrame.join(DataFrame.scala:553) > at org.apache.spark.sql.DataFrame.join(DataFrame.scala:520) > at TestCase2$.main(TestCase2.scala:51) > at TestCase2.main(TestCase2.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at com.inte
[jira] [Updated] (SPARK-17179) Consider improving partition pruning in HiveMetastoreCatalog
[ https://issues.apache.org/jira/browse/SPARK-17179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated SPARK-17179: --- Affects Version/s: 2.0.0 Priority: Major (was: Critical) Description: Issue: - Create an external table with 1000s of partitions - Running a simple query with partition details ends up listing all files for caching in ListingFileCatalog. This turns out to be very slow with cloud-based FS access (e.g. S3). Even though ListingFileCatalog supports multi-threading, it ends up unnecessarily listing 1000+ files when the user is just interested in 1 partition. - This adds additional overhead in HiveMetastoreCatalog, as it queries all partitions in convertToLogicalRelation (metastoreRelation.getHiveQlPartitions()). Partition-related details are not passed in here, so it ends up overloading the Hive metastore. - Also, if any partition changes, the cache is dirtied and has to be re-populated. It would be nice to prune the partitions in the metastore layer itself, so that only a few partitions are looked up via the FileSystem and only a few items are cached. {noformat} "CREATE EXTERNAL TABLE `ca_par_ext`( `customer_id` bigint, `account_id` bigint) PARTITIONED BY ( `effective_date` date) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 's3a://bucket_details/ca_par'" explain select count(*) from ca_par_ext where effective_date between '2015-12-17' and '2015-12-18'; {noformat} > Consider improving partition pruning in HiveMetastoreCatalog > > > Key: SPARK-17179 > URL: https://issues.apache.org/jira/browse/SPARK-17179 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Rajesh Balamohan > > Issue: > - Create an external table with 1000s of partitions > - Running a simple query with partition details ends up listing all files for > caching in ListingFileCatalog. This turns out to be very slow with > cloud-based FS access (e.g. S3).
Even though ListingFileCatalog supports > multi-threading, it ends up unnecessarily listing 1000+ files when the user > is just interested in 1 partition. > - This adds additional overhead in HiveMetastoreCatalog, as it queries all > partitions in convertToLogicalRelation > (metastoreRelation.getHiveQlPartitions()). Partition-related details > are not passed in here, so it ends up overloading the Hive metastore. > - Also, if any partition changes, the cache is dirtied and has to be > re-populated. It would be nice to prune the partitions in the metastore layer > itself, so that only a few partitions are looked up via the FileSystem and only a few > items are cached. > {noformat} > "CREATE EXTERNAL TABLE `ca_par_ext`( > `customer_id` bigint, > `account_id` bigint) > PARTITIONED BY ( > `effective_date` date) > ROW FORMAT SERDE > 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' > STORED AS INPUTFORMAT > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' > LOCATION > 's3a://bucket_details/ca_par'" > explain select count(*) from ca_par_ext where effect
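For concreteness, the kind of metastore-side pruning being asked for can be sketched with Hive's own client API, which accepts a partition filter string. This is only an illustrative Scala sketch, not Spark's HiveMetastoreCatalog code; it assumes a reachable metastore and the `ca_par_ext` table above, and note the metastore imposes its own restrictions on which filter expressions it can evaluate (historically, mostly string-typed partition columns).
{code}
// Illustrative sketch only: ask the Hive metastore for just the matching
// partitions instead of fetching all of them and listing every file.
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient

import scala.collection.JavaConverters._

val client = new HiveMetaStoreClient(new HiveConf())

// Only the pruned partitions come back, so only their directories would
// need to be listed on S3, instead of 1000+.
val pruned = client.listPartitionsByFilter(
  "default", "ca_par_ext",
  "effective_date >= '2015-12-17' and effective_date <= '2015-12-18'",
  Short.MaxValue).asScala

pruned.foreach(p => println(p.getSd.getLocation))
client.close()
{code}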
[jira] [Commented] (SPARK-14948) Exception when joining DataFrames derived from the same DataFrame
[ https://issues.apache.org/jira/browse/SPARK-14948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430312#comment-15430312 ] Alexander Bij commented on SPARK-14948: --- We encountered the same issue with Spark 1.6.1. I have posted a simple Scala example in SPARK-10925 (a duplicate of this issue). > Exception when joining DataFrames derived from the same DataFrame > - > > Key: SPARK-14948 > URL: https://issues.apache.org/jira/browse/SPARK-14948 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Saurabh Santhosh > > h2. Spark Analyser is throwing the following exception in a specific > scenario: > h2. Exception: > org.apache.spark.sql.AnalysisException: resolved attribute(s) F1#3 missing > from asd#5,F2#4,F1#6,F2#7 in operator !Project [asd#5,F1#3]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) > h2. Code : > {code:title=SparkClient.java|borderStyle=solid} > StructField[] fields = new StructField[2]; > fields[0] = new StructField("F1", DataTypes.StringType, true, > Metadata.empty()); > fields[1] = new StructField("F2", DataTypes.StringType, true, > Metadata.empty()); > JavaRDD rdd = > > sparkClient.getJavaSparkContext().parallelize(Arrays.asList(RowFactory.create("a", > "b"))); > DataFrame df = sparkClient.getSparkHiveContext().createDataFrame(rdd, new > StructType(fields)); > sparkClient.getSparkHiveContext().registerDataFrameAsTable(df, "t1"); > DataFrame aliasedDf = sparkClient.getSparkHiveContext().sql("select F1 as > asd, F2 from t1"); > sparkClient.getSparkHiveContext().registerDataFrameAsTable(aliasedDf, > "t2"); > sparkClient.getSparkHiveContext().registerDataFrameAsTable(df, "t3"); > > DataFrame join = aliasedDf.join(df, > aliasedDf.col("F2").equalTo(df.col("F2")), "inner"); > DataFrame select = join.select(aliasedDf.col("asd"), df.col("F1")); > select.collect(); > {code} > h2. Observations : > * This issue is related to the Data Type of Fields of the initial > DataFrame. (If the Data Type is not String, it will work.) > * It works fine if the data frame is registered as a temporary table and an > SQL query (select a.asd,b.F1 from t2 a inner join t3 b on a.F2=b.F2) is written. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
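The second observation above already contains the workaround; for completeness, here is a minimal Scala sketch of it (assuming a {{df}} built as in the Java snippet, on Spark 1.6 where {{registerTempTable}} is the current API): route the join through temp tables and plain SQL instead of reusing Column objects from both derived DataFrames.
{code}
// Workaround sketch: the SQL route from observation 2 avoids the
// "resolved attribute(s) ... missing" analysis error.
df.registerTempTable("t3")
df.sqlContext.sql("select F1 as asd, F2 from t3").registerTempTable("t2")

val joined = df.sqlContext.sql(
  "select a.asd, b.F1 from t2 a inner join t3 b on a.F2 = b.F2")
joined.collect()
{code}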
[jira] [Commented] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually
[ https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430314#comment-15430314 ] Sean Owen commented on SPARK-15044: --- The use case here is that someone or something deleted some data from an immutable data set that shouldn't have been deleted? That doesn't sound like a good use case. I understand the argument for trying to proceed to return a meaningful result anyway. But I'm saying there isn't a meaningful result here because the data set conceptually contains data that can no longer be correctly queried. Compare also to the risks of silently (or, with a warning that scrolls by) succeeding in returning an incomplete result. > spark-sql will throw "input path does not exist" exception if it handles a > partition which exists in hive table, but the path is removed manually > - > > Key: SPARK-15044 > URL: https://issues.apache.org/jira/browse/SPARK-15044 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: huangyu > > spark-sql will throw "input path not exist" exception if it handles a > partition which exists in hive table, but the path is removed manually.The > situation is as follows: > 1) Create a table "test". "create table test (n string) partitioned by (p > string)" > 2) Load some data into partition(p='1') > 3)Remove the path related to partition(p='1') of table test manually. "hadoop > fs -rmr /warehouse//test/p=1" > 4)Run spark sql, spark-sql -e "select n from test where p='1';" > Then it throws exception: > {code} > org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: > ./test/p=1 > at > org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285) > at > org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228) > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) > at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > {code} > The bug is in Spark 1.6.1; if I use Spark 1.4.0, it is OK. > I think spark-sql should ignore the path, just like Hive does (and as earlier Spark > versions did), rather than throw an exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org F
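Until the behavior is changed, two user-side mitigations are possible, sketched below in Scala against the Spark 1.6 API. The config {{spark.sql.hive.verifyPartitionPath}} exists in 1.6 for skipping nonexistent partition paths, though whether it fully covers this case is untested here.
{code}
// Option 1: have Spark verify partition paths and skip the missing ones.
sqlContext.setConf("spark.sql.hive.verifyPartitionPath", "true")

// Option 2: realign the metastore with the filesystem by dropping the
// partition whose directory was removed manually.
sqlContext.sql("ALTER TABLE test DROP PARTITION (p='1')")

// After either step the query should no longer throw:
sqlContext.sql("select n from test where p='1'").show()
{code}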
[jira] [Commented] (SPARK-16781) java launched by PySpark as gateway may not be the same java used in the spark environment
[ https://issues.apache.org/jira/browse/SPARK-16781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430341#comment-15430341 ] Apache Spark commented on SPARK-16781: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/14748 > java launched by PySpark as gateway may not be the same java used in the > spark environment > -- > > Key: SPARK-16781 > URL: https://issues.apache.org/jira/browse/SPARK-16781 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.2 >Reporter: Michael Berman > > When launching spark on a system with multiple javas installed, there are a > few options for choosing which JRE to use, setting `JAVA_HOME` being the most > straightforward. > However, when pyspark's internal py4j launches its JavaGateway, it always > invokes `java` directly, without qualification. This means you get whatever > java's first on your path, which is not necessarily the same one in spark's > JAVA_HOME. > This could be seen as a py4j issue, but from their point of view, the fix is > easy: make sure the java you want is first on your path. I can't figure out a > way to make that reliably happen through the pyspark executor launch path, > and it seems like something that would ideally happen automatically. If I set > JAVA_HOME when launching spark, I would expect that to be the only java used > throughout the stack. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16781) java launched by PySpark as gateway may not be the same java used in the spark environment
[ https://issues.apache.org/jira/browse/SPARK-16781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16781: Assignee: Apache Spark > java launched by PySpark as gateway may not be the same java used in the > spark environment > -- > > Key: SPARK-16781 > URL: https://issues.apache.org/jira/browse/SPARK-16781 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.2 >Reporter: Michael Berman >Assignee: Apache Spark > > When launching spark on a system with multiple javas installed, there are a > few options for choosing which JRE to use, setting `JAVA_HOME` being the most > straightforward. > However, when pyspark's internal py4j launches its JavaGateway, it always > invokes `java` directly, without qualification. This means you get whatever > java's first on your path, which is not necessarily the same one in spark's > JAVA_HOME. > This could be seen as a py4j issue, but from their point of view, the fix is > easy: make sure the java you want is first on your path. I can't figure out a > way to make that reliably happen through the pyspark executor launch path, > and it seems like something that would ideally happen automatically. If I set > JAVA_HOME when launching spark, I would expect that to be the only java used > throughout the stack. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16781) java launched by PySpark as gateway may not be the same java used in the spark environment
[ https://issues.apache.org/jira/browse/SPARK-16781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16781: Assignee: (was: Apache Spark) > java launched by PySpark as gateway may not be the same java used in the > spark environment > -- > > Key: SPARK-16781 > URL: https://issues.apache.org/jira/browse/SPARK-16781 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.2 >Reporter: Michael Berman > > When launching spark on a system with multiple javas installed, there are a > few options for choosing which JRE to use, setting `JAVA_HOME` being the most > straightforward. > However, when pyspark's internal py4j launches its JavaGateway, it always > invokes `java` directly, without qualification. This means you get whatever > java's first on your path, which is not necessarily the same one in spark's > JAVA_HOME. > This could be seen as a py4j issue, but from their point of view, the fix is > easy: make sure the java you want is first on your path. I can't figure out a > way to make that reliably happen through the pyspark executor launch path, > and it seems like something that would ideally happen automatically. If I set > JAVA_HOME when launching spark, I would expect that to be the only java used > throughout the stack. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
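The direction of the fix for SPARK-16781 can be illustrated in a few lines: resolve the gateway's java binary from JAVA_HOME first and only fall back to the bare {{java}} on the PATH. This Scala sketch just mirrors that resolution order for illustration; the real change lives on the launcher/PySpark side.
{code}
// Sketch: prefer $JAVA_HOME/bin/java, fall back to `java` on PATH.
import java.io.File

def javaExecutable(): String =
  sys.env.get("JAVA_HOME")
    .map(home => new File(new File(home, "bin"), "java").getAbsolutePath)
    .filter(path => new File(path).isFile)
    .getOrElse("java")

println(s"Would launch the gateway with: ${javaExecutable()}")
{code}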
[jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430359#comment-15430359 ] Sean Owen commented on SPARK-17055: --- Yes, I understand how model fitting works. If a label is present in test but not train data, the model will never predict it. This doesn't require any code at all; it's always true. What would you be evaluating by holding out all data with a certain label from training? > add labelKFold to CrossValidator > > > Key: SPARK-17055 > URL: https://issues.apache.org/jira/browse/SPARK-17055 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Vincent >Priority: Minor > > Current CrossValidator only supports k-fold, which randomly divides all the > samples in k groups of samples. But in cases when data is gathered from > different subjects and we want to avoid over-fitting, we want to hold out > samples with certain labels from training data and put them into validation > fold, i.e. we want to ensure that the same label is not in both testing and > training sets. > Mainstream packages like Sklearn already supports such cross validation > method. > (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430381#comment-15430381 ] Vincent commented on SPARK-17055: - well, a better model will have a better cv performance on data with unseen labels, so the final selected model will have a relatively better capability on predicting samples with unseen categories/labels in real case. > add labelKFold to CrossValidator > > > Key: SPARK-17055 > URL: https://issues.apache.org/jira/browse/SPARK-17055 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Vincent >Priority: Minor > > Current CrossValidator only supports k-fold, which randomly divides all the > samples in k groups of samples. But in cases when data is gathered from > different subjects and we want to avoid over-fitting, we want to hold out > samples with certain labels from training data and put them into validation > fold, i.e. we want to ensure that the same label is not in both testing and > training sets. > Mainstream packages like Sklearn already supports such cross validation > method. > (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17055) add labelKFold to CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430381#comment-15430381 ] Vincent edited comment on SPARK-17055 at 8/22/16 9:14 AM: -- well, a better model will have a better cv performance on validation data with unseen labels, so the final selected model will have a relatively better capability on predicting samples with unseen categories/labels in real case. was (Author: vincexie): well, a better model will have a better cv performance on data with unseen labels, so the final selected model will have a relatively better capability on predicting samples with unseen categories/labels in real case. > add labelKFold to CrossValidator > > > Key: SPARK-17055 > URL: https://issues.apache.org/jira/browse/SPARK-17055 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Vincent >Priority: Minor > > Current CrossValidator only supports k-fold, which randomly divides all the > samples in k groups of samples. But in cases when data is gathered from > different subjects and we want to avoid over-fitting, we want to hold out > samples with certain labels from training data and put them into validation > fold, i.e. we want to ensure that the same label is not in both testing and > training sets. > Mainstream packages like Sklearn already supports such cross validation > method. > (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-10110) StringIndexer lacks of parameter "handleInvalid".
[ https://issues.apache.org/jira/browse/SPARK-10110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath closed SPARK-10110. -- Resolution: Duplicate > StringIndexer lacks of parameter "handleInvalid". > - > > Key: SPARK-10110 > URL: https://issues.apache.org/jira/browse/SPARK-10110 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Kai Sasaki > Labels: ML > > Missing API for pyspark {{StringIndexer.handleInvalid}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430386#comment-15430386 ] Sean Owen commented on SPARK-17055: --- The model will always have 0% accuracy on CV / test data whose label was not in the training data. Can you give me an example that clarifies what you have in mind? I don't think this statement is true. > add labelKFold to CrossValidator > > > Key: SPARK-17055 > URL: https://issues.apache.org/jira/browse/SPARK-17055 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Vincent >Priority: Minor > > Current CrossValidator only supports k-fold, which randomly divides all the > samples in k groups of samples. But in cases when data is gathered from > different subjects and we want to avoid over-fitting, we want to hold out > samples with certain labels from training data and put them into validation > fold, i.e. we want to ensure that the same label is not in both testing and > training sets. > Mainstream packages like Sklearn already supports such cross validation > method. > (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
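To make the proposal being debated concrete, a label-aware split can be sketched outside of CrossValidator: hash every distinct label/group value to exactly one fold, so no group lands on both sides of a split. This is a hypothetical helper, not an MLlib API; it uses the Spark 2.0 SQL functions {{hash}} and {{pmod}}.
{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, hash, lit, pmod}

// Hypothetical labelKFold: each group value maps to one fold, so the
// validation side of split i contains only groups unseen in training.
def labelKFold(df: DataFrame, groupCol: String, k: Int): Seq[(DataFrame, DataFrame)] = {
  val withFold = df.withColumn("fold", pmod(hash(col(groupCol)), lit(k)))
  (0 until k).map { i =>
    (withFold.filter(col("fold") =!= i).drop("fold"),  // training folds
     withFold.filter(col("fold") === i).drop("fold"))  // validation fold
  }
}
{code}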
[jira] [Commented] (SPARK-17148) NodeManager exit because of exception “Executor is not registered”
[ https://issues.apache.org/jira/browse/SPARK-17148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430402#comment-15430402 ] Saisai Shao commented on SPARK-17148: - I manually verified this by explicitly throwing the RuntimeException in external shuffle service, from my test NM will not exit simply because of this exception. I guess the exit of NM may be due to other issues. > NodeManager exit because of exception “Executor is not registered” > -- > > Key: SPARK-17148 > URL: https://issues.apache.org/jira/browse/SPARK-17148 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.6.2 > Environment: hadoop 2.7.2 spark 1.6.2 >Reporter: cen yuhai > > java.lang.RuntimeException: Executor is not registered > (appId=application_1467288504738_1341061, execId=423) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:183) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.handleMessage(ExternalShuffleBlockHandler.java:85) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.receive(ExternalShuffleBlockHandler.java:72) > at > org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:149) > at > org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:102) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) > at > io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at 
java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16970) [spark2.0] spark2.0 doesn't catch the java exception thrown by reflect function in sql statement which causes the job abort
[ https://issues.apache.org/jira/browse/SPARK-16970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16970. --- Resolution: Not A Problem > [spark2.0] spark2.0 doesn't catch the java exception thrown by reflect > function in sql statement which causes the job abort > --- > > Key: SPARK-16970 > URL: https://issues.apache.org/jira/browse/SPARK-16970 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: marymwu > > [spark2.0] spark2.0 doesn't catch the java exception thrown by reflect > function in the sql statement which causes the job abort > steps: > 1. select reflect('java.net.URLDecoder','decode','%%E7','utf-8') test; > -->"%%" which causes the java exception > error: > 16/08/09 15:56:38 INFO DAGScheduler: Job 1 failed: run at > AccessController.java:-2, took 7.018147 s > 16/08/09 15:56:38 ERROR SparkExecuteStatementOperation: Error executing > query, currentState RUNNING, > org.apache.spark.SparkException: Job aborted due to stage failure: Task 162 > in stage 1.0 failed 8 times, most recent failure: Lost task 162.7 in stage > 1.0 (TID 207, slave7.lenovomm2.com): org.apache.spark.SparkException: Task > failed while writing rows. > at > org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.writeToFile(hiveWriterContainers.scala:330) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:131) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:131) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.reflect.InvocationTargetException > at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection.eval(CallMethodViaReflection.scala:87) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.writeToFile(hiveWriterContainers.scala:288) > ... 8 more > Caused by: java.lang.IllegalArgumentException: URLDecoder: Illegal hex > characters in escape (%) pattern - For input string: "%E" > at java.net.URLDecoder.decode(URLDecoder.java:192) > ... 19 more -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16367) Wheelhouse Support for PySpark
[ https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Semet updated SPARK-16367: -- Description: *Rationale* To deploy packages written in Scala, the recommendation is to build big fat jar files. This keeps all dependencies in one package, so the only "cost" is the copy time to deploy this file on every Spark node. On the other hand, Python deployment is more difficult once you want to use external packages, and you don't really want to mess with the IT to deploy the packages on the virtualenv of each node. This ticket proposes to allow users to deploy their job as "wheel" packages. The Python community strongly advocates this way of packaging and distributing Python applications as the standard way of deployment. In other words, this is the "Pythonic Way of Deployment". *Previous approaches* I based the current proposal on the two following bugs related to this point: - SPARK-6764 ("Wheel support for PySpark") - SPARK-13587 ("Support virtualenv in PySpark") The first part of my proposal is to merge the two, in order to support wheel installation and virtualenv creation. *Virtualenv, wheel support and "Uber Fat Wheelhouse" for PySpark* In Python, the packaging standard is now the "wheel" file format, which goes further than the good old ".egg" files. With a wheel file (".whl"), the package is already prepared for a given architecture. You can have several wheels for a given package version, each specific to an architecture or environment. For example, look at https://pypi.python.org/pypi/numpy for all the different versions of wheels available. The {{pip}} tool knows how to select the right wheel file matching the current system, and how to install the package at light speed (without compilation). Said otherwise, a package that requires compilation of a C module, for instance "numpy", does *not* compile anything when installing from a wheel file. {{pypi.python.org}} already provides wheels for major Python versions. If the wheel is not available, pip will compile it from source anyway. Mirroring of PyPI is possible through projects such as http://doc.devpi.net/latest/ (untested) or the PyPI mirror support in Artifactory (tested personally). {{pip}} also provides the ability to easily generate all the wheels of all packages used for a given project inside a "virtualenv". This is called a "wheelhouse". You can even skip the compilation entirely and retrieve the wheels directly from pypi.python.org. *Use Case 1: no internet connectivity* Here is my first proposal for a deployment workflow, for the case where the Spark cluster does not have any internet connectivity or access to a PyPI mirror. In this case the simplest way to deploy a project with several dependencies is to build and then send the complete "wheelhouse": - you are writing a PySpark script that grows in size and dependencies. Deploying on Spark, for example, requires building numpy or Theano and other dependencies - to use the "Big Fat Wheelhouse" support of PySpark, you need to turn your script into a standard Python package: -- write a {{requirements.txt}}. I recommend pinning all package versions.
You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the requirements.txt {code} astroid==1.4.6 # via pylint autopep8==1.2.4 click==6.6 # via pip-tools colorama==0.3.7 # via pylint enum34==1.1.6 # via hypothesis findspark==1.0.0 # via spark-testing-base first==2.0.1 # via pip-tools hypothesis==3.4.0 # via spark-testing-base lazy-object-proxy==1.2.2 # via astroid linecache2==1.0.0 # via traceback2 pbr==1.10.0 pep8==1.7.0 # via autopep8 pip-tools==1.6.5 py==1.4.31 # via pytest pyflakes==1.2.3 pylint==1.5.6 pytest==2.9.2 # via spark-testing-base six==1.10.0 # via astroid, pip-tools, pylint, unittest2 spark-testing-base==0.0.7.post2 traceback2==1.4.0 # via unittest2 unittest2==1.1.0 # via spark-testing-base wheel==0.29.0 wrapt==1.10.8 # via astroid {code} -- write a setup.py with some entry points or packages. Use [PBR|http://docs.openstack.org/developer/pbr/], it makes the job of maintaining a setup.py file really easy -- create a virtualenv if not already in one: {code} virtualenv env {code} -- work in your environment, define the requirements you need in {{requirements.txt}}, do all the {{pip install}} you need. - create the wheelhouse for your current project {code} pip install wheel pip wheel . --wheel-dir wheelhouse {code} This can take some time, but at the end you have all the .whl files required *for your current system* in a directory {{wheelhouse}}. - zip it into a {{wheelhouse.zip}}. Note that you can have your own package (for instance 'my_package') generated into a wheel and thus installed by {{pip}} automatically. Now comes the time to submit the project: {code} bin/spark-submit --mast
[jira] [Commented] (SPARK-17168) CSV with header is incorrectly read if file is partitioned
[ https://issues.apache.org/jira/browse/SPARK-17168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430444#comment-15430444 ] Takeshi Yamamuro commented on SPARK-17168: -- Seems it is reasonable that Spark writes a header only in a single file when saving DataFrame as csv, and it can correctly re-read them. I imagine we can implement this by remember which file has a header in metadata, a file extension, or something. On the other hand, ISTM it is kind of difficult that Spark decides which file has a header among many files users provide and Spark first reads. A `first` file is ambiguous in case of listing files in parallel https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala#L91 > CSV with header is incorrectly read if file is partitioned > -- > > Key: SPARK-17168 > URL: https://issues.apache.org/jira/browse/SPARK-17168 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Mathieu D >Priority: Minor > > If a CSV file is stored in a partitioned fashion, the DataframeReader.csv > with option header set to true skips the first line of *each partition* > instead of skipping only the first one. > ex: > {code} > // create a partitioned CSV file with header : > val rdd=sc.parallelize(Seq("hdr","1","2","3","4","5","6"), numSlices=2) > rdd.saveAsTextFile("foo") > {code} > Now, if we try to read it with DataframeReader, the first row of the 2nd > partition is skipped. > {code} > val df = spark.read.option("header","true").csv("foo") > df.show > +---+ > |hdr| > +---+ > | 1| > | 2| > | 4| > | 5| > | 6| > +---+ > // one row is missing > {code} > I more or less understand that this is to be consistent with the save > operation of dataframewriter which saves header on each individual partition. > But this is very error-prone. In our case, we have large CSV files with > headers already stored in a partitioned way, so we will lose rows if we read > with header set to true. So we have to manually handle the headers. > I suggest a tri-valued option for header, with something like > "skipOnFirstPartition" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
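Until something like the suggested tri-valued option exists, one defensive reading pattern is to skip the header option entirely and drop the header copies by value. A small Scala sketch against the {{foo}} example above (assumes a SparkSession named {{spark}} and that no data row can equal the header text):
{code}
import org.apache.spark.sql.functions.col

// Read with header disabled so every line, header copies included,
// becomes a row; then drop the rows that are just the repeated header.
val raw = spark.read.option("header", "false").csv("foo")
val data = raw.filter(col("_c0") =!= "hdr")
data.show()  // rows 1..6, none lost
{code}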
[jira] [Updated] (SPARK-15113) Add missing numFeatures & numClasses to wrapped JavaClassificationModel
[ https://issues.apache.org/jira/browse/SPARK-15113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-15113: --- Priority: Minor (was: Major) > Add missing numFeatures & numClasses to wrapped JavaClassificationModel > --- > > Key: SPARK-15113 > URL: https://issues.apache.org/jira/browse/SPARK-15113 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Assignee: holdenk >Priority: Minor > > As part of SPARK-14813 numFeatures and numClasses are missing in many models > in PySpark ML pipeline. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15113) Add missing numFeatures & numClasses to wrapped JavaClassificationModel
[ https://issues.apache.org/jira/browse/SPARK-15113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-15113: --- Assignee: holdenk > Add missing numFeatures & numClasses to wrapped JavaClassificationModel > --- > > Key: SPARK-15113 > URL: https://issues.apache.org/jira/browse/SPARK-15113 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Assignee: holdenk > > As part of SPARK-14813 numFeatures and numClasses are missing in many models > in PySpark ML pipeline. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17181) [Spark2.0 web ui] The status of certain jobs is still displayed as running even if all the stages of the job have already finished
marymwu created SPARK-17181: --- Summary: [Spark2.0 web ui] The status of certain jobs is still displayed as running even if all the stages of the job have already finished Key: SPARK-17181 URL: https://issues.apache.org/jira/browse/SPARK-17181 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.0.0 Reporter: marymwu Priority: Minor [Spark2.0 web ui] The status of certain jobs is still displayed as running even if all the stages of the job have already finished Note: not sure what kind of jobs will encounter this problem The following log shows that job 1000 has already finished, but on the Spark 2.0 web UI the status of job 1000 is still displayed as running; see the attached files 16/08/22 16:01:29 INFO DAGScheduler: dag send msg, result task done, job: 1000 16/08/22 16:01:29 INFO DAGScheduler: Job 1000 finished: run at AccessController.java:-2, took 4.664319 s -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15113) Add missing numFeatures & numClasses to wrapped JavaClassificationModel
[ https://issues.apache.org/jira/browse/SPARK-15113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-15113. Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 12889 [https://github.com/apache/spark/pull/12889] > Add missing numFeatures & numClasses to wrapped JavaClassificationModel > --- > > Key: SPARK-15113 > URL: https://issues.apache.org/jira/browse/SPARK-15113 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Assignee: holdenk >Priority: Minor > Fix For: 2.1.0 > > > As part of SPARK-14813 numFeatures and numClasses are missing in many models > in PySpark ML pipeline. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17182) CollectList and CollectSet should be marked as non-deterministic
Cheng Lian created SPARK-17182: -- Summary: CollectList and CollectSet should be marked as non-deterministic Key: SPARK-17182 URL: https://issues.apache.org/jira/browse/SPARK-17182 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Cheng Lian Assignee: Cheng Lian {{CollectList}} and {{CollectSet}} should be marked as non-deterministic since their results depend on the actual order of input rows. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
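The order dependence is easy to demonstrate: the same aggregation over differently ordered input produces differently ordered arrays. A quick Scala illustration (assumes a SparkSession named {{spark}}; the commented outputs are typical but not guaranteed, which is exactly the point):
{code}
import spark.implicits._
import org.apache.spark.sql.functions.collect_list

val df = Seq(("a", 1), ("a", 2), ("a", 3)).toDF("k", "v")

df.sortWithinPartitions($"v".asc)
  .groupBy($"k").agg(collect_list($"v")).show()   // typically [1, 2, 3]

df.sortWithinPartitions($"v".desc)
  .groupBy($"k").agg(collect_list($"v")).show()   // typically [3, 2, 1]
{code}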
[jira] [Commented] (SPARK-17182) CollectList and CollectSet should be marked as non-deterministic
[ https://issues.apache.org/jira/browse/SPARK-17182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430532#comment-15430532 ] Apache Spark commented on SPARK-17182: -- User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/14749 > CollectList and CollectSet should be marked as non-deterministic > > > Key: SPARK-17182 > URL: https://issues.apache.org/jira/browse/SPARK-17182 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > {{CollectList}} and {{CollectSet}} should be marked as non-deterministic > since their results depend on the actual order of input rows. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17182) CollectList and CollectSet should be marked as non-deterministic
[ https://issues.apache.org/jira/browse/SPARK-17182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17182: Assignee: Apache Spark (was: Cheng Lian) > CollectList and CollectSet should be marked as non-deterministic > > > Key: SPARK-17182 > URL: https://issues.apache.org/jira/browse/SPARK-17182 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Apache Spark > > {{CollectList}} and {{CollectSet}} should be marked as non-deterministic > since their results depend on the actual order of input rows. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17182) CollectList and CollectSet should be marked as non-deterministic
[ https://issues.apache.org/jira/browse/SPARK-17182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17182: Assignee: Cheng Lian (was: Apache Spark) > CollectList and CollectSet should be marked as non-deterministic > > > Key: SPARK-17182 > URL: https://issues.apache.org/jira/browse/SPARK-17182 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > {{CollectList}} and {{CollectSet}} should be marked as non-deterministic > since their results depend on the actual order of input rows. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11215) Add multiple columns support to StringIndexer
[ https://issues.apache.org/jira/browse/SPARK-11215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang reassigned SPARK-11215: --- Assignee: Yanbo Liang > Add multiple columns support to StringIndexer > - > > Key: SPARK-11215 > URL: https://issues.apache.org/jira/browse/SPARK-11215 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang > > Add multiple columns support to StringIndexer, then users can transform > multiple input columns to multiple output columns simultaneously. See > discussion SPARK-8418. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17181) [Spark2.0 web ui] The status of certain jobs is still displayed as running even if all the stages of the job have already finished
[ https://issues.apache.org/jira/browse/SPARK-17181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] marymwu updated SPARK-17181: Attachment: job1000-2.png job1000-1.png > [Spark2.0 web ui] The status of certain jobs is still displayed as running > even if all the stages of the job have already finished > --- > > Key: SPARK-17181 > URL: https://issues.apache.org/jira/browse/SPARK-17181 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0 >Reporter: marymwu >Priority: Minor > Attachments: job1000-1.png, job1000-2.png > > > [Spark2.0 web ui] The status of certain jobs is still displayed as running > even if all the stages of the job have already finished > Note: not sure what kind of jobs will encounter this problem > The following log shows that job 1000 has already finished, but on the Spark 2.0 > web UI the status of job 1000 is still displayed as running; see the attached > files > 16/08/22 16:01:29 INFO DAGScheduler: dag send msg, result task done, job: 1000 > 16/08/22 16:01:29 INFO DAGScheduler: Job 1000 finished: run at > AccessController.java:-2, took 4.664319 s -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7493) ALTER TABLE statement
[ https://issues.apache.org/jira/browse/SPARK-7493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430650#comment-15430650 ] Sergey Semichev commented on SPARK-7493: Good to know, thanks > ALTER TABLE statement > - > > Key: SPARK-7493 > URL: https://issues.apache.org/jira/browse/SPARK-7493 > Project: Spark > Issue Type: Bug > Components: SQL > Environment: Databricks cloud >Reporter: Sergey Semichev >Priority: Minor > > Full table name (database_name.table_name) cannot be used with "ALTER TABLE" > statement > It works with CREATE TABLE > "ALTER TABLE database_name.table_name ADD PARTITION (source_year='2014', > source_month='01')." > Error in SQL statement: java.lang.RuntimeException: > org.apache.spark.sql.AnalysisException: mismatched input 'ADD' expecting > KW_EXCHANGE near 'test_table' in alter exchange partition; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-7493) ALTER TABLE statement
[ https://issues.apache.org/jira/browse/SPARK-7493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Semichev closed SPARK-7493. -- Resolution: Fixed > ALTER TABLE statement > - > > Key: SPARK-7493 > URL: https://issues.apache.org/jira/browse/SPARK-7493 > Project: Spark > Issue Type: Bug > Components: SQL > Environment: Databricks cloud >Reporter: Sergey Semichev >Priority: Minor > > Full table name (database_name.table_name) cannot be used with "ALTER TABLE" > statement > It works with CREATE TABLE > "ALTER TABLE database_name.table_name ADD PARTITION (source_year='2014', > source_month='01')." > Error in SQL statement: java.lang.RuntimeException: > org.apache.spark.sql.AnalysisException: mismatched input 'ADD' expecting > KW_EXCHANGE near 'test_table' in alter exchange partition; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17183) put hive serde table schema to table properties like data source table
Wenchen Fan created SPARK-17183: --- Summary: put hive serde table schema to table properties like data source table Key: SPARK-17183 URL: https://issues.apache.org/jira/browse/SPARK-17183 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17183) put hive serde table schema to table properties like data source table
[ https://issues.apache.org/jira/browse/SPARK-17183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430718#comment-15430718 ] Apache Spark commented on SPARK-17183: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/14750 > put hive serde table schema to table properties like data source table > -- > > Key: SPARK-17183 > URL: https://issues.apache.org/jira/browse/SPARK-17183 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17183) put hive serde table schema to table properties like data source table
[ https://issues.apache.org/jira/browse/SPARK-17183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17183: Assignee: Wenchen Fan (was: Apache Spark) > put hive serde table schema to table properties like data source table > -- > > Key: SPARK-17183 > URL: https://issues.apache.org/jira/browse/SPARK-17183 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17183) put hive serde table schema to table properties like data source table
[ https://issues.apache.org/jira/browse/SPARK-17183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17183: Assignee: Apache Spark (was: Wenchen Fan) > put hive serde table schema to table properties like data source table > -- > > Key: SPARK-17183 > URL: https://issues.apache.org/jira/browse/SPARK-17183 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16914) NodeManager crash when spark are registering executor infomartion into leveldb
[ https://issues.apache.org/jira/browse/SPARK-16914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430725#comment-15430725 ] Thomas Graves commented on SPARK-16914: --- that is considered a fatal issue for the nodemanager and its expected to fail. hardware goes bad sometimes so you either make sure these paths are durable or the nodemanager is going to crash. not sure what other options you have here > NodeManager crash when spark are registering executor infomartion into leveldb > -- > > Key: SPARK-16914 > URL: https://issues.apache.org/jira/browse/SPARK-16914 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.6.2 >Reporter: cen yuhai > > {noformat} > Stack: [0x7fb5b53de000,0x7fb5b54df000], sp=0x7fb5b54dcba8, free > space=1018k > Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native > code) > C [libc.so.6+0x896b1] memcpy+0x11 > Java frames: (J=compiled Java code, j=interpreted, Vv=VM code) > j > org.fusesource.leveldbjni.internal.NativeDB$DBJNI.Put(JLorg/fusesource/leveldbjni/internal/NativeWriteOptions;Lorg/fusesource/leveldbjni/internal/NativeSlice;Lorg/fusesource/leveldbjni/internal/NativeSlice;)J+0 > j > org.fusesource.leveldbjni.internal.NativeDB.put(Lorg/fusesource/leveldbjni/internal/NativeWriteOptions;Lorg/fusesource/leveldbjni/internal/NativeSlice;Lorg/fusesource/leveldbjni/internal/NativeSlice;)V+11 > j > org.fusesource.leveldbjni.internal.NativeDB.put(Lorg/fusesource/leveldbjni/internal/NativeWriteOptions;Lorg/fusesource/leveldbjni/internal/NativeBuffer;Lorg/fusesource/leveldbjni/internal/NativeBuffer;)V+18 > j > org.fusesource.leveldbjni.internal.NativeDB.put(Lorg/fusesource/leveldbjni/internal/NativeWriteOptions;[B[B)V+36 > j > org.fusesource.leveldbjni.internal.JniDB.put([B[BLorg/iq80/leveldb/WriteOptions;)Lorg/iq80/leveldb/Snapshot;+28 > j org.fusesource.leveldbjni.internal.JniDB.put([B[B)V+10 > j > org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.registerExecutor(Ljava/lang/String;Ljava/lang/String;Lorg/apache/spark/network/shuffle/protocol/ExecutorShuffleInfo;)V+61 > J 8429 C2 > org.apache.spark.network.server.TransportRequestHandler.handle(Lorg/apache/spark/network/protocol/RequestMessage;)V > (100 bytes) @ 0x7fb5f27ff6cc [0x7fb5f27fdde0+0x18ec] > J 8371 C2 > org.apache.spark.network.server.TransportChannelHandler.channelRead0(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V > (10 bytes) @ 0x7fb5f242df20 [0x7fb5f242de80+0xa0] > J 6853 C2 > io.netty.channel.SimpleChannelInboundHandler.channelRead(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V > (74 bytes) @ 0x7fb5f215587c [0x7fb5f21557e0+0x9c] > J 5872 C2 > io.netty.handler.timeout.IdleStateHandler.channelRead(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V > (42 bytes) @ 0x7fb5f2183268 [0x7fb5f2183100+0x168] > J 5849 C2 > io.netty.handler.codec.MessageToMessageDecoder.channelRead(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V > (158 bytes) @ 0x7fb5f2191524 [0x7fb5f218f5a0+0x1f84] > J 5941 C2 > org.apache.spark.network.util.TransportFrameDecoder.channelRead(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V > (170 bytes) @ 0x7fb5f220a230 [0x7fb5f2209fc0+0x270] > J 7747 C2 io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read()V > (363 bytes) @ 0x7fb5f264465c [0x7fb5f2644140+0x51c] > J 8008% C2 io.netty.channel.nio.NioEventLoop.run()V (162 bytes) @ > 0x7fb5f26f6764 [0x7fb5f26f63c0+0x3a4] > j 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run()V+13 > j java.lang.Thread.run()V+11 > v ~StubRoutines::call_stub > {noformat} > The target code in spark is in ExternalShuffleBlockResolver > {code} > /** Registers a new Executor with all the configuration we need to find its > shuffle files. */ > public void registerExecutor( > String appId, > String execId, > ExecutorShuffleInfo executorInfo) { > AppExecId fullId = new AppExecId(appId, execId); > logger.info("Registered executor {} with {}", fullId, executorInfo); > try { > if (db != null) { > byte[] key = dbAppExecKey(fullId); > byte[] value = > mapper.writeValueAsString(executorInfo).getBytes(Charsets.UTF_8); > db.put(key, value); > } > } catch (Exception e) { > logger.error("Error saving registered executors", e); > } > executors.put(fullId, executorInfo); > } > {code} > There is a problem with disk1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-
[jira] [Created] (SPARK-17184) Replace ByteBuf with InputStream
Guoqiang Li created SPARK-17184: --- Summary: Replace ByteBuf with InputStream Key: SPARK-17184 URL: https://issues.apache.org/jira/browse/SPARK-17184 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Guoqiang Li The size of a ByteBuf cannot be greater than 2G; it should be replaced by an InputStream. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
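The intended direction can be sketched generically: consume a large payload as a stream of bounded chunks rather than as one buffer whose addressable size is capped at Integer.MAX_VALUE bytes. Plain-JDK Scala sketch; the file path is made up for illustration and this is not the proposed Spark change itself.
{code}
import java.io.{BufferedInputStream, FileInputStream, InputStream}

// Hand a >2G payload to a consumer in bounded chunks instead of one buffer.
def forEachChunk(in: InputStream, chunkSize: Int)(f: (Array[Byte], Int) => Unit): Unit = {
  val buf = new Array[Byte](chunkSize)
  var n = in.read(buf)
  while (n != -1) {
    f(buf, n)          // each call sees at most chunkSize valid bytes
    n = in.read(buf)
  }
}

val in = new BufferedInputStream(new FileInputStream("/tmp/huge.bin"))  // hypothetical path
try forEachChunk(in, 4 * 1024 * 1024) { (chunk, len) => () /* e.g. write to a channel */ }
finally in.close()
{code}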
[jira] [Created] (SPARK-17185) Unify naming of API for RDD and Dataset
Xiang Gao created SPARK-17185: - Summary: Unify naming of API for RDD and Dataset Key: SPARK-17185 URL: https://issues.apache.org/jira/browse/SPARK-17185 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Reporter: Xiang Gao In RDD, groupByKey is used to generate a key-list pair and aggregateByKey is used to do aggregation. In Dataset, aggregation is done by groupBy and groupByKey, and no API for key-list pairs is provided. The same name "groupBy" is designed to do different things, and this might be confusing. Besides, it would be more convenient to provide an API to generate key-list pairs for Dataset. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
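The asymmetry in one place (Scala; assumes {{spark}}, {{sc}} and {{import spark.implicits._}}): RDD gives key-list pairs directly via {{groupByKey}}, while the Dataset route goes through {{groupByKey}} plus {{mapGroups}}.
{code}
import spark.implicits._

val pairs = Seq(("a", 1), ("a", 2), ("b", 3))

// RDD: groupByKey yields (key, values) directly.
val byKeyRdd = sc.parallelize(pairs).groupByKey()        // (a, [1, 2]), (b, [3])

// Dataset: no direct equivalent; build the list through mapGroups.
val byKeyDs = pairs.toDS()
  .groupByKey(_._1)
  .mapGroups { case (k, vs) => (k, vs.map(_._2).toList) }
{code}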
[jira] [Updated] (SPARK-17185) Unify naming of API for RDD and Dataset
[ https://issues.apache.org/jira/browse/SPARK-17185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-17185: -- Priority: Minor (was: Major) I know what you mean, but is there a way to do this without changing the API? I'm not sure there is. I am not sure it's worth changing the API for this. > Unify naming of API for RDD and Dataset > --- > > Key: SPARK-17185 > URL: https://issues.apache.org/jira/browse/SPARK-17185 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Reporter: Xiang Gao >Priority: Minor > > In RDD, groupByKey is used to generate a key-list pair and aggregateByKey is > used to do aggregation. > In Dataset, aggregation is done by groupBy and groupByKey, and no API for > key-list pairs is provided. > The same name "groupBy" is designed to do different things, and this might be > confusing. Besides, it would be more convenient to provide an API to generate > key-list pairs for Dataset. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17184) Replace ByteBuf with InputStream
[ https://issues.apache.org/jira/browse/SPARK-17184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430753#comment-15430753 ] Sean Owen commented on SPARK-17184: --- Before opening JIRAs, could you please respond to the request for a brief design doc? This is a significant change, so we'd want to get some consensus on the approach before you go do the work, and also before recording JIRA tasks that seem to indicate a particular direction. I'm not sure that's agreed yet. > Replace ByteBuf with InputStream > > > Key: SPARK-17184 > URL: https://issues.apache.org/jira/browse/SPARK-17184 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Guoqiang Li > > The size of a ByteBuf cannot be greater than 2G, so it should be replaced by > an InputStream. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17186) remove catalog table type INDEX
Wenchen Fan created SPARK-17186: --- Summary: remove catalog table type INDEX Key: SPARK-17186 URL: https://issues.apache.org/jira/browse/SPARK-17186 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17184) Replace ByteBuf with InputStream
[ https://issues.apache.org/jira/browse/SPARK-17184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17184: Assignee: Apache Spark > Replace ByteBuf with InputStream > > > Key: SPARK-17184 > URL: https://issues.apache.org/jira/browse/SPARK-17184 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Guoqiang Li >Assignee: Apache Spark > > The size of a ByteBuf cannot be greater than 2G, so it should be replaced by > an InputStream. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17184) Replace ByteBuf with InputStream
[ https://issues.apache.org/jira/browse/SPARK-17184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430773#comment-15430773 ] Apache Spark commented on SPARK-17184: -- User 'witgo' has created a pull request for this issue: https://github.com/apache/spark/pull/14751 > Replace ByteBuf with InputStream > > > Key: SPARK-17184 > URL: https://issues.apache.org/jira/browse/SPARK-17184 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Guoqiang Li > > The size of a ByteBuf cannot be greater than 2G, so it should be replaced by > an InputStream. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17184) Replace ByteBuf with InputStream
[ https://issues.apache.org/jira/browse/SPARK-17184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17184: Assignee: (was: Apache Spark) > Replace ByteBuf with InputStream > > > Key: SPARK-17184 > URL: https://issues.apache.org/jira/browse/SPARK-17184 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Guoqiang Li > > The size of a ByteBuf cannot be greater than 2G, so it should be replaced by > an InputStream. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17178) Allow to set sparkr shell command through --conf
[ https://issues.apache.org/jira/browse/SPARK-17178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17178: Assignee: (was: Apache Spark) > Allow to set sparkr shell command through --conf > > > Key: SPARK-17178 > URL: https://issues.apache.org/jira/browse/SPARK-17178 > Project: Spark > Issue Type: Improvement > Components: Spark Submit, SparkR >Affects Versions: 2.0.0 >Reporter: Jeff Zhang > > For now the only way to set the sparkr shell command is through the environment > variable SPARKR_DRIVER_R, which is not so convenient. The configurations > spark.r.command and spark.r.driver.command are for specifying the R command for > running R scripts. So I think it is natural to provide a configuration for the > sparkr shell command as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17178) Allow to set sparkr shell command through --conf
[ https://issues.apache.org/jira/browse/SPARK-17178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17178: Assignee: Apache Spark > Allow to set sparkr shell command through --conf > > > Key: SPARK-17178 > URL: https://issues.apache.org/jira/browse/SPARK-17178 > Project: Spark > Issue Type: Improvement > Components: Spark Submit, SparkR >Affects Versions: 2.0.0 >Reporter: Jeff Zhang >Assignee: Apache Spark > > For now the only way to set the sparkr shell command is through the environment > variable SPARKR_DRIVER_R, which is not so convenient. The configurations > spark.r.command and spark.r.driver.command are for specifying the R command for > running R scripts. So I think it is natural to provide a configuration for the > sparkr shell command as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17178) Allow to set sparkr shell command through --conf
[ https://issues.apache.org/jira/browse/SPARK-17178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430787#comment-15430787 ] Apache Spark commented on SPARK-17178: -- User 'zjffdu' has created a pull request for this issue: https://github.com/apache/spark/pull/14744 > Allow to set sparkr shell command through --conf > > > Key: SPARK-17178 > URL: https://issues.apache.org/jira/browse/SPARK-17178 > Project: Spark > Issue Type: Improvement > Components: Spark Submit, SparkR >Affects Versions: 2.0.0 >Reporter: Jeff Zhang > > For now the only way to set the sparkr shell command is through the environment > variable SPARKR_DRIVER_R, which is not so convenient. The configurations > spark.r.command and spark.r.driver.command are for specifying the R command for > running R scripts. So I think it is natural to provide a configuration for the > sparkr shell command as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17186) remove catalog table type INDEX
[ https://issues.apache.org/jira/browse/SPARK-17186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17186: Assignee: Wenchen Fan (was: Apache Spark) > remove catalog table type INDEX > --- > > Key: SPARK-17186 > URL: https://issues.apache.org/jira/browse/SPARK-17186 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17186) remove catalog table type INDEX
[ https://issues.apache.org/jira/browse/SPARK-17186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430813#comment-15430813 ] Apache Spark commented on SPARK-17186: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/14752 > remove catalog table type INDEX > --- > > Key: SPARK-17186 > URL: https://issues.apache.org/jira/browse/SPARK-17186 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17186) remove catalog table type INDEX
[ https://issues.apache.org/jira/browse/SPARK-17186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17186: Assignee: Apache Spark (was: Wenchen Fan) > remove catalog table type INDEX > --- > > Key: SPARK-17186 > URL: https://issues.apache.org/jira/browse/SPARK-17186 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7493) ALTER TABLE statement
[ https://issues.apache.org/jira/browse/SPARK-7493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430821#comment-15430821 ] Dongjoon Hyun commented on SPARK-7493: -- Thank you! > ALTER TABLE statement > - > > Key: SPARK-7493 > URL: https://issues.apache.org/jira/browse/SPARK-7493 > Project: Spark > Issue Type: Bug > Components: SQL > Environment: Databricks cloud >Reporter: Sergey Semichev >Priority: Minor > > A full table name (database_name.table_name) cannot be used with the "ALTER TABLE" > statement, though it works with CREATE TABLE: > "ALTER TABLE database_name.table_name ADD PARTITION (source_year='2014', > source_month='01')." > Error in SQL statement: java.lang.RuntimeException: > org.apache.spark.sql.AnalysisException: mismatched input 'ADD' expecting > KW_EXCHANGE near 'test_table' in alter exchange partition; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17181) [Spark2.0 web ui] The status of certain jobs is still displayed as running even if all the stages of the job have already finished
[ https://issues.apache.org/jira/browse/SPARK-17181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430855#comment-15430855 ] Thomas Graves commented on SPARK-17181: --- Check your log files to see if you see something like: logError("Dropping SparkListenerEvent because no remaining room in event queue. " + "This likely means one of the SparkListeners is too slow and cannot keep up with " + "the rate at which tasks are being started by the scheduler.") If so, the event listener queue dropped some events, and the UI can get out of sync when that happens. > [Spark2.0 web ui] The status of certain jobs is still displayed as running > even if all the stages of the job have already finished > --- > > Key: SPARK-17181 > URL: https://issues.apache.org/jira/browse/SPARK-17181 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0 >Reporter: marymwu >Priority: Minor > Attachments: job1000-1.png, job1000-2.png > > > [Spark2.0 web ui] The status of certain jobs is still displayed as running > even if all the stages of the job have already finished > Note: not sure what kind of jobs will encounter this problem > The following log shows that job 1000 has already been done, but on the Spark 2.0 > web UI the status of job 1000 is still displayed as running; see the attached > files > 16/08/22 16:01:29 INFO DAGScheduler: dag send msg, result task done, job: 1000 > 16/08/22 16:01:29 INFO DAGScheduler: Job 1000 finished: run at > AccessController.java:-2, took 4.664319 s -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
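For intuition, the listener bus is a bounded queue drained by listener threads; here is a toy Scala model (not Spark's actual implementation) of why one slow listener forces events to be dropped:
{code}
import java.util.concurrent.ArrayBlockingQueue

// Bounded event queue: posts are non-blocking, so once a slow consumer
// lets the queue fill up, further events are simply dropped.
val queue = new ArrayBlockingQueue[String](10000)

def post(event: String): Unit = {
  if (!queue.offer(event)) {
    println("Dropping SparkListenerEvent because no remaining room in event queue.")
  }
}
{code}
When a job-end event is among those dropped, the web UI never learns that the job finished, which matches the symptom reported here.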
[jira] [Commented] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets
[ https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430872#comment-15430872 ] Cody Koeninger commented on SPARK-17147: My point is more that this probably isn't just two lines in CachedKafkaConsumer. There's other code, both within the spark streaming connector and in users of the connector, that assumes an offset range from..until has a number of messages equal to (until - from). I haven't seen what databricks is coming up with for the structured streaming connector, but I'd imagine that an assumption that offsets are contiguous would certainly simplify that implementation, and might actually be necessary depending on how recovery works. This might be as simple as your change plus logging a warning when a stream starts on a compacted topic, but we need to think through the issues here. > Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets > > > Key: SPARK-17147 > URL: https://issues.apache.org/jira/browse/SPARK-17147 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 2.0.0 >Reporter: Robert Conrad > > When Kafka does log compaction, offsets often end up with gaps, meaning the > next requested offset will frequently not be offset+1. The logic in > KafkaRDD & CachedKafkaConsumer has a baked-in assumption that the next offset > will always be just an increment of 1 above the previous offset. > I have worked around this problem by changing CachedKafkaConsumer to use the > returned record's offset, from: > {{nextOffset = offset + 1}} > to: > {{nextOffset = record.offset + 1}} > and changed KafkaRDD from: > {{requestOffset += 1}} > to: > {{requestOffset = r.offset() + 1}} > (I also had to change some assert logic in CachedKafkaConsumer). > There's a strong possibility that I have misconstrued how to use the > streaming kafka consumer, and I'm happy to close this out if that's the case. > If, however, it is supposed to support non-consecutive offsets (e.g. due to > log compaction) I am also happy to contribute a PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
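A sketch of the gap-tolerant iteration the reporter describes (hedged: {{fetch}} stands in for the cached consumer's fetch-at-offset call, and all names here are illustrative, not the connector's API):
{code}
import org.apache.kafka.clients.consumer.ConsumerRecord

// Offsets in [fromOffset, untilOffset) may have holes on a compacted topic,
// so advance from the offset the broker actually returned, not by +1 blindly.
def drain[K, V](
    fromOffset: Long,
    untilOffset: Long,
    fetch: Long => ConsumerRecord[K, V],
    process: ConsumerRecord[K, V] => Unit): Unit = {
  var nextOffset = fromOffset
  while (nextOffset < untilOffset) {
    val record = fetch(nextOffset)
    process(record)
    nextOffset = record.offset + 1 // tolerates gaps; nextOffset + 1 would not
  }
}
{code}
Note Cody's caveat still applies: any code that assumes the range contains exactly (until - from) messages stays wrong on compacted topics even with this loop.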
[jira] [Comment Edited] (SPARK-14948) Exception when joining DataFrames derived from the same DataFrame
[ https://issues.apache.org/jira/browse/SPARK-14948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430881#comment-15430881 ] Wenchen Fan edited comment on SPARK-14948 at 8/22/16 2:38 PM: -- Can you check it with 2.0? I converted your code snippet into a Scala version and tried it with the latest code; it works. was (Author: cloud_fan): Can you double-check it? I converted your code snippet into a Scala version and tried it with the latest code; it works. > Exception when joining DataFrames derived from the same DataFrame > - > > Key: SPARK-14948 > URL: https://issues.apache.org/jira/browse/SPARK-14948 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Saurabh Santhosh > > h2. The Spark Analyser is throwing the following exception in a specific scenario: > h2. Exception: > org.apache.spark.sql.AnalysisException: resolved attribute(s) F1#3 missing > from asd#5,F2#4,F1#6,F2#7 in operator !Project [asd#5,F1#3]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) > h2. Code:
> {code:title=SparkClient.java|borderStyle=solid}
> StructField[] fields = new StructField[2];
> fields[0] = new StructField("F1", DataTypes.StringType, true, Metadata.empty());
> fields[1] = new StructField("F2", DataTypes.StringType, true, Metadata.empty());
> JavaRDD<Row> rdd = sparkClient.getJavaSparkContext().parallelize(Arrays.asList(RowFactory.create("a", "b")));
> DataFrame df = sparkClient.getSparkHiveContext().createDataFrame(rdd, new StructType(fields));
> sparkClient.getSparkHiveContext().registerDataFrameAsTable(df, "t1");
> DataFrame aliasedDf = sparkClient.getSparkHiveContext().sql("select F1 as asd, F2 from t1");
> sparkClient.getSparkHiveContext().registerDataFrameAsTable(aliasedDf, "t2");
> sparkClient.getSparkHiveContext().registerDataFrameAsTable(df, "t3");
> DataFrame join = aliasedDf.join(df, aliasedDf.col("F2").equalTo(df.col("F2")), "inner");
> DataFrame select = join.select(aliasedDf.col("asd"), df.col("F1"));
> select.collect();
> {code}
> h2. Observations: > * This issue is related to the data type of the fields of the initial DataFrame. (If the data type is not String, it will work.) > * It works fine if the DataFrame is registered as a temporary table and a SQL query (select a.asd,b.F1 from t2 a inner join t3 b on a.F2=b.F2) is written. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14948) Exception when joining DataFrames derived from the same DataFrame
[ https://issues.apache.org/jira/browse/SPARK-14948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430881#comment-15430881 ] Wenchen Fan commented on SPARK-14948: - Can you double-check it? I converted your code snippet into a Scala version and tried it with the latest code; it works. > Exception when joining DataFrames derived from the same DataFrame > - > > Key: SPARK-14948 > URL: https://issues.apache.org/jira/browse/SPARK-14948 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Saurabh Santhosh > > h2. The Spark Analyser is throwing the following exception in a specific scenario: > h2. Exception: > org.apache.spark.sql.AnalysisException: resolved attribute(s) F1#3 missing > from asd#5,F2#4,F1#6,F2#7 in operator !Project [asd#5,F1#3]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) > h2. Code:
> {code:title=SparkClient.java|borderStyle=solid}
> StructField[] fields = new StructField[2];
> fields[0] = new StructField("F1", DataTypes.StringType, true, Metadata.empty());
> fields[1] = new StructField("F2", DataTypes.StringType, true, Metadata.empty());
> JavaRDD<Row> rdd = sparkClient.getJavaSparkContext().parallelize(Arrays.asList(RowFactory.create("a", "b")));
> DataFrame df = sparkClient.getSparkHiveContext().createDataFrame(rdd, new StructType(fields));
> sparkClient.getSparkHiveContext().registerDataFrameAsTable(df, "t1");
> DataFrame aliasedDf = sparkClient.getSparkHiveContext().sql("select F1 as asd, F2 from t1");
> sparkClient.getSparkHiveContext().registerDataFrameAsTable(aliasedDf, "t2");
> sparkClient.getSparkHiveContext().registerDataFrameAsTable(df, "t3");
> DataFrame join = aliasedDf.join(df, aliasedDf.col("F2").equalTo(df.col("F2")), "inner");
> DataFrame select = join.select(aliasedDf.col("asd"), df.col("F1"));
> select.collect();
> {code}
> h2. Observations: > * This issue is related to the data type of the fields of the initial DataFrame. (If the data type is not String, it will work.) > * It works fine if the DataFrame is registered as a temporary table and a SQL query (select a.asd,b.F1 from t2 a inner join t3 b on a.F2=b.F2) is written. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17185) Unify naming of API for RDD and Dataset
[ https://issues.apache.org/jira/browse/SPARK-17185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430897#comment-15430897 ] Xiang Gao commented on SPARK-17185: --- Changing the API is a bad idea and we should not do this. Maybe these changes might help (I'm not sure): * add {{aggregateByKey}} and {{aggregateBy}} to {{Dataset}}, which do exactly the same things as {{groupByKey}} and {{groupBy}} do now. * The return values of {{aggregateByKey}} and {{aggregateBy}} should be two new classes: {{KeyValueAggregatedDataset}} and {{RelationalAggregatedDataset}}, which are copies of {{KeyValueGroupedDataset}} and {{RelationalGroupedDataset}} as they are now. * add new methods to get a key-list pair to the classes {{KeyValueGroupedDataset}} and {{RelationalAggregatedDataset}}, and maybe deprecate the methods that do aggregation in these two classes > Unify naming of API for RDD and Dataset > --- > > Key: SPARK-17185 > URL: https://issues.apache.org/jira/browse/SPARK-17185 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Reporter: Xiang Gao >Priority: Minor > > In RDD, groupByKey is used to generate a key-list pair and aggregateByKey is > used to do aggregation. > In Dataset, aggregation is done by groupBy and groupByKey, and no API for > key-list pairs is provided. > The same name "groupBy" is designed to do different things and this might be > confusing. Besides, it would be more convenient to provide an API to generate > key-list pairs for Dataset. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14948) Exception when joining DataFrames derived from the same DataFrame
[ https://issues.apache.org/jira/browse/SPARK-14948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430907#comment-15430907 ] Wenchen Fan commented on SPARK-14948: - Actually, `registerDataFrameAsTable` registers the DataFrame as a temp table; see the documentation of this method:
{code}
/**
 * Registers the given [[DataFrame]] as a temporary table in the catalog. Temporary tables exist
 * only during the lifetime of this instance of SQLContext.
 */
private[sql] def registerDataFrameAsTable(df: DataFrame, tableName: String): Unit = {
  catalog.registerTable(TableIdentifier(tableName), df.logicalPlan)
}
{code}
> Exception when joining DataFrames derived from the same DataFrame > - > > Key: SPARK-14948 > URL: https://issues.apache.org/jira/browse/SPARK-14948 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Saurabh Santhosh > > h2. The Spark Analyser is throwing the following exception in a specific scenario: > h2. Exception: > org.apache.spark.sql.AnalysisException: resolved attribute(s) F1#3 missing > from asd#5,F2#4,F1#6,F2#7 in operator !Project [asd#5,F1#3]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) > h2. Code:
> {code:title=SparkClient.java|borderStyle=solid}
> StructField[] fields = new StructField[2];
> fields[0] = new StructField("F1", DataTypes.StringType, true, Metadata.empty());
> fields[1] = new StructField("F2", DataTypes.StringType, true, Metadata.empty());
> JavaRDD<Row> rdd = sparkClient.getJavaSparkContext().parallelize(Arrays.asList(RowFactory.create("a", "b")));
> DataFrame df = sparkClient.getSparkHiveContext().createDataFrame(rdd, new StructType(fields));
> sparkClient.getSparkHiveContext().registerDataFrameAsTable(df, "t1");
> DataFrame aliasedDf = sparkClient.getSparkHiveContext().sql("select F1 as asd, F2 from t1");
> sparkClient.getSparkHiveContext().registerDataFrameAsTable(aliasedDf, "t2");
> sparkClient.getSparkHiveContext().registerDataFrameAsTable(df, "t3");
> DataFrame join = aliasedDf.join(df, aliasedDf.col("F2").equalTo(df.col("F2")), "inner");
> DataFrame select = join.select(aliasedDf.col("asd"), df.col("F1"));
> select.collect();
> {code}
> h2. Observations: > * This issue is related to the data type of the fields of the initial DataFrame. (If the data type is not String, it will work.) > * It works fine if the DataFrame is registered as a temporary table and a SQL query (select a.asd,b.F1 from t2 a inner join t3 b on a.F2=b.F2) is written. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
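For reference, a possible Scala rendering of the reported snippet (a sketch assuming a Spark 2.0 {{SparkSession}} named {{spark}}; not necessarily the exact conversion Wenchen ran):
{code}
import spark.implicits._

val df = Seq(("a", "b")).toDF("F1", "F2")
df.createOrReplaceTempView("t1")

val aliasedDf = spark.sql("select F1 as asd, F2 from t1")
val join = aliasedDf.join(df, aliasedDf.col("F2") === df.col("F2"), "inner")
val select = join.select(aliasedDf.col("asd"), df.col("F1"))
select.collect() // succeeds on the latest code, per the comment above
{code}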
[jira] [Updated] (SPARK-17185) Unify naming of API for RDD and Dataset
[ https://issues.apache.org/jira/browse/SPARK-17185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiang Gao updated SPARK-17185: -- Description: In {{RDD}}, {{groupByKey}} is used to generate a key-list pair and {{aggregateByKey}} is used to do aggregation. In {{Dataset}}, aggregation is done by {{groupBy}} and {{groupByKey}}, and no API for key-list pairs is provided. The same name {{groupBy}} is designed to do different things and this might be confusing. Besides, it would be more convenient to provide an API to generate key-list pairs for {{Dataset}}. was: In RDD, groupByKey is used to generate a key-list pair and aggregateByKey is used to do aggregation. In Dataset, aggregation is done by groupBy and groupByKey, and no API for key-list pairs is provided. The same name "groupBy" is designed to do different things and this might be confusing. Besides, it would be more convenient to provide an API to generate key-list pairs for Dataset. > Unify naming of API for RDD and Dataset > --- > > Key: SPARK-17185 > URL: https://issues.apache.org/jira/browse/SPARK-17185 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Reporter: Xiang Gao >Priority: Minor > > In {{RDD}}, {{groupByKey}} is used to generate a key-list pair and > {{aggregateByKey}} is used to do aggregation. > In {{Dataset}}, aggregation is done by {{groupBy}} and {{groupByKey}}, and no > API for key-list pairs is provided. > The same name {{groupBy}} is designed to do different things and this might > be confusing. Besides, it would be more convenient to provide an API to > generate key-list pairs for {{Dataset}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17185) Unify naming of API for RDD and Dataset
[ https://issues.apache.org/jira/browse/SPARK-17185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430897#comment-15430897 ] Xiang Gao edited comment on SPARK-17185 at 8/22/16 2:59 PM: Changing the API is a bad idea and we should not do this. Maybe these changes might help (I'm not sure): * add {{aggregateByKey}} and {{aggregateBy}} to {{Dataset}}, which do exactly the same things as {{groupByKey}} and {{groupBy}} do now. * The return values of {{aggregateByKey}} and {{aggregateBy}} should be two new classes: {{KeyValueAggregatedDataset}} and {{RelationalAggregatedDataset}}, which are copies of {{KeyValueGroupedDataset}} and {{RelationalGroupedDataset}} as they are now. * add new methods to get a key-list pair to the classes {{KeyValueGroupedDataset}} and {{RelationalGroupedDataset}}, and maybe deprecate the methods that do aggregation in these two classes was (Author: zasdfgbnm): Changing the API is a bad idea and we should not do this. Maybe these changes might help (I'm not sure): * add {{aggregateByKey}} and {{aggregateBy}} to {{Dataset}}, which do exactly the same things as {{groupByKey}} and {{groupBy}} do now. * The return values of {{aggregateByKey}} and {{aggregateBy}} should be two new classes: {{KeyValueAggregatedDataset}} and {{RelationalAggregatedDataset}}, which are copies of {{KeyValueGroupedDataset}} and {{RelationalGroupedDataset}} as they are now. * add new methods to get a key-list pair to the classes {{KeyValueGroupedDataset}} and {{RelationalAggregatedDataset}}, and maybe deprecate the methods that do aggregation in these two classes > Unify naming of API for RDD and Dataset > --- > > Key: SPARK-17185 > URL: https://issues.apache.org/jira/browse/SPARK-17185 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Reporter: Xiang Gao >Priority: Minor > > In {{RDD}}, {{groupByKey}} is used to generate a key-list pair and > {{aggregateByKey}} is used to do aggregation. > In {{Dataset}}, aggregation is done by {{groupBy}} and {{groupByKey}}, and no > API for key-list pairs is provided. > The same name {{groupBy}} is designed to do different things and this might > be confusing. Besides, it would be more convenient to provide an API to > generate key-list pairs for {{Dataset}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17187) Support using arbitrary Java object as internal aggregation buffer object
Sean Zhong created SPARK-17187: -- Summary: Support using arbitrary Java object as internal aggregation buffer object Key: SPARK-17187 URL: https://issues.apache.org/jira/browse/SPARK-17187 Project: Spark Issue Type: New Feature Components: SQL Reporter: Sean Zhong *Background* For aggregation functions like sum and count, Spark-Sql internally uses an aggregation buffer to store the intermediate aggregation result for all aggregation functions. Each aggregation function will occupy a section of the aggregation buffer. *Problem* Currently, Spark-sql only allows a small set of Spark-Sql supported storage data types to be stored in the aggregation buffer, which is not very convenient or performant; there are several typical cases: 1. If the aggregation has a complex model like CountMinSketch, it is not very easy to convert the complex model so that it can be stored with the limited Spark-sql supported data types. 2. It is hard to reuse aggregation class definitions defined in existing libraries like algebird. 3. It may introduce heavy serialization/deserialization cost when converting a domain model to a Spark sql supported data type. For example, the current implementation of `TypedAggregateExpression` requires serialization/de-serialization for each call of update or merge. *Proposal* We propose: 1. Introduces a TypedImperativeAggregate which allows using an arbitrary java object as the aggregation buffer, with requirements like: a. It is flexible enough that the API allows using any java object as the aggregation buffer, so that it is easier to integrate with existing Monoid libraries like algebird. b. We don't need to call serialize/deserialize for each call of update/merge. Instead, only a few serialization/deserialization operations are needed. This is to guarantee the performance. 2. Refactors `TypedAggregateExpression` to use this new interface, to get higher performance. 3. Implements Appro-Percentile and other aggregation functions which have a complex aggregation object with this new interface. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
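A rough Scala sketch of the interface shape being proposed (method names and signatures here are assumptions for illustration, not the final API):
{code}
import org.apache.spark.sql.catalyst.InternalRow

// The aggregation buffer is an arbitrary object T; serialization happens only
// at exchange boundaries (shuffle/spill), not on every update or merge call.
abstract class TypedImperativeAggregate[T] {
  def createAggregationBuffer(): T
  def update(buffer: T, input: InternalRow): T
  def merge(buffer: T, other: T): T
  def eval(buffer: T): Any
  def serialize(buffer: T): Array[Byte]
  def deserialize(bytes: Array[Byte]): T
}
{code}
This shape would let a Monoid-style object from a library like algebird serve directly as the buffer, with serialize/deserialize invoked only a handful of times per partition.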
[jira] [Updated] (SPARK-17187) Support using arbitrary Java object as internal aggregation buffer object
[ https://issues.apache.org/jira/browse/SPARK-17187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated SPARK-17187: --- Description: *Background* For aggregation functions like sum and count, Spark-Sql internally uses an aggregation buffer to store the intermediate aggregation result for all aggregation functions. Each aggregation function will occupy a section of the aggregation buffer. *Problem* Currently, Spark-sql only allows a small set of Spark-Sql supported storage data types to be stored in the aggregation buffer, which is not very convenient or performant; there are several typical cases: 1. If the aggregation has a complex model like CountMinSketch, it is not very easy to convert the complex model so that it can be stored with the limited Spark-sql supported data types. 2. It is hard to reuse aggregation class definitions defined in existing libraries like algebird. 3. It may introduce heavy serialization/deserialization cost when converting a domain model to a Spark sql supported data type. For example, the current implementation of `TypedAggregateExpression` requires serialization/de-serialization for each call of update or merge. *Proposal* We propose: 1. Introduces a TypedImperativeAggregate which allows using an arbitrary java object as the aggregation buffer, with requirements like: - It is flexible enough that the API allows using any java object as the aggregation buffer, so that it is easier to integrate with existing Monoid libraries like algebird. - We don't need to call serialize/deserialize for each call of update/merge. Instead, only a few serialization/deserialization operations are needed. This is to guarantee the performance. 2. Refactors `TypedAggregateExpression` to use this new interface, to get higher performance. 3. Implements Appro-Percentile and other aggregation functions which have a complex aggregation object with this new interface. was: *Background* For aggregation functions like sum and count, Spark-Sql internally uses an aggregation buffer to store the intermediate aggregation result for all aggregation functions. Each aggregation function will occupy a section of the aggregation buffer. *Problem* Currently, Spark-sql only allows a small set of Spark-Sql supported storage data types to be stored in the aggregation buffer, which is not very convenient or performant; there are several typical cases: 1. If the aggregation has a complex model like CountMinSketch, it is not very easy to convert the complex model so that it can be stored with the limited Spark-sql supported data types. 2. It is hard to reuse aggregation class definitions defined in existing libraries like algebird. 3. It may introduce heavy serialization/deserialization cost when converting a domain model to a Spark sql supported data type. For example, the current implementation of `TypedAggregateExpression` requires serialization/de-serialization for each call of update or merge. *Proposal* We propose: 1. Introduces a TypedImperativeAggregate which allows using an arbitrary java object as the aggregation buffer, with requirements like: - It is flexible enough that the API allows using any java object as the aggregation buffer, so that it is easier to integrate with existing Monoid libraries like algebird. - We don't need to call serialize/deserialize for each call of update/merge. Instead, only a few serialization/deserialization operations are needed. This is to guarantee the performance. 2. Refactors `TypedAggregateExpression` to use this new interface, to get higher performance. 3. 
Implements Appro-Percentile and other aggregation functions which have a complex aggregation object with this new interface. > Support using arbitrary Java object as internal aggregation buffer object > - > > Key: SPARK-17187 > URL: https://issues.apache.org/jira/browse/SPARK-17187 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Sean Zhong > > *Background* > For aggregation functions like sum and count, Spark-Sql internally uses an > aggregation buffer to store the intermediate aggregation result for all > aggregation functions. Each aggregation function will occupy a section of > the aggregation buffer. > *Problem* > Currently, Spark-sql only allows a small set of Spark-Sql supported storage > data types to be stored in the aggregation buffer, which is not very convenient > or performant; there are several typical cases: > 1. If the aggregation has a complex model like CountMinSketch, it is not very easy > to convert the complex model so that it can be stored with the limited Spark-sql > supported data types. > 2. It is hard to reuse aggregation class definitions defined in existing > libraries like algebird. > 3. It may introduce heavy serialization/deserializat
[jira] [Updated] (SPARK-17187) Support using arbitrary Java object as internal aggregation buffer object
[ https://issues.apache.org/jira/browse/SPARK-17187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated SPARK-17187: --- Description: *Background* For aggregation functions like sum and count, Spark-Sql internally uses an aggregation buffer to store the intermediate aggregation result for all aggregation functions. Each aggregation function will occupy a section of the aggregation buffer. *Problem* Currently, Spark-sql only allows a small set of Spark-Sql supported storage data types to be stored in the aggregation buffer, which is not very convenient or performant; there are several typical cases: 1. If the aggregation has a complex model like CountMinSketch, it is not very easy to convert the complex model so that it can be stored with the limited Spark-sql supported data types. 2. It is hard to reuse aggregation class definitions defined in existing libraries like algebird. 3. It may introduce heavy serialization/deserialization cost when converting a domain model to a Spark sql supported data type. For example, the current implementation of `TypedAggregateExpression` requires serialization/de-serialization for each call of update or merge. *Proposal* We propose: 1. Introduces a TypedImperativeAggregate which allows using an arbitrary java object as the aggregation buffer, with requirements like: - It is flexible enough that the API allows using any java object as the aggregation buffer, so that it is easier to integrate with existing Monoid libraries like algebird. - We don't need to call serialize/deserialize for each call of update/merge. Instead, only a few serialization/deserialization operations are needed. This is to guarantee the performance. 2. Refactors `TypedAggregateExpression` to use this new interface, to get higher performance. 3. Implements Appro-Percentile and other aggregation functions which have a complex aggregation object with this new interface. was: *Background* For aggregation functions like sum and count, Spark-Sql internally uses an aggregation buffer to store the intermediate aggregation result for all aggregation functions. Each aggregation function will occupy a section of the aggregation buffer. *Problem* Currently, Spark-sql only allows a small set of Spark-Sql supported storage data types to be stored in the aggregation buffer, which is not very convenient or performant; there are several typical cases: 1. If the aggregation has a complex model like CountMinSketch, it is not very easy to convert the complex model so that it can be stored with the limited Spark-sql supported data types. 2. It is hard to reuse aggregation class definitions defined in existing libraries like algebird. 3. It may introduce heavy serialization/deserialization cost when converting a domain model to a Spark sql supported data type. For example, the current implementation of `TypedAggregateExpression` requires serialization/de-serialization for each call of update or merge. *Proposal* We propose: 1. Introduces a TypedImperativeAggregate which allows using an arbitrary java object as the aggregation buffer, with requirements like: a. It is flexible enough that the API allows using any java object as the aggregation buffer, so that it is easier to integrate with existing Monoid libraries like algebird. b. We don't need to call serialize/deserialize for each call of update/merge. Instead, only a few serialization/deserialization operations are needed. This is to guarantee the performance. 2. Refactors `TypedAggregateExpression` to use this new interface, to get higher performance. 3. 
Implements Appro-Percentile and other aggregation functions which have a complex aggregation object with this new interface. > Support using arbitrary Java object as internal aggregation buffer object > - > > Key: SPARK-17187 > URL: https://issues.apache.org/jira/browse/SPARK-17187 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Sean Zhong > > *Background* > For aggregation functions like sum and count, Spark-Sql internally uses an > aggregation buffer to store the intermediate aggregation result for all > aggregation functions. Each aggregation function will occupy a section of > the aggregation buffer. > *Problem* > Currently, Spark-sql only allows a small set of Spark-Sql supported storage > data types to be stored in the aggregation buffer, which is not very convenient > or performant; there are several typical cases: > 1. If the aggregation has a complex model like CountMinSketch, it is not very easy > to convert the complex model so that it can be stored with the limited Spark-sql > supported data types. > 2. It is hard to reuse aggregation class definitions defined in existing > libraries like algebird. > 3. It may introduce heavy serialization/deserializat
[jira] [Commented] (SPARK-17172) pyspark hiveContext cannot create UDF: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
[ https://issues.apache.org/jira/browse/SPARK-17172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15431018#comment-15431018 ] Andrew Davidson commented on SPARK-17172: - Hi Sean, It should be very easy to use the attached notebook to reproduce the hive bug. I got the code example from a blog; the original code worked in spark 1.5.x. I also attached an html version of the notebook so you can see the entire stack trace without having to start jupyter. Thanks, Andy > pyspark hiveContext cannot create UDF: Py4JJavaError: An error occurred while > calling None.org.apache.spark.sql.hive.HiveContext. > -- > > Key: SPARK-17172 > URL: https://issues.apache.org/jira/browse/SPARK-17172 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.2 > Environment: spark version: 1.6.2 > python version: 3.4.2 (v3.4.2:ab2c023a9432, Oct 5 2014, 20:42:22) > [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] >Reporter: Andrew Davidson > Attachments: hiveUDFBug.html, hiveUDFBug.ipynb > >
> from pyspark.sql import HiveContext
> sqlContext = HiveContext(sc)
> # Define udf
> from pyspark.sql.functions import udf
> from pyspark.sql.types import StringType
> def scoreToCategory(score):
>     if score >= 80: return 'A'
>     elif score >= 60: return 'B'
>     elif score >= 35: return 'C'
>     else: return 'D'
>
> udfScoreToCategory=udf(scoreToCategory, StringType())
> throws exception > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext. > : java.lang.RuntimeException: java.lang.RuntimeException: Unable to > instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17172) pyspark hiveContext cannot create UDF: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
[ https://issues.apache.org/jira/browse/SPARK-17172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15431030#comment-15431030 ] Sean Owen commented on SPARK-17172: --- It shows this error: "You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly". I think that's the cause. Did you build with Hive support? (Despite the message, I think the more direct way to do it is -Phive.) > pyspark hiveContext cannot create UDF: Py4JJavaError: An error occurred while > calling None.org.apache.spark.sql.hive.HiveContext. > -- > > Key: SPARK-17172 > URL: https://issues.apache.org/jira/browse/SPARK-17172 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.2 > Environment: spark version: 1.6.2 > python version: 3.4.2 (v3.4.2:ab2c023a9432, Oct 5 2014, 20:42:22) > [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] >Reporter: Andrew Davidson > Attachments: hiveUDFBug.html, hiveUDFBug.ipynb > >
> from pyspark.sql import HiveContext
> sqlContext = HiveContext(sc)
> # Define udf
> from pyspark.sql.functions import udf
> from pyspark.sql.types import StringType
> def scoreToCategory(score):
>     if score >= 80: return 'A'
>     elif score >= 60: return 'B'
>     elif score >= 35: return 'C'
>     else: return 'D'
>
> udfScoreToCategory=udf(scoreToCategory, StringType())
> throws exception > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext. > : java.lang.RuntimeException: java.lang.RuntimeException: Unable to > instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17187) Support using arbitrary Java object as internal aggregation buffer object
[ https://issues.apache.org/jira/browse/SPARK-17187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15431034#comment-15431034 ] Apache Spark commented on SPARK-17187: -- User 'clockfly' has created a pull request for this issue: https://github.com/apache/spark/pull/14753 > Support using arbitrary Java object as internal aggregation buffer object > - > > Key: SPARK-17187 > URL: https://issues.apache.org/jira/browse/SPARK-17187 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Sean Zhong > > *Background* > For aggregation functions like sum and count, Spark-Sql internally uses an > aggregation buffer to store the intermediate aggregation result for all > aggregation functions. Each aggregation function will occupy a section of > the aggregation buffer. > *Problem* > Currently, Spark-sql only allows a small set of Spark-Sql supported storage > data types to be stored in the aggregation buffer, which is not very convenient > or performant; there are several typical cases: > 1. If the aggregation has a complex model like CountMinSketch, it is not very easy > to convert the complex model so that it can be stored with the limited Spark-sql > supported data types. > 2. It is hard to reuse aggregation class definitions defined in existing > libraries like algebird. > 3. It may introduce heavy serialization/deserialization cost when converting > a domain model to a Spark sql supported data type. For example, the current > implementation of `TypedAggregateExpression` requires > serialization/de-serialization for each call of update or merge. > *Proposal* > We propose: > 1. Introduces a TypedImperativeAggregate which allows using an arbitrary java > object as the aggregation buffer, with requirements like: > - It is flexible enough that the API allows using any java object as the > aggregation buffer, so that it is easier to integrate with existing Monoid > libraries like algebird. > - We don't need to call serialize/deserialize for each call of > update/merge. Instead, only a few serialization/deserialization operations > are needed. This is to guarantee the performance. > 2. Refactors `TypedAggregateExpression` to use this new interface, to get > higher performance. > 3. Implements Appro-Percentile and other aggregation functions which have a > complex aggregation object with this new interface. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17187) Support using arbitrary Java object as internal aggregation buffer object
[ https://issues.apache.org/jira/browse/SPARK-17187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17187: Assignee: (was: Apache Spark) > Support using arbitrary Java object as internal aggregation buffer object > - > > Key: SPARK-17187 > URL: https://issues.apache.org/jira/browse/SPARK-17187 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Sean Zhong > > *Background* > For aggregation functions like sum and count, Spark-Sql internally uses an > aggregation buffer to store the intermediate aggregation result for all > aggregation functions. Each aggregation function will occupy a section of > the aggregation buffer. > *Problem* > Currently, Spark-sql only allows a small set of Spark-Sql supported storage > data types to be stored in the aggregation buffer, which is not very convenient > or performant; there are several typical cases: > 1. If the aggregation has a complex model like CountMinSketch, it is not very easy > to convert the complex model so that it can be stored with the limited Spark-sql > supported data types. > 2. It is hard to reuse aggregation class definitions defined in existing > libraries like algebird. > 3. It may introduce heavy serialization/deserialization cost when converting > a domain model to a Spark sql supported data type. For example, the current > implementation of `TypedAggregateExpression` requires > serialization/de-serialization for each call of update or merge. > *Proposal* > We propose: > 1. Introduces a TypedImperativeAggregate which allows using an arbitrary java > object as the aggregation buffer, with requirements like: > - It is flexible enough that the API allows using any java object as the > aggregation buffer, so that it is easier to integrate with existing Monoid > libraries like algebird. > - We don't need to call serialize/deserialize for each call of > update/merge. Instead, only a few serialization/deserialization operations > are needed. This is to guarantee the performance. > 2. Refactors `TypedAggregateExpression` to use this new interface, to get > higher performance. > 3. Implements Appro-Percentile and other aggregation functions which have a > complex aggregation object with this new interface. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17164) Query with colon in the table name fails to parse in 2.0
[ https://issues.apache.org/jira/browse/SPARK-17164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15431031#comment-15431031 ] Sital Kedia commented on SPARK-17164: - Thanks [~rxin], [~hvanhovell], that makes sense. The issue is on our side. For Spark 1.6 we were using our internal version of the Hive parser, which supports table names with colons. Since Spark 2.0 no longer uses the Hive parser, this broke on our side. We will use backticks for such tables to work around this issue. > Query with colon in the table name fails to parse in 2.0 > > > Key: SPARK-17164 > URL: https://issues.apache.org/jira/browse/SPARK-17164 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Sital Kedia > > Running a simple query with a colon in the table name fails to parse in 2.0
> {code}
> == SQL ==
> SELECT * FROM a:b
> ---^^^
> at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
> at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
> at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:46)
> at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
> at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:682)
> ... 48 elided
> {code}
> Please note that this is a regression from Spark 1.6 as the query runs fine > in 1.6. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-17164) Query with colon in the table name fails to parse in 2.0
[ https://issues.apache.org/jira/browse/SPARK-17164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sital Kedia closed SPARK-17164. --- Resolution: Won't Fix > Query with colon in the table name fails to parse in 2.0 > > > Key: SPARK-17164 > URL: https://issues.apache.org/jira/browse/SPARK-17164 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Sital Kedia > > Running a simple query with a colon in the table name fails to parse in 2.0
> {code}
> == SQL ==
> SELECT * FROM a:b
> ---^^^
> at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
> at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
> at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:46)
> at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
> at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:682)
> ... 48 elided
> {code}
> Please note that this is a regression from Spark 1.6 as the query runs fine > in 1.6. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17187) Support using arbitrary Java object as internal aggregation buffer object
[ https://issues.apache.org/jira/browse/SPARK-17187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17187: Assignee: Apache Spark > Support using arbitrary Java object as internal aggregation buffer object > - > > Key: SPARK-17187 > URL: https://issues.apache.org/jira/browse/SPARK-17187 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Sean Zhong >Assignee: Apache Spark > > *Background* > For aggregation functions like sum and count, Spark-Sql internally uses an > aggregation buffer to store the intermediate aggregation result for all > aggregation functions. Each aggregation function will occupy a section of > the aggregation buffer. > *Problem* > Currently, Spark-sql only allows a small set of Spark-Sql supported storage > data types to be stored in the aggregation buffer, which is not very convenient > or performant; there are several typical cases: > 1. If the aggregation has a complex model like CountMinSketch, it is not very easy > to convert the complex model so that it can be stored with the limited Spark-sql > supported data types. > 2. It is hard to reuse aggregation class definitions defined in existing > libraries like algebird. > 3. It may introduce heavy serialization/deserialization cost when converting > a domain model to a Spark sql supported data type. For example, the current > implementation of `TypedAggregateExpression` requires > serialization/de-serialization for each call of update or merge. > *Proposal* > We propose: > 1. Introduces a TypedImperativeAggregate which allows using an arbitrary java > object as the aggregation buffer, with requirements like: > - It is flexible enough that the API allows using any java object as the > aggregation buffer, so that it is easier to integrate with existing Monoid > libraries like algebird. > - We don't need to call serialize/deserialize for each call of > update/merge. Instead, only a few serialization/deserialization operations > are needed. This is to guarantee the performance. > 2. Refactors `TypedAggregateExpression` to use this new interface, to get > higher performance. > 3. Implements Appro-Percentile and other aggregation functions which have a > complex aggregation object with this new interface. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16593) Provide a pre-fetch mechanism to accelerate shuffle stage.
[ https://issues.apache.org/jira/browse/SPARK-16593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15431070#comment-15431070 ] Biao Ma commented on SPARK-16593: - I have made new commits. > Provide a pre-fetch mechanism to accelerate shuffle stage. > -- > > Key: SPARK-16593 > URL: https://issues.apache.org/jira/browse/SPARK-16593 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Biao Ma >Priority: Minor > Labels: features > > Currently, the `NettyBlockRpcServer` reads data through the BlockManager; when > a block is not cached in memory, the data has to be read from DISK first and > then loaded into MEM. I wonder if we could add a message that contains the same > blockIds as the openBlock message but is sent one loop ahead, so that the > `NettyBlockRpcServer` can have the blocks ready for transfer to the reduce side. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
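A rough Scala sketch of the proposed shape (all names here are hypothetical, not Spark's actual network protocol):
{code}
// Hypothetical message sent one fetch round ahead of the real OpenBlocks
// request, asking the server to warm the listed blocks (e.g. page them in
// from disk) before they are actually transferred.
case class PrefetchBlocks(appId: String, execId: String, blockIds: Seq[String])

class PrefetchHandler(warmBlock: String => Unit) {
  def handle(msg: PrefetchBlocks): Unit =
    msg.blockIds.foreach(warmBlock) // touch each block so it is memory-resident
}
{code}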
[jira] [Updated] (SPARK-17188) Moves QuantileSummaries to project catalyst from sql so that it can be used to implement percentile_approx
[ https://issues.apache.org/jira/browse/SPARK-17188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated SPARK-17188: --- Description: org.apache.spark.sql.execution.stat > Moves QuantileSummaries to project catalyst from sql so that it can be used > to implement percentile_approx > -- > > Key: SPARK-17188 > URL: https://issues.apache.org/jira/browse/SPARK-17188 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Sean Zhong > > org.apache.spark.sql.execution.stat -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17188) Moves QuantileSummaries to project catalyst from sql so that it can be used to implement percentile_approx
Sean Zhong created SPARK-17188: -- Summary: Moves QuantileSummaries to project catalyst from sql so that it can be used to implement percentile_approx Key: SPARK-17188 URL: https://issues.apache.org/jira/browse/SPARK-17188 Project: Spark Issue Type: Sub-task Reporter: Sean Zhong -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17188) Moves QuantileSummaries to project catalyst from sql so that it can be used to implement percentile_approx
[ https://issues.apache.org/jira/browse/SPARK-17188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated SPARK-17188: --- Description: QuantileSummaries is a useful utility class for statistics. It can be used by aggregation functions like percentile_approx. Currently, QuantileSummaries is located in the sql project, in package org.apache.spark.sql.execution.stat; we should probably move it to the catalyst project, package org.apache.spark.sql.util. was:org.apache.spark.sql.execution.stat > Moves QuantileSummaries to project catalyst from sql so that it can be used > to implement percentile_approx > -- > > Key: SPARK-17188 > URL: https://issues.apache.org/jira/browse/SPARK-17188 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Sean Zhong > > QuantileSummaries is a useful utility class for statistics. It can be used > by aggregation functions like percentile_approx. > Currently, QuantileSummaries is located in the sql project, in package > org.apache.spark.sql.execution.stat; we should probably move it to the > catalyst project, package org.apache.spark.sql.util. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
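To make the intended use concrete, here is a hedged Scala sketch of how a percentile_approx-style aggregate could drive a quantile summary once it lives in catalyst. The insert/compress/query trait below follows the common quantile-sketch pattern and is an assumption, not the verbatim QuantileSummaries API:

{code}
// Assumed interface in the Greenwald-Khanna style; method names are
// assumptions, not the exact QuantileSummaries signatures.
trait QuantileSketch {
  def insert(value: Double): QuantileSketch // record one observation
  def compress(): QuantileSketch            // merge samples to bound memory use
  def query(quantile: Double): Double       // approximate quantile for q in [0, 1]
}

// How a percentile_approx-style aggregate could consume a partition of values.
object ApproxPercentileSketch {
  def approxPercentile(values: Iterator[Double], q: Double, empty: QuantileSketch): Double = {
    var sketch = empty
    var count = 0L
    values.foreach { v =>
      sketch = sketch.insert(v)
      count += 1
      if (count % 10000 == 0) sketch = sketch.compress() // periodic compaction
    }
    sketch.compress().query(q)
  }
}
{code}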
[jira] [Commented] (SPARK-16283) Implement percentile_approx SQL function
[ https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15431136#comment-15431136 ] Sean Zhong commented on SPARK-16283: Created a sub-task to move QuantileSummaries to package org.apache.spark.sql.util of the catalyst project > Implement percentile_approx SQL function > > > Key: SPARK-16283 > URL: https://issues.apache.org/jira/browse/SPARK-16283 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16283) Implement percentile_approx SQL function
[ https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15431136#comment-15431136 ] Sean Zhong edited comment on SPARK-16283 at 8/22/16 4:35 PM: - Created a sub-task SPARK-17188 to move QuantileSummaries to package org.apache.spark.sql.util of the catalyst project was (Author: clockfly): Created a sub-task to move QuantileSummaries to package org.apache.spark.sql.util of the catalyst project > Implement percentile_approx SQL function > > > Key: SPARK-16283 > URL: https://issues.apache.org/jira/browse/SPARK-16283 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17110) Pyspark with locality ANY throws java.io.StreamCorruptedException
[ https://issues.apache.org/jira/browse/SPARK-17110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15431139#comment-15431139 ] Jonathan Alvarado commented on SPARK-17110: --- Is there a workaround for this issue? I'm affected by it. > Pyspark with locality ANY throws java.io.StreamCorruptedException > > > Key: SPARK-17110 > URL: https://issues.apache.org/jira/browse/SPARK-17110 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 > Environment: Cluster of 2 AWS r3.xlarge nodes launched via ec2 > scripts, Spark 2.0.0, hadoop: yarn, pyspark shell >Reporter: Tomer Kaftan >Priority: Critical > > In Pyspark 2.0.0, any task that accesses cached data non-locally throws a > StreamCorruptedException like the stacktrace below: > {noformat} > WARN TaskSetManager: Lost task 7.0 in stage 2.0 (TID 26, 172.31.26.184): > java.io.StreamCorruptedException: invalid stream header: 12010A80 > at > java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:807) > at java.io.ObjectInputStream.<init>(ObjectInputStream.java:302) > at > org.apache.spark.serializer.JavaDeserializationStream$$anon$1.<init>(JavaSerializer.scala:63) > at > org.apache.spark.serializer.JavaDeserializationStream.<init>(JavaSerializer.scala:63) > at > org.apache.spark.serializer.JavaSerializerInstance.deserializeStream(JavaSerializer.scala:122) > at > org.apache.spark.serializer.SerializerManager.dataDeserializeStream(SerializerManager.scala:146) > at > org.apache.spark.storage.BlockManager$$anonfun$getRemoteValues$1.apply(BlockManager.scala:524) > at > org.apache.spark.storage.BlockManager$$anonfun$getRemoteValues$1.apply(BlockManager.scala:522) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.storage.BlockManager.getRemoteValues(BlockManager.scala:522) > at org.apache.spark.storage.BlockManager.get(BlockManager.scala:609) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:661) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:281) > at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {noformat} > The simplest way I have found to reproduce this is by running the following > code in the pyspark shell, on a cluster of 2 nodes set to use only one worker > core each: > {code} > x = sc.parallelize([1, 1, 1, 1, 1, 1000, 1, 1, 1], numSlices=9).cache() > x.count() > import time > def waitMap(x): > time.sleep(x) > return x > x.map(waitMap).count() > {code} > Or by running the following via spark-submit: > {code} > from pyspark import SparkContext > sc = SparkContext() > x = sc.parallelize([1, 1, 1, 1, 1, 1000, 1, 1, 1], numSlices=9).cache() > x.count() > import time > def waitMap(x): > time.sleep(x) > return x > x.map(waitMap).count() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17188) Moves QuantileSummaries to project catalyst from sql so that it can be used to implement percentile_approx
[ https://issues.apache.org/jira/browse/SPARK-17188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15431179#comment-15431179 ] Apache Spark commented on SPARK-17188: -- User 'clockfly' has created a pull request for this issue: https://github.com/apache/spark/pull/14754 > Moves QuantileSummaries to project catalyst from sql so that it can be used > to implement percentile_approx > -- > > Key: SPARK-17188 > URL: https://issues.apache.org/jira/browse/SPARK-17188 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Sean Zhong > > QuantileSummaries is a useful utility class for statistics. It can be used > by aggregation functions like percentile_approx. > Currently, QuantileSummaries is located in the sql project, in package > org.apache.spark.sql.execution.stat; we should probably move it to the > catalyst project, package org.apache.spark.sql.util. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17188) Moves QuantileSummaries to project catalyst from sql so that it can be used to implement percentile_approx
[ https://issues.apache.org/jira/browse/SPARK-17188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17188: Assignee: (was: Apache Spark) > Moves QuantileSummaries to project catalyst from sql so that it can be used > to implement percentile_approx > -- > > Key: SPARK-17188 > URL: https://issues.apache.org/jira/browse/SPARK-17188 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Sean Zhong > > QuantileSummaries is a useful utility class for statistics. It can be used > by aggregation functions like percentile_approx. > Currently, QuantileSummaries is located in the sql project, in package > org.apache.spark.sql.execution.stat; we should probably move it to the > catalyst project, package org.apache.spark.sql.util. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17188) Moves QuantileSummaries to project catalyst from sql so that it can be used to implement percentile_approx
[ https://issues.apache.org/jira/browse/SPARK-17188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17188: Assignee: Apache Spark > Moves QuantileSummaries to project catalyst from sql so that it can be used > to implement percentile_approx > -- > > Key: SPARK-17188 > URL: https://issues.apache.org/jira/browse/SPARK-17188 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Sean Zhong >Assignee: Apache Spark > > QuantileSummaries is a useful utility class for statistics. It can be used > by aggregation functions like percentile_approx. > Currently, QuantileSummaries is located in the sql project, in package > org.apache.spark.sql.execution.stat; we should probably move it to the > catalyst project, package org.apache.spark.sql.util. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17189) [MINOR] Loosens the interface from UnsafeRow to InternalRow in AggregationIterator when no UnsafeRow-specific method is used
Sean Zhong created SPARK-17189: -- Summary: [MINOR] Loosens the interface from UnsafeRow to InternalRow in AggregationIterator when no UnsafeRow-specific method is used Key: SPARK-17189 URL: https://issues.apache.org/jira/browse/SPARK-17189 Project: Spark Issue Type: Bug Reporter: Sean Zhong Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
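The change presumably amounts to widening parameter types wherever only InternalRow-level methods are called. A schematic before/after in Scala, with invented method names (the actual AggregationIterator signatures may differ):

{code}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.UnsafeRow

object AggregationIteratorSketch {
  // Before: the parameter demands UnsafeRow even though the body only calls
  // methods defined on InternalRow.
  def processBufferBefore(buffer: UnsafeRow): Long = buffer.getLong(0)

  // After: UnsafeRow extends InternalRow, so the loosened signature accepts
  // both while removing an unnecessary coupling to the unsafe representation.
  def processBufferAfter(buffer: InternalRow): Long = buffer.getLong(0)
}
{code}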