[jira] [Commented] (SPARK-10781) Allow certain number of failed tasks and allow job to succeed
[ https://issues.apache.org/jira/browse/SPARK-10781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16421558#comment-16421558 ] Fei Niu commented on SPARK-10781: - This can be a very useful feature. For example, if the sequence file format itself is bad, there is currently no way to catch the exception and move on, which makes some data sets impossible to process. > Allow certain number of failed tasks and allow job to succeed > - > > Key: SPARK-10781 > URL: https://issues.apache.org/jira/browse/SPARK-10781 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.5.0 >Reporter: Thomas Graves >Priority: Major > > MapReduce has the configs mapreduce.map.failures.maxpercent and > mapreduce.reduce.failures.maxpercent which allow a certain percent of > tasks to fail while the job still succeeds. > This could be a useful feature in Spark too, if a job doesn't need all the > tasks to be successful. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
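Until a failures-percentage setting exists, a common workaround is to make the per-record processing itself tolerant, e.g. wrapping the parse in scala.util.Try inside a mapPartitions. A minimal plain-Scala sketch of the pattern; the rdd.mapPartitions usage in the comment is assumed usage for illustration, not code from this issue:

```scala
import scala.util.Try

// Parse one record; a corrupt record throws (here, non-numeric text).
def parseRecord(line: String): Int = line.trim.toInt

// Drop records whose parse fails instead of failing the whole task.
// Inside a Spark job this would run per partition, e.g. (assumed usage):
//   rdd.mapPartitions(tolerantParse)
def tolerantParse(records: Iterator[String]): Iterator[Int] =
  records.flatMap(r => Try(parseRecord(r)).toOption)

val parsed = tolerantParse(Iterator("1", "oops", "3")).toList
// parsed == List(1, 3): the corrupt record is skipped and the job proceeds
```

This skips arbitrarily many bad records, whereas the requested feature would bound the tolerated failure fraction at the scheduler level.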
[jira] [Commented] (SPARK-22393) spark-shell can't find imported types in class constructors, extends clause
[ https://issues.apache.org/jira/browse/SPARK-22393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16421548#comment-16421548 ] Joe Pallas commented on SPARK-22393: The changes that were imported in [https://github.com/apache/spark/pull/19846] don't seem to cover all the cases that the Scala 2.12 changes covered. To be specific, this sequence: {code} import scala.reflect.runtime.{universe => ru} import ru.TypeTag class C[T: TypeTag](value: T) {code} works correctly in Scala 2.12 with -Yrepl-class-based, but does not work in spark-shell 2.3.0. I don't understand the import-handling code enough to understand the problem, however. It figures out that it needs the import for TypeTag but it doesn't recognize that the import depends on the previous import: {noformat} :9: error: not found: value ru import ru.TypeTag ^ {noformat} > spark-shell can't find imported types in class constructors, extends clause > --- > > Key: SPARK-22393 > URL: https://issues.apache.org/jira/browse/SPARK-22393 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.0.2, 2.1.2, 2.2.0 >Reporter: Ryan Williams >Assignee: Mark Petruska >Priority: Minor > Fix For: 2.3.0 > > > {code} > $ spark-shell > … > scala> import org.apache.spark.Partition > import org.apache.spark.Partition > scala> class P(p: Partition) > :11: error: not found: type Partition >class P(p: Partition) > ^ > scala> class P(val index: Int) extends Partition > :11: error: not found: type Partition >class P(val index: Int) extends Partition >^ > {code} > Any class that I {{import}} gives "not found: type ___" when used as a > parameter to a class, or in an extends clause; this applies to classes I > import from JARs I provide via {{--jars}} as well as core Spark classes as > above. > This worked in 1.6.3 but has been broken since 2.0.0. 
[jira] [Comment Edited] (SPARK-12216) Spark failed to delete temp directory
[ https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16421530#comment-16421530 ] Kingsley Jones edited comment on SPARK-12216 at 4/1/18 12:21 AM: - Same issue under Windows 10 and Windows Server 2016 using Java 1.8, Spark 2.2.1, Hadoop 2.7. My tests support the contention of [~IgorBabalich] ... it seems that classloaders instantiated by the code are never closed. On *nix this is not a problem since the files are not locked. However, on Windows the files are locked. In addition to the resources mentioned by Igor, this Oracle bug fix from Java 7 seems relevant: [https://docs.oracle.com/javase/7/docs/technotes/guides/net/ClassLoader.html] A new method "close()" was introduced to address the problem, which shows up on Windows due to the differing treatment of file locks between the Windows and *nix file systems. I would point out that this is a generic Java issue which breaks the cross-platform intention of that platform as a whole. The Oracle blog also contains a post: [https://blogs.oracle.com/corejavatechtips/closing-a-urlclassloader] I have been searching the Apache Spark code-base for classloader instances, in search of any ".close()" action. I could not find any, so I believe [~IgorBabalich] is correct - the issue has to do with classloaders not being closed. I would fix it myself, but thus far it is not clear to me *when* the classloader needs to be closed. That is just ignorance on my part. The question is whether the classloader should be closed while it is still available as a variable at the point where it was instantiated, or later during the ShutdownHookManager cleanup. If the latter, then it is not clear to me how to actually get a list of open classloaders. That is where I am at so far. I am prepared to put some work into this, but I need some help from those who know the codebase to answer the above question - maybe with a well-isolated test.
MY TESTS... This issue has been around in one form or another for at least four years and shows up on many threads. The standard answer is that it is a "permissions issue" to do with Windows. That assertion is objectively false. There is a simple test to prove it. At a Windows prompt, start spark-shell C:\spark\spark-shell then get the temp file directory: scala> sc.getConf.get("spark.repl.class.outputDir") It will be in the %AppData%\Local\Temp tree, e.g. C:\Users\kings\AppData\Local\Temp\spark-d67b262e-f6c8-43d7-8790-731308497f02\repl-4cc87dce-8608-4643-b869-b0287ac4571f where the last file name has a GUID that changes on each iteration. With the spark session still open, go to the Temp directory and try to delete the given directory. You won't be able to... there is a lock on it. Now issue scala> :quit to quit the session. The stack trace will show that ShutdownHookManager tried to delete the directory above but could not. If you now try to delete it through the file system you can. This is because the JVM actually cleans up the locks on exit. So, it is not a permission issue, but a feature of the Windows treatment of file locks. This is the *known issue* that was addressed in the Java bug fix through introduction of a Closeable close() method for URLClassLoader. It was fixed there since many enterprise systems run on Windows. Now... to further test the cause, I used the Windows Linux Subsystem. To access this (post install) you run C:> bash from a command prompt. In order to get this to work, I used the same spark install, but had to install a fresh copy of the JDK on Ubuntu within the Windows bash subsystem. This is standard Ubuntu stuff, but the path to your Windows C drive is /mnt/c If I rerun the same test, the new output of scala> sc.getConf.get("spark.repl.class.outputDir") will be a different folder location under Linux /tmp but with the same setup.
With the spark session still active it is possible to delete the spark folders in the /tmp folder *while the session is still active*. This is the difference between Windows and Linux. While bash is running Ubuntu on Windows, it has the different file-locking behaviour, which means you can delete the spark temp folders while a session is running. If you run through a new session with spark-shell at the Linux prompt and issue :quit it will shut down without any stacktrace error from ShutdownHookManager. So, my conclusions are as follows: 1) this is not a permissions issue as per the common assertion 2) it is a Windows-specific problem for *known* reasons - namely the difference in file-locking as compared with Linux 3) it was considered a *bug* in the Java ecosystem and was fixed as such from Java 1.7 with the .close() method Further... People who need to run Spark on Windows infrastructure (like me) can either run a docker container or use the Windows Linux Subsystem to launch processes.
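For reference, the Java 7 fix discussed above is URLClassLoader.close(). A minimal, self-contained sketch of the open-use-close pattern a Spark fix would need; plain JVM code, not Spark's actual classloader handling:

```scala
import java.net.{URL, URLClassLoader}
import java.nio.file.Files

// URLClassLoader implements Closeable since Java 7. Closing it releases
// the jar/directory handles; on Windows those open handles are exactly
// what stops ShutdownHookManager from deleting the temp directories.
val tempDir = Files.createTempDirectory("repl-classes").toFile
val loader = new URLClassLoader(Array[URL](tempDir.toURI.toURL), null)

// ... load REPL-generated classes through `loader` here ...

loader.close()                 // releases the file locks
val deleted = tempDir.delete() // now succeeds, even on Windows
```

The open question raised in the comment remains: whether Spark should close the loader where it is created or track open loaders and close them in the shutdown hook.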
[jira] [Updated] (SPARK-19842) Informational Referential Integrity Constraints Support in Spark
[ https://issues.apache.org/jira/browse/SPARK-19842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-19842: Target Version/s: 2.4.0, 3.0.0 > Informational Referential Integrity Constraints Support in Spark > > > Key: SPARK-19842 > URL: https://issues.apache.org/jira/browse/SPARK-19842 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Ioana Delaney >Priority: Major > Attachments: InformationalRIConstraints.doc > > > *Informational Referential Integrity Constraints Support in Spark* > This work proposes support for _informational primary key_ and _foreign key > (referential integrity) constraints_ in Spark. The main purpose is to open up > an area of query optimization techniques that rely on referential integrity > constraints semantics. > An _informational_ or _statistical constraint_ is a constraint such as a > _unique_, _primary key_, _foreign key_, or _check constraint_, that can be > used by Spark to improve query performance. Informational constraints are not > enforced by the Spark SQL engine; rather, they are used by Catalyst to > optimize the query processing. They provide semantics information that allows > Catalyst to rewrite queries to eliminate joins, push down aggregates, remove > unnecessary Distinct operations, and perform a number of other optimizations. > Informational constraints are primarily targeted to applications that load > and analyze data that originated from a data warehouse. For such > applications, the conditions for a given constraint are known to be true, so > the constraint does not need to be enforced during data load operations. > The attached document covers constraint definition, metastore storage, > constraint validation, and maintenance. The document shows many examples of > query performance improvements that utilize referential integrity constraints > and can be implemented in Spark. 
> Link to the google doc: > [InformationalRIConstraints|https://docs.google.com/document/d/17r-cOqbKF7Px0xb9L7krKg2-RQB_gD2pxOmklm-ehsw/edit] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
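To illustrate the kind of rewrite such constraints enable, here is a toy plain-Scala model of join elimination (not Catalyst code): if an informational foreign key guarantees every fact-table key matches exactly one dimension row, an inner join that projects only fact columns is the identity and the optimizer may drop it.

```scala
// Parent (dimension) table: unique primary key -> attribute.
val dim = Map(1 -> "US", 2 -> "EU")

// Child (fact) rows (id, dimKey, amount). The informational FK asserts
// every dimKey exists in `dim`; the PK asserts it matches exactly once.
val fact = List((10, 1, 5.0), (11, 2, 7.5), (12, 1, 2.5))

// Inner join to the parent, then project only the child columns.
val joined = fact.flatMap { case (id, k, amt) => dim.get(k).map(_ => (id, k, amt)) }

// Under the constraints the join is the identity on `fact`,
// so Catalyst could eliminate it without reading `dim` at all.
assert(joined == fact)
```

Because the constraints are informational (not enforced), the rewrite is only valid when the data actually satisfies them, which is the warehouse-load scenario the proposal targets.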
[jira] [Commented] (SPARK-23791) Sub-optimal generated code for sum aggregating
[ https://issues.apache.org/jira/browse/SPARK-23791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16421407#comment-16421407 ] Valentin Nikotin commented on SPARK-23791: -- Hi Marco Gaido, I tested performance with 2.2 and 2.3; there was not really any difference between these two versions. 1. I am running a test right now to get timings for 1 to 105 columns (gcp dataproc 1+4 nodes cluster with Spark 2.2). I've noticed this before and confirmed it now: * 1-13 cols, wholeStage runs faster * starting from 14 cols the timing jumped ~4-5 times, while with wholeStage disabled it grows linearly * 84 - 99 -- the error above * 100+ runs with the same timings 2. I have not tried current master yet. I would like to test it next week. I actually found a very similar issue, but without much detail: [SPARK-20479|https://issues.apache.org/jira/browse/SPARK-20479] > Sub-optimal generated code for sum aggregating > -- > > Key: SPARK-23791 > URL: https://issues.apache.org/jira/browse/SPARK-23791 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.2.0, 2.3.0 >Reporter: Valentin Nikotin >Priority: Major > Labels: performance > Original Estimate: 24h > Remaining Estimate: 24h > > It appears that with wholeStage codegen enabled a simple spark job > performing sum aggregation of 50 columns runs ~4 times slower than without > wholeStage codegen. > Please check the test case code. Please note that the udf is only to prevent > elimination optimizations that could be applied to literals. 
> {code:scala} > import org.apache.spark.sql.functions._ > import org.apache.spark.sql.{Column, DataFrame, SparkSession} > import org.apache.spark.sql.internal.SQLConf.WHOLESTAGE_CODEGEN_ENABLED > object SPARK_23791 { > def main(args: Array[String]): Unit = { > val spark = SparkSession > .builder() > .master("local[4]") > .appName("test") > .getOrCreate() > def addConstColumns(prefix: String, cnt: Int, value: Column)(inputDF: > DataFrame) = > (0 until cnt).foldLeft(inputDF)((df, idx) => > df.withColumn(s"$prefix$idx", value)) > val dummy = udf(() => Option.empty[Int]) > def test(cnt: Int = 50, rows: Int = 500, grps: Int = 1000): Double = { > val t0 = System.nanoTime() > spark.range(rows).toDF() > .withColumn("grp", col("id").mod(grps)) > .transform(addConstColumns("null_", cnt, dummy())) > .groupBy("grp") > .agg(sum("null_0"), (1 until cnt).map(idx => sum(s"null_$idx")): _*) > .collect() > val t1 = System.nanoTime() > (t1 - t0) / 1e9 > } > val timings = for (i <- 1 to 3) yield { > spark.sessionState.conf.setConf(WHOLESTAGE_CODEGEN_ENABLED, true) > val with_wholestage = test() > spark.sessionState.conf.setConf(WHOLESTAGE_CODEGEN_ENABLED, false) > val without_wholestage = test() > (with_wholestage, without_wholestage) > } > timings.foreach(println) > println("Press enter ...") > System.in.read() > } > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23837) Create table as select gives exception if the spark generated alias name contains comma
[ https://issues.apache.org/jira/browse/SPARK-23837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shahid K I updated SPARK-23837: --- Priority: Minor (was: Major) > Create table as select gives exception if the spark generated alias name > contains comma > --- > > Key: SPARK-23837 > URL: https://issues.apache.org/jira/browse/SPARK-23837 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.0 >Reporter: Shahid K I >Priority: Minor > > When a Spark-generated alias name contains a comma, the Hive metastore throws > an exception. > > 0: jdbc:hive2://ha-cluster/default> create table a (col1 decimal(18,3), col2 > decimal(18,5)); > +---++ > |Result| > +---++ > +---++ > No rows selected (0.171 seconds) > 0: jdbc:hive2://ha-cluster/default> select col1*col2 from a; > > +---+ > |(CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5)))| > +---+ > > +---+ > No rows selected (0.168 seconds) > 0: jdbc:hive2://ha-cluster/default> create table b as select col1*col2 from > a; > Error: org.apache.spark.sql.AnalysisException: Cannot create a table having a > column whose name contains commas in Hive metastore. Table: `default`.`b`; > Column: (CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5))); > (state=,code=0) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
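A workaround sketch: give the derived column an explicit alias so Spark never hands the metastore a generated name containing commas. The metastoreSafe helper below is hypothetical, for illustration only (not a Spark API), and the DataFrame/SQL lines in the comments are assumed usage:

```scala
// The Hive metastore rejects column names containing commas, so replace
// the auto-generated name with an explicit, metastore-safe alias.
// Hypothetical helper, not a Spark API:
def metastoreSafe(generated: String): String =
  generated.replaceAll("[^A-Za-z0-9_]+", "_").replaceAll("^_+|_+$", "")

// Assumed usage in DataFrame or SQL code with an explicit alias:
//   df.select((col("col1") * col("col2")).alias("col1_x_col2"))
//   spark.sql("create table b as select col1*col2 as col1_x_col2 from a")
val generated = "(CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5)))"
val safe = metastoreSafe(generated)
// `safe` contains no commas or parentheses, so the CTAS no longer fails
```

This only sidesteps the user-facing failure; the underlying issue is that Spark generates an alias the metastore cannot store.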
[jira] [Updated] (SPARK-23837) Create table as select gives exception if the spark generated alias name contains comma
[ https://issues.apache.org/jira/browse/SPARK-23837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shahid K I updated SPARK-23837: --- Description: For spark generated alias name contains comma, Hive metastore throws exception. 0: jdbc:hive2://ha-cluster/default> create table a (col1 decimal(18,3), col2 decimal(18,5)); +---++ |Result| +---++ +---++ No rows selected (0.171 seconds) 0: jdbc:hive2://ha-cluster/default> select col1*col2 from a; +---+ |(CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5)))| +---+ +---+ No rows selected (0.168 seconds) 0: jdbc:hive2://ha-cluster/default> create table b as select col1*col2 from a; Error: org.apache.spark.sql.AnalysisException: Cannot create a table having a column whose name contains commas in Hive metastore. Table: `default`.`b`; Column: (CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5))); (state=,code=0) was: For spark generated alias name contains comma, Hive metastore throws exception. 0: jdbc:hive2://ha-cluster/default> create table a (col1 decimal(18,3), col2 decimal(18,5)); +-+--+ | Result | +-+--+ +-+--+ No rows selected (0.171 seconds) 0: jdbc:hive2://ha-cluster/default> select col1*col2 from a; +---+ | (CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5))) | +---+ +---+ No rows selected (0.168 seconds) 0: jdbc:hive2://ha-cluster/default> create table b as select col1*col2 from a; Error: org.apache.spark.sql.AnalysisException: Cannot create a table having a column whose name contains commas in Hive metastore. 
Table: `default`.`b`; Column: (CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5))); (state=,code=0) > Create table as select gives exception if the spark generated alias name > contains comma > --- > > Key: SPARK-23837 > URL: https://issues.apache.org/jira/browse/SPARK-23837 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.0 >Reporter: Shahid K I >Priority: Major > > For spark generated alias name contains comma, Hive metastore throws > exception. > > 0: jdbc:hive2://ha-cluster/default> create table a (col1 decimal(18,3), col2 > decimal(18,5)); > +---++ > |Result| > +---++ > +---++ > No rows selected (0.171 seconds) > 0: jdbc:hive2://ha-cluster/default> select col1*col2 from a; > > +---+ > |(CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5)))| > +---+ > > +---+ > No rows selected (0.168 seconds) > 0: jdbc:hive2://ha-cluster/default> create table b as select col1*col2 from > a; > Error: org.apache.spark.sql.AnalysisException: Cannot create a table having a > column whose name contains commas in Hive metastore. Table: `default`.`b`; > Column: (CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5))); > (state=,code=0) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23837) Create table as select gives exception if the spark generated alias name contains comma
[ https://issues.apache.org/jira/browse/SPARK-23837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shahid K I updated SPARK-23837: --- Description: For spark generated alias name contains comma, Hive metastore throws exception. 0: jdbc:hive2://ha-cluster/default> create table a (col1 decimal(18,3), col2 decimal(18,5)); +-+--+ | Result | +-+--+ +-+--+ No rows selected (0.171 seconds) 0: jdbc:hive2://ha-cluster/default> select col1*col2 from a; +---+ | (CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5))) | +---+ +---+ No rows selected (0.168 seconds) 0: jdbc:hive2://ha-cluster/default> create table b as select col1*col2 from a; Error: org.apache.spark.sql.AnalysisException: Cannot create a table having a column whose name contains commas in Hive metastore. Table: `default`.`b`; Column: (CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5))); (state=,code=0) was: For spark generated alias name contains comma, Hive metastore throws exception. 0: jdbc:hive2://ha-cluster/default> create table a (col1 decimal(18,3), col2 decimal(18,5)); +-+--+ | Result | +-+--+ +-+--+ No rows selected (0.171 seconds) 0: jdbc:hive2://ha-cluster/default> select col1*col2 from a; +---+ | (CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5))) | +---+ +---+ No rows selected (0.168 seconds) 0: jdbc:hive2://ha-cluster/default> create table b as select col1*col2 from a; Error: org.apache.spark.sql.AnalysisException: Cannot create a table having a column whose name contains commas in Hive metastore. 
Table: `default`.`b`; Column: (CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5))); (state=,code=0) > Create table as select gives exception if the spark generated alias name > contains comma > --- > > Key: SPARK-23837 > URL: https://issues.apache.org/jira/browse/SPARK-23837 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.0 >Reporter: Shahid K I >Priority: Major > > For spark generated alias name contains comma, Hive metastore throws > exception. > 0: jdbc:hive2://ha-cluster/default> create table a (col1 decimal(18,3), col2 > decimal(18,5)); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.171 seconds) > 0: jdbc:hive2://ha-cluster/default> select col1*col2 from a; > +---+ > | (CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5))) | > +---+ > +---+ > No rows selected (0.168 seconds) > 0: jdbc:hive2://ha-cluster/default> create table b as select col1*col2 from a; > Error: org.apache.spark.sql.AnalysisException: Cannot create a table having a > column whose name contains commas in Hive metastore. Table: `default`.`b`; > Column: (CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5))); > (state=,code=0) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23837) Create table as select gives exception if the spark generated alias name contains comma
[ https://issues.apache.org/jira/browse/SPARK-23837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shahid K I updated SPARK-23837:
-------------------------------
    Description:
When a Spark-generated alias name contains a comma, the Hive metastore throws an exception:

0: jdbc:hive2://ha-cluster/default> create table a (col1 decimal(18,3), col2 decimal(18,5));
+---------+--+
| Result  |
+---------+--+
+---------+--+
No rows selected (0.171 seconds)
0: jdbc:hive2://ha-cluster/default> select col1*col2 from a;
+--------------------------------------------------------------+--+
| (CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5)))  |
+--------------------------------------------------------------+--+
+--------------------------------------------------------------+--+
No rows selected (0.168 seconds)
0: jdbc:hive2://ha-cluster/default> create table b as select col1*col2 from a;
Error: org.apache.spark.sql.AnalysisException: Cannot create a table having a column whose name contains commas in Hive metastore. Table: `default`.`b`; Column: (CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5))); (state=,code=0)

  was: the same reproduction, with the error message bolded and a trailing screenshot (!image-2018-03-31-19-57-38-496.png!)

> Create table as select gives exception if the spark generated alias name contains comma
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-23837
>                 URL: https://issues.apache.org/jira/browse/SPARK-23837
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.1, 2.3.0
>            Reporter: Shahid K I
>            Priority: Major
>
> When a Spark-generated alias name contains a comma, the Hive metastore throws an exception.
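The error comes from a guard on column names before the table definition reaches the Hive metastore, which rejects any column name containing a comma. A minimal Python sketch of such a guard (a hypothetical helper for illustration, not Spark's actual implementation) shows why the auto-generated alias trips it while an explicit alias does not:

```python
def check_column_names(table: str, column_names: list) -> None:
    # Hypothetical stand-in for Spark's pre-flight check: reject any
    # column name containing a comma before it reaches the metastore.
    for name in column_names:
        if "," in name:
            raise ValueError(
                "Cannot create a table having a column whose name contains "
                f"commas in Hive metastore. Table: {table}; Column: {name}")

# The auto-generated alias for col1*col2 embeds the casts, commas included:
generated_alias = "(CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5)))"

try:
    check_column_names("`default`.`b`", [generated_alias])
except ValueError as err:
    print(err)

# Writing CREATE TABLE b AS SELECT col1 * col2 AS product FROM a instead
# gives the column a comma-free name, so the same check passes:
check_column_names("`default`.`b`", ["product"])
```

Giving the expression an explicit alias in the CTAS query is one way to sidestep the restriction without any change on the Spark side.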
[jira] [Updated] (SPARK-23837) Create table as select gives exception if the spark generated alias name contains comma
[ https://issues.apache.org/jira/browse/SPARK-23837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shahid K I updated SPARK-23837:
-------------------------------
    Description:
When a Spark-generated alias name contains a comma, the Hive metastore throws an exception:

0: jdbc:hive2://ha-cluster/default> create table a (col1 decimal(18,3), col2 decimal(18,5));
+---------+--+
| Result  |
+---------+--+
+---------+--+
No rows selected (0.171 seconds)
0: jdbc:hive2://ha-cluster/default> select col1*col2 from a;
+--------------------------------------------------------------+--+
| (CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5)))  |
+--------------------------------------------------------------+--+
+--------------------------------------------------------------+--+
No rows selected (0.168 seconds)
0: jdbc:hive2://ha-cluster/default> create table b as select col1*col2 from a;
*Error: org.apache.spark.sql.AnalysisException: Cannot create a table having a column whose name contains commas in Hive metastore. Table: `default`.`b`; Column: (CAST(col1 AS DECIMAL(20,5)) * CAST(col2 AS DECIMAL(20,5))); (state=,code=0)*

!image-2018-03-31-19-57-38-496.png!

  was:
When a Spark-generated alias name contains a comma, the Hive metastore throws an exception.

!image-2018-03-31-19-57-38-496.png!

> Create table as select gives exception if the spark generated alias name contains comma
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-23837
>                 URL: https://issues.apache.org/jira/browse/SPARK-23837
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.1, 2.3.0
>            Reporter: Shahid K I
>            Priority: Major
>
> When a Spark-generated alias name contains a comma, the Hive metastore throws an exception.
[jira] [Commented] (SPARK-23661) Implement treeAggregate on Dataset API
[ https://issues.apache.org/jira/browse/SPARK-23661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16421348#comment-16421348 ]

Liang-Chi Hsieh commented on SPARK-23661:
-----------------------------------------

For the implementation of {{Dataset.treeAggregate}}, I'm wondering whether we need to support SQL tree aggregation in all cases. {{RDD.treeAggregate}} can be seen as grouping without keys; that is the case where tree aggregation can help. For grouping by keys, I doubt it really performs much better than non-tree aggregation. cc [~cloud_fan]

> Implement treeAggregate on Dataset API
> --------------------------------------
>
>                 Key: SPARK-23661
>                 URL: https://issues.apache.org/jira/browse/SPARK-23661
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Liang-Chi Hsieh
>            Priority: Major
>
> Many algorithms in MLlib have still not migrated their internal computing workload from {{RDD}} to {{DataFrame}}. {{treeAggregate}} is one of the obstacles we need to address to complete that migration.
> This ticket is opened to provide {{treeAggregate}} on the Dataset API. For now this should be a private API used by the ML component.
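The benefit being discussed, merging partial aggregates in a limited number of rounds rather than reducing everything at the driver in one step, can be sketched outside Spark. This is a toy Python model of the shape of {{RDD.treeAggregate}}, not its implementation; the in-memory "partitions" and the pairwise merge schedule are illustrative assumptions:

```python
from functools import reduce

def tree_aggregate(partitions, zero, seq_op, comb_op, depth=2):
    # Aggregate each partition locally first (the map-side seqOp pass).
    partials = [reduce(seq_op, part, zero) for part in partitions]
    # Merge partial results pairwise for up to `depth` rounds, roughly
    # halving their number each round instead of sending every partial
    # to a single reducer at once.
    for _ in range(depth):
        if len(partials) <= 1:
            break
        partials = [reduce(comb_op, partials[i + 1:i + 2], partials[i])
                    for i in range(0, len(partials), 2)]
    # Whatever remains is merged in one final pass.
    return reduce(comb_op, partials[1:], partials[0])

parts = [[1, 2], [3, 4], [5, 6], [7, 8]]  # four mock "partitions"
print(tree_aggregate(parts, 0, lambda a, b: a + b, lambda a, b: a + b))  # → 36
```

This mirrors why the win shows up for keyless aggregation: with one global result, the extra combine rounds shrink what the final reducer must handle, whereas grouping by keys already spreads the combine work across reducers.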
[jira] [Commented] (SPARK-23835) When Dataset.as converts column from nullable to non-nullable type, null Doubles are converted silently to -1
[ https://issues.apache.org/jira/browse/SPARK-23835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16421346#comment-16421346 ]

Liang-Chi Hsieh commented on SPARK-23835:
-----------------------------------------

What would the better behavior be?

> When Dataset.as converts column from nullable to non-nullable type, null Doubles are converted silently to -1
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-23835
>                 URL: https://issues.apache.org/jira/browse/SPARK-23835
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Joseph K. Bradley
>            Priority: Major
>
> I constructed a DataFrame with a nullable java.lang.Double column (and an extra Double column), then converted it to a Dataset using {{as[(Double, Double)]}}. When the Dataset is shown, it has a null. When it is collected and printed, the null is silently converted to a -1.
> Code snippet to reproduce this:
> {code}
> val localSpark = spark
> import localSpark.implicits._
> val df = Seq[(java.lang.Double, Double)](
>   (1.0, 2.0),
>   (3.0, 4.0),
>   (Double.NaN, 5.0),
>   (null, 6.0)
> ).toDF("a", "b")
> df.show()  // OUTPUT 1: has null
> df.printSchema()
> val data = df.as[(Double, Double)]
> data.show()  // OUTPUT 2: has null
> data.collect().foreach(println)  // OUTPUT 3: has -1
> {code}
> OUTPUT 1 and 2:
> {code}
> +----+---+
> |   a|  b|
> +----+---+
> | 1.0|2.0|
> | 3.0|4.0|
> | NaN|5.0|
> |null|6.0|
> +----+---+
> {code}
> OUTPUT 3:
> {code}
> (1.0,2.0)
> (3.0,4.0)
> (NaN,5.0)
> (-1.0,6.0)
> {code}
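The -1.0 in OUTPUT 3 is what a null looks like after being forced into a primitive slot: a primitive Double has no representation for null, so a sentinel silently takes its place. A minimal Python model of that unboxing step (the sentinel value and helper name are illustrative, not Spark's generated code):

```python
NULL_SENTINEL = -1.0  # illustrative stand-in; a primitive double cannot hold null

def unbox_double(value):
    # Copying a nullable boxed Double into a non-nullable primitive field
    # loses the null-ness; a sentinel leaks through instead of an error.
    return NULL_SENTINEL if value is None else value

rows = [(1.0, 2.0), (3.0, 4.0), (float("nan"), 5.0), (None, 6.0)]
collected = [(unbox_double(a), b) for a, b in rows]
print(collected)  # the None row comes back as (-1.0, 6.0), mirroring OUTPUT 3
```

In Spark itself, a common way to keep the null representable is to map the nullable column to {{Option[Double]}} (i.e. {{df.as[(Option[Double], Double)]}}), so it arrives as {{None}} rather than a sentinel.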
[jira] [Created] (SPARK-23837) Create table as select gives exception if the spark generated alias name contains comma
Shahid K I created SPARK-23837:
----------------------------------

             Summary: Create table as select gives exception if the spark generated alias name contains comma
                 Key: SPARK-23837
                 URL: https://issues.apache.org/jira/browse/SPARK-23837
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.0, 2.2.1
            Reporter: Shahid K I

When a Spark-generated alias name contains a comma, the Hive metastore throws an exception.

!image-2018-03-31-19-57-38-496.png!
[jira] [Commented] (SPARK-19826) spark.ml Python API for PIC
[ https://issues.apache.org/jira/browse/SPARK-19826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16421215#comment-16421215 ]

Huaxin Gao commented on SPARK-19826:
------------------------------------

I coded the PIC Python API based on the changes in SPARK-15784 "Add Power Iteration Clustering to spark.ml" ([https://github.com/apache/spark/pull/15770]). I will submit a PR once PR 15770 is merged.

> spark.ml Python API for PIC
> ---------------------------
>
>                 Key: SPARK-19826
>                 URL: https://issues.apache.org/jira/browse/SPARK-19826
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML, PySpark
>    Affects Versions: 2.1.0
>            Reporter: Felix Cheung
>            Priority: Major