[jira] [Commented] (SPARK-29239) Subquery should not cause NPE when eliminating subexpression
[ https://issues.apache.org/jira/browse/SPARK-29239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16937495#comment-16937495 ]

Liang-Chi Hsieh commented on SPARK-29239:
-----------------------------------------

I added SPARK-29221 to the title of the PR.

> Subquery should not cause NPE when eliminating subexpression
> ------------------------------------------------------------
>
>                 Key: SPARK-29239
>                 URL: https://issues.apache.org/jira/browse/SPARK-29239
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Liang-Chi Hsieh
>            Priority: Major
>
> Subexpression elimination can cause an NPE when applied to an execution
> subquery expression such as ScalarSubquery, because PlanExpression wraps a
> query plan. Comparing query plans on an executor while eliminating
> subexpressions can trigger unexpected errors, such as an NPE when accessing
> transient fields.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29239) Subquery should not cause NPE when eliminating subexpression
[ https://issues.apache.org/jira/browse/SPARK-29239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16937486#comment-16937486 ]

Liang-Chi Hsieh commented on SPARK-29239:
-----------------------------------------

Yes.
[jira] [Created] (SPARK-29239) Subquery should not cause NPE when eliminating subexpression
Liang-Chi Hsieh created SPARK-29239:
---------------------------------------

             Summary: Subquery should not cause NPE when eliminating subexpression
                 Key: SPARK-29239
                 URL: https://issues.apache.org/jira/browse/SPARK-29239
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Liang-Chi Hsieh


Subexpression elimination can cause an NPE when applied to an execution
subquery expression such as ScalarSubquery, because PlanExpression wraps a
query plan. Comparing query plans on an executor while eliminating
subexpressions can trigger unexpected errors, such as an NPE when accessing
transient fields.
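The failure mode described above can be reproduced outside Spark. The sketch below is plain Java with invented names (PlanLike is only a stand-in for a PlanExpression-style wrapper, not Spark's actual class): a transient field is null after the object is deserialized on an executor, so an equality check that touches it during subexpression elimination throws an NPE.

```java
import java.io.*;

// Hypothetical stand-in for PlanExpression: equals() touches a transient
// field, which is null after deserialization on the executor side.
class PlanLike implements Serializable {
    transient Object plan;   // excluded from serialization; null after round trip
    String id;               // ordinary field; survives the round trip

    PlanLike(String id, Object plan) { this.id = id; this.plan = plan; }

    @Override public boolean equals(Object o) {
        // NPE here when 'plan' is null on the deserialized copy
        return o instanceof PlanLike && plan.equals(((PlanLike) o).plan);
    }

    // Simulate shipping the expression from driver to executor.
    static PlanLike roundTrip(PlanLike p) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new ObjectOutputStream(bos).writeObject(p);
        return (PlanLike) new ObjectInputStream(
            new ByteArrayInputStream(bos.toByteArray())).readObject();
    }

    public static void main(String[] args) throws Exception {
        PlanLike onDriver = new PlanLike("q1", "SELECT 1");
        PlanLike onExecutor = roundTrip(onDriver);
        try {
            // Subexpression elimination compares expressions for equality.
            onExecutor.equals(onDriver);
        } catch (NullPointerException e) {
            System.out.println("NPE comparing deserialized plan wrapper");
        }
    }
}
```

This is why the fix direction is to skip (or specially handle) PlanExpression when eliminating subexpressions on executors, rather than trying to compare wrapped plans there.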
[jira] [Resolved] (SPARK-29181) Cache preferred locations of checkpointed RDD
[ https://issues.apache.org/jira/browse/SPARK-29181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liang-Chi Hsieh resolved SPARK-29181.
-------------------------------------
    Resolution: Duplicate
[jira] [Assigned] (SPARK-29182) Cache preferred locations of checkpointed RDD
[ https://issues.apache.org/jira/browse/SPARK-29182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liang-Chi Hsieh reassigned SPARK-29182:
---------------------------------------
    Assignee: Liang-Chi Hsieh
[jira] [Commented] (SPARK-29181) Cache preferred locations of checkpointed RDD
[ https://issues.apache.org/jira/browse/SPARK-29181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16933955#comment-16933955 ]

Liang-Chi Hsieh commented on SPARK-29181:
-----------------------------------------

[~dongjoon] Thanks. I was not aware I had created a duplicate.
[jira] [Created] (SPARK-29182) Cache preferred locations of checkpointed RDD
Liang-Chi Hsieh created SPARK-29182:
---------------------------------------

             Summary: Cache preferred locations of checkpointed RDD
                 Key: SPARK-29182
                 URL: https://issues.apache.org/jira/browse/SPARK-29182
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 3.0.0
            Reporter: Liang-Chi Hsieh


One Spark job in our cluster fits many ALS models in parallel. The fitting
goes well, but when we then union all the factors, the union operation is
very slow.

The driver stack dump shows that the driver spends a lot of time computing
preferred locations. Because we checkpoint the training data before fitting
ALS, the time is spent in ReliableCheckpointRDD.getPreferredLocations, which
calls the DFS interface to query file status and block locations. Since a
large number of partitions derive from the checkpointed RDD, the union spends
a lot of time querying the same information repeatedly.

This proposes adding a Spark config to control the caching behavior of
ReliableCheckpointRDD.getPreferredLocations. When enabled,
getPreferredLocations computes the preferred locations once and caches them
for later use.
[jira] [Created] (SPARK-29181) Cache preferred locations of checkpointed RDD
Liang-Chi Hsieh created SPARK-29181:
---------------------------------------

             Summary: Cache preferred locations of checkpointed RDD
                 Key: SPARK-29181
                 URL: https://issues.apache.org/jira/browse/SPARK-29181
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 3.0.0
            Reporter: Liang-Chi Hsieh


One Spark job in our cluster fits many ALS models in parallel. The fitting
goes well, but when we then union all the factors, the union operation is
very slow.

The driver stack dump shows that the driver spends a lot of time computing
preferred locations. Because we checkpoint the training data before fitting
ALS, the time is spent in ReliableCheckpointRDD.getPreferredLocations, which
calls the DFS interface to query file status and block locations. Since a
large number of partitions derive from the checkpointed RDD, the union spends
a lot of time querying the same information repeatedly.

This proposes adding a Spark config to control the caching behavior of
ReliableCheckpointRDD.getPreferredLocations. When enabled,
getPreferredLocations computes the preferred locations once and caches them
for later use.
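The shape of the proposed change can be sketched in a few lines of plain Java. All names below are invented for illustration (this is not Spark's actual implementation): an expensive per-partition DFS lookup is memoized behind a config flag, so repeated queries for the same partition hit the cache.

```java
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch of config-guarded caching for an expensive per-partition lookup,
// analogous to caching ReliableCheckpointRDD.getPreferredLocations results.
class PreferredLocationCache {
    private final boolean cacheEnabled;                         // the proposed config
    private final Function<Integer, List<String>> dfsLookup;    // expensive DFS call
    private final Map<Integer, List<String>> cache = new ConcurrentHashMap<>();
    int dfsCalls = 0;                                           // instrumentation only

    PreferredLocationCache(boolean enabled, Function<Integer, List<String>> lookup) {
        this.cacheEnabled = enabled;
        this.dfsLookup = lookup;
    }

    List<String> getPreferredLocations(int partition) {
        if (!cacheEnabled) {
            dfsCalls++;
            return dfsLookup.apply(partition);   // query DFS every time
        }
        // Compute once per partition, reuse afterwards.
        return cache.computeIfAbsent(partition, p -> {
            dfsCalls++;
            return dfsLookup.apply(p);
        });
    }

    public static void main(String[] args) {
        PreferredLocationCache c = new PreferredLocationCache(true,
            p -> Arrays.asList("host-" + (p % 3)));
        // A union over many derived RDDs asks for the same partitions repeatedly.
        for (int i = 0; i < 1000; i++) c.getPreferredLocations(i % 10);
        System.out.println("DFS calls: " + c.dfsCalls);   // 10 with caching, not 1000
    }
}
```

The trade-off this glosses over (and why a config flag makes sense) is that cached locations can go stale if HDFS blocks move; a job that reads checkpoint data many times in a short window benefits, while long-lived jobs may prefer fresh lookups.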
[jira] [Commented] (SPARK-29042) Sampling-based RDD with unordered input should be INDETERMINATE
[ https://issues.apache.org/jira/browse/SPARK-29042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932784#comment-16932784 ]

Liang-Chi Hsieh commented on SPARK-29042:
-----------------------------------------

[~hyukjin.kwon] Did I set the fix versions and affects version correctly
after the backport? Can you take a look? Thanks.

> Sampling-based RDD with unordered input should be INDETERMINATE
> ----------------------------------------------------------------
>
>                 Key: SPARK-29042
>                 URL: https://issues.apache.org/jira/browse/SPARK-29042
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Liang-Chi Hsieh
>            Assignee: Liang-Chi Hsieh
>            Priority: Major
>              Labels: correctness
>             Fix For: 2.4.5, 3.0.0
>
> We have found and fixed correctness issues when RDD output is INDETERMINATE.
> One missing part is sampling-based RDDs. This kind of RDD is order sensitive
> to its input, so a sampling-based RDD with unordered input should be
> INDETERMINATE.
[jira] [Updated] (SPARK-29042) Sampling-based RDD with unordered input should be INDETERMINATE
[ https://issues.apache.org/jira/browse/SPARK-29042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liang-Chi Hsieh updated SPARK-29042:
------------------------------------
    Fix Version/s: 2.4.5
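Why sampling is order sensitive can be shown with a small, self-contained sketch (plain Java, not Spark's PartitionwiseSampledRDD): Bernoulli sampling with a fixed seed decides whether to keep element i based on the i-th random draw, so the kept *positions* are deterministic while the kept *values* depend entirely on input order. If an upstream shuffle can reorder the input between task retries, the sample output changes, which is exactly the INDETERMINATE situation.

```java
import java.util.*;

// Seeded Bernoulli sampling: the same positions are kept on every run,
// so reordering the same elements changes which values survive.
class OrderSensitiveSample {
    static List<Integer> sample(List<Integer> input, long seed) {
        Random rng = new Random(seed);
        List<Integer> kept = new ArrayList<>();
        for (int v : input) {
            if (rng.nextDouble() < 0.5) kept.add(v);   // keep decision by position
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Integer> a = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8);
        List<Integer> b = new ArrayList<>(a);
        Collections.reverse(b);                        // same elements, other order
        System.out.println(sample(a, 42L));
        System.out.println(sample(b, 42L));            // generally a different set
    }
}
```

Marking such RDDs INDETERMINATE tells the scheduler it must recompute all partitions (not just the lost ones) after a fetch failure, instead of silently mixing outputs from two inconsistent runs.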
[jira] [Resolved] (SPARK-22796) Add multiple column support to PySpark QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-22796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liang-Chi Hsieh resolved SPARK-22796.
-------------------------------------
    Fix Version/s: 3.0.0
       Resolution: Fixed

Issue resolved by pull request 25812
[https://github.com/apache/spark/pull/25812]

> Add multiple column support to PySpark QuantileDiscretizer
> ----------------------------------------------------------
>
>                 Key: SPARK-22796
>                 URL: https://issues.apache.org/jira/browse/SPARK-22796
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, PySpark
>    Affects Versions: 2.3.0
>            Reporter: Nick Pentreath
>            Assignee: Huaxin Gao
>            Priority: Major
>             Fix For: 3.0.0
[jira] [Assigned] (SPARK-22796) Add multiple column support to PySpark QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-22796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liang-Chi Hsieh reassigned SPARK-22796:
---------------------------------------
    Assignee: Huaxin Gao
[jira] [Commented] (SPARK-28927) ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances
[ https://issues.apache.org/jira/browse/SPARK-28927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930827#comment-16930827 ]

Liang-Chi Hsieh commented on SPARK-28927:
-----------------------------------------

Regarding the unstable AUC issue: the nondeterministic training data, even
when it does not cause an ArrayIndexOutOfBoundsException, can also cause
wrong matching when computing factors. I don't have evidence that this is
the reason, but it is possible.

> ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets
> with 12 billion instances
> ---
>
>                 Key: SPARK-28927
>                 URL: https://issues.apache.org/jira/browse/SPARK-28927
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.2.1
>            Reporter: Qiang Wang
>            Assignee: Liang-Chi Hsieh
>            Priority: Major
>         Attachments: image-2019-09-02-11-55-33-596.png
>
> The stack trace is below:
> {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 BlockManager: Block rdd_10916_493 could not be removed as it was not found on disk or in memory
> 19/08/28 07:00:41 ERROR Executor task launch worker for task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074)
> java.lang.ArrayIndexOutOfBoundsException: 6741
> at org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460)
> at org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440)
> at org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
> at org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
> at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041)
> at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972)
> at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032)
> at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763)
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
> at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141)
> at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137)
> at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
> at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {quote}
> This exception happened sometimes. We also found that the AUC metric was not
> stable when evaluating the inner product of the user factors and the item
> factors with the same dataset and configuration. AUC varied from 0.60 to
> 0.67, which is not stable enough for a production environment.
> Dataset capacity: ~12 billion ratings
> Here is our code:
> {code:java}
> val hivedata = sc.sql(sqltext).select("id", "dpid", "score", "tag")
> .repartitio
[jira] [Commented] (SPARK-26205) Optimize InSet expression for bytes, shorts, ints, dates
[ https://issues.apache.org/jira/browse/SPARK-26205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930787#comment-16930787 ]

Liang-Chi Hsieh commented on SPARK-26205:
-----------------------------------------

[~cloud_fan] I see now. Created SPARK-29100.

> Optimize InSet expression for bytes, shorts, ints, dates
> ---------------------------------------------------------
>
>                 Key: SPARK-26205
>                 URL: https://issues.apache.org/jira/browse/SPARK-26205
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Anton Okolnychyi
>            Assignee: Anton Okolnychyi
>            Priority: Major
>             Fix For: 3.0.0
>
> {{In}} expressions are compiled into a sequence of if-else statements, which
> results in O(n) time complexity. {{InSet}} is an optimized version of
> {{In}}, which is supposed to improve performance when the number of elements
> is big enough. However, {{InSet}} actually degrades performance in many
> cases due to various reasons (benchmarks were created in SPARK-26203 and
> solutions to the boxing problem are discussed in SPARK-26204).
> The main idea of this JIRA is to use Java {{switch}} statements to
> significantly improve the performance of {{InSet}} expressions for bytes,
> shorts, ints, and dates. All {{switch}} statements are compiled into
> {{tableswitch}} and {{lookupswitch}} bytecode instructions. We will have
> O(1) time complexity if our case values are compact and {{tableswitch}} can
> be used. Otherwise, {{lookupswitch}} will give us O(log n). Our local
> benchmarks show that this logic is more than two times faster, even on 500+
> elements, than using primitive collections in {{InSet}} expressions. As
> Spark is using Scala {{HashSet}} right now, the performance gain will be
> even bigger.
> See
> [here|https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-3.html#jvms-3.10]
> and
> [here|https://stackoverflow.com/questions/10287700/difference-between-jvms-lookupswitch-and-tableswitch]
> for more information.
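The two code shapes being compared can be written out directly. This is a hand-written illustration of the pattern (the membership values are arbitrary; Spark's generated code differs in detail): the first method is the if-else chain an `In` expression compiles to, the second is the `switch` form that javac compiles to a `tableswitch` or `lookupswitch` instruction.

```java
// Membership test over a fixed set of ints, in the two codegen shapes
// discussed in SPARK-26205. Set values {3, 7, 42} are arbitrary examples.
class InSetSwitch {
    // Shape of In codegen: one comparison per element, O(n).
    static boolean inSetLinear(int v) {
        if (v == 3) return true;
        else if (v == 7) return true;
        else if (v == 42) return true;
        else return false;
    }

    // Shape of the optimized InSet codegen: compiled to tableswitch
    // (compact values) or lookupswitch (sparse values), O(1) / O(log n).
    static boolean inSetSwitch(int v) {
        switch (v) {
            case 3:
            case 7:
            case 42:
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(inSetSwitch(7));    // true
        System.out.println(inSetSwitch(8));    // false
    }
}
```

Both methods compute the same predicate; the win is purely in how the JVM dispatches. This also explains why the optimization is restricted to bytes, shorts, ints, and dates: `switch` case labels must be compile-time int-convertible constants.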
[jira] [Assigned] (SPARK-29100) Codegen with switch in InSet expression causes compilation error
[ https://issues.apache.org/jira/browse/SPARK-29100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liang-Chi Hsieh reassigned SPARK-29100:
---------------------------------------
    Assignee: Liang-Chi Hsieh

> Codegen with switch in InSet expression causes compilation error
> -----------------------------------------------------------------
>
>                 Key: SPARK-29100
>                 URL: https://issues.apache.org/jira/browse/SPARK-29100
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Liang-Chi Hsieh
>            Assignee: Liang-Chi Hsieh
>            Priority: Major
>
> SPARK-26205 adds an optimization to InSet that generates a Java switch
> condition for certain cases. When the given set is empty, codegen can cause
> a compilation error:
>
> [info] - SPARK-29100: InSet with empty input set *** FAILED *** (58 milliseconds)
> [info] Code generation of input[0, int, true] INSET () failed:
> [info] org.codehaus.janino.InternalCompilerException: failed to compile:
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass" in
> "generated.java": Compiling "apply(java.lang.Object _i)";
> apply(java.lang.Object _i): Operand stack inconsistent at offset 45:
> Previous size 0, now 1
> [info] at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1308)
> [info] at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1386)
> [info] at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1383)
[jira] [Updated] (SPARK-29100) Codegen with switch in InSet expression causes compilation error
[ https://issues.apache.org/jira/browse/SPARK-29100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liang-Chi Hsieh updated SPARK-29100:
------------------------------------
    Description:
SPARK-26205 adds an optimization to InSet that generates a Java switch
condition for certain cases. When the given set is empty, codegen can cause
a compilation error:

[info] - SPARK-29100: InSet with empty input set *** FAILED *** (58 milliseconds)
[info] Code generation of input[0, int, true] INSET () failed:
[info] org.codehaus.janino.InternalCompilerException: failed to compile:
org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass" in
"generated.java": Compiling "apply(java.lang.Object _i)";
apply(java.lang.Object _i): Operand stack inconsistent at offset 45:
Previous size 0, now 1
[info] at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1308)
[info] at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1386)
[info] at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1383)

  was:
SPARK-26205 adds an optimization to InSet that generates a Java switch
condition for certain cases. When the given set is empty, it is possible
that codegen causes a compilation error.
[jira] [Created] (SPARK-29100) Codegen with switch in InSet expression causes compilation error
Liang-Chi Hsieh created SPARK-29100:
---------------------------------------

             Summary: Codegen with switch in InSet expression causes compilation error
                 Key: SPARK-29100
                 URL: https://issues.apache.org/jira/browse/SPARK-29100
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Liang-Chi Hsieh


SPARK-26205 adds an optimization to InSet that generates a Java switch
condition for certain cases. When the given set is empty, it is possible
that codegen causes a compilation error.
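The fix direction can be sketched as a guard in the code generator. The generator below is purely illustrative (not Spark's CodeGenerator API): a `switch` emitted with zero `case` labels is the degenerate form that tripped Janino, so an empty set is special-cased to a constant-false result before any switch is emitted.

```java
// Toy string-based code generator for an InSet-style membership test,
// showing why the empty set needs a special case.
class InSetCodegen {
    static String generate(int[] values) {
        if (values.length == 0) {
            // Nothing can match an empty set; emit a constant instead of a
            // degenerate switch with no case labels.
            return "boolean result = false;";
        }
        StringBuilder sb = new StringBuilder("boolean result;\nswitch (v) {\n");
        for (int v : values) {
            sb.append("  case ").append(v).append(":\n");
        }
        sb.append("    result = true; break;\n")
          .append("  default:\n")
          .append("    result = false;\n")
          .append("}");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(generate(new int[0]));
        System.out.println(generate(new int[]{3, 7}));
    }
}
```

The general lesson for any codegen path: every syntactic template needs a defined output for its zero-element input, because the downstream compiler (Janino here) validates bytecode shape, not intent.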
[jira] [Comment Edited] (SPARK-28927) ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances
[ https://issues.apache.org/jira/browse/SPARK-28927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930658#comment-16930658 ]

Liang-Chi Hsieh edited comment on SPARK-28927 at 9/16/19 3:36 PM:
------------------------------------------------------------------

spark.sql.execution.sortBeforeRepartition was introduced after 2.4 to fix the
shuffle + repartition issue on DataFrames. Thus, because you are using 2.2.1,
the nondeterministic behavior can also be triggered when a repartition call
follows a shuffle in Spark 2.2.1. I noticed that you repartition after
reading data with a SQL query; maybe your sqltext contains a shuffle
operation. You can try checkpointing your training data before fitting the
ALS model to make the data deterministic.

was (Author: viirya):
Because you are using 2.2.1, spark.sql.execution.sortBeforeRepartition is
introduced to fix the shuffle + repartition issue on DataFrames after 2.4.
The nondeterministic behavior can be triggered when a repartition call
follows a shuffle. I noticed that you have repartition after you read data
using a sql. Maybe your sqltext has a shuffle operation. You can try to
checkpoint your training data, before fitting the ALS model, to make the
data deterministic.
[jira] [Comment Edited] (SPARK-28927) ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances
[ https://issues.apache.org/jira/browse/SPARK-28927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930658#comment-16930658 ] Liang-Chi Hsieh edited comment on SPARK-28927 at 9/16/19 3:35 PM: -- You are using 2.2.1, but spark.sql.execution.sortBeforeRepartition, which fixes the shuffle + repartition correctness issue on DataFrames, was only introduced in 2.4. The nondeterministic behavior can be triggered when a repartition call follows a shuffle. I noticed that you call repartition after reading data with SQL; maybe your SQL text contains a shuffle operation. You can try to checkpoint your training data before fitting the ALS model to make the data deterministic. was (Author: viirya): Because you are using 2.2.1, spark.sql.execution.sortBeforeRepartition is introduced to fix shuffle + repartition issue on Dataframe. The nondeterministic behavior can be triggered when a repartition call following a shuffle. I noticed that you have repartition after you read data using a sql. Maybe your sqltext has a shuffle operation. You can try to checkpoint your training data, before fitting ALS model, to make the data deterministic. 
[jira] [Commented] (SPARK-28927) ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances
[ https://issues.apache.org/jira/browse/SPARK-28927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930658#comment-16930658 ] Liang-Chi Hsieh commented on SPARK-28927: - You are using 2.2.1, but spark.sql.execution.sortBeforeRepartition, which fixes the shuffle + repartition correctness issue on DataFrames, was introduced in a later release. The nondeterministic behavior can be triggered when a repartition call follows a shuffle. I noticed that you call repartition after reading data with SQL; maybe your SQL text contains a shuffle operation. You can try to checkpoint your training data before fitting the ALS model to make the data deterministic.
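The checkpoint suggestion works because checkpointing materializes the training data, so a retried task rereads stable data instead of re-running a nondeterministic lineage (shuffle + repartition). A minimal pure-Python model of that effect (illustrative only; no Spark APIs, and all names below are made up):

```python
import itertools

# Simulate a lineage whose recomputation is nondeterministic: each
# re-evaluation (e.g. a retried task re-running a shuffle) sees
# different data.
_attempt = itertools.count()

def recompute_training_data():
    base = next(_attempt)  # differs on every recomputation
    return [(user, user + base) for user in range(5)]

# Two "task attempts" over the un-checkpointed lineage disagree:
attempt1 = recompute_training_data()
attempt2 = recompute_training_data()
assert attempt1 != attempt2

# "Checkpointing" materializes the data once; later reads are stable.
checkpointed = recompute_training_data()

def read_checkpoint():
    return list(checkpointed)

assert read_checkpoint() == read_checkpoint()
```

This is why checkpointing before fitting ALS makes the re-read input deterministic even if a stage is retried.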
[jira] [Assigned] (SPARK-28927) ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances
[ https://issues.apache.org/jira/browse/SPARK-28927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh reassigned SPARK-28927: --- Assignee: Liang-Chi Hsieh > ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets > with 12 billion instances > --- > > Key: SPARK-28927 > URL: https://issues.apache.org/jira/browse/SPARK-28927 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.1 >Reporter: Qiang Wang >Assignee: Liang-Chi Hsieh >Priority: Major > Attachments: image-2019-09-02-11-55-33-596.png > > > The stack trace is below: > {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 > BlockManager: Block rdd_10916_493 could not be removed as it was not found on > disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for > task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) > java.lang.ArrayIndexOutOfBoundsException: 6741 at > org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460) > at > org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at > org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at > 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) > at scala.collection.immutable.List.foreach(List.scala:381) at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) > at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at > org.apache.spark.scheduler.Task.run(Task.scala:108) at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at 
java.lang.Thread.run(Thread.java:745) > {quote} > This exception happened sometimes. We also found that the AUC metric was > not stable when evaluating the inner product of the user factors and the item > factors with the same dataset and configuration: AUC varied from 0.60 to 0.67, > which is not stable enough for a production environment. > Dataset capacity: ~12 billion ratings > Here is our code: > val trainData = predataUser.flatMap(x => x._1._2.map(y => (x._2.toInt, y._1, y._2.toFloat))) > .setName(trainDataName).persist(StorageLevel.MEMORY_AND_DISK_SER) > case class ALSData(user:Int, item:Int, rating:Float) extends Serializable > val ratingData = trainData.map(x => ALSData(x._1, x._2, x._3)).toDF() > val als = new ALS > val paramMap = ParamM
[jira] [Assigned] (SPARK-29042) Sampling-based RDD with unordered input should be INDETERMINATE
[ https://issues.apache.org/jira/browse/SPARK-29042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh reassigned SPARK-29042: --- Assignee: Liang-Chi Hsieh > Sampling-based RDD with unordered input should be INDETERMINATE > --- > > Key: SPARK-29042 > URL: https://issues.apache.org/jira/browse/SPARK-29042 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Major > Labels: correctness > > We have found and fixed the correctness issue when RDD output is > INDETERMINATE. One missing part is sampling-based RDDs. This kind of RDD is > order-sensitive to its input, so a sampling-based RDD with unordered input > should be INDETERMINATE. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
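The core claim is easy to demonstrate outside Spark: seeded per-element sampling fixes which *positions* are kept, so if the input arrives in a different order (as it can after an unordered shuffle), different *elements* end up in the sample. An illustrative Python sketch (not Spark's actual sampler):

```python
import random

def bernoulli_sample(items, seed=42, fraction=0.5):
    """Keep each element where the seeded RNG draws below `fraction`.

    The RNG stream is fixed by the seed, so the sampled positions are
    fixed -- but which elements occupy them depends on input order.
    """
    rng = random.Random(seed)
    return [x for x in items if rng.random() < fraction]

data = list(range(1, 11))
forward = bernoulli_sample(data)
reordered = bernoulli_sample(list(reversed(data)))

# Same sampler, same seed, same multiset of input -- different output.
# This is why such an RDD must be marked INDETERMINATE.
assert sorted(forward) != sorted(reordered)
```

The sample sizes agree (the kept positions are identical), but the sampled elements do not, so recomputing the RDD over a reordered input silently changes its contents.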
[jira] [Resolved] (SPARK-29042) Sampling-based RDD with unordered input should be INDETERMINATE
[ https://issues.apache.org/jira/browse/SPARK-29042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh resolved SPARK-29042. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25751 [https://github.com/apache/spark/pull/25751] > Sampling-based RDD with unordered input should be INDETERMINATE > --- > > Key: SPARK-29042 > URL: https://issues.apache.org/jira/browse/SPARK-29042 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Major > Labels: correctness > Fix For: 3.0.0 > > > We have found and fixed the correctness issue when RDD output is > INDETERMINATE. One missing part is sampling-based RDD. This kind of RDDs is > order sensitive to its input. A sampling-based RDD with unordered input, > should be INDETERMINATE. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26205) Optimize InSet expression for bytes, shorts, ints, dates
[ https://issues.apache.org/jira/browse/SPARK-26205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16929334#comment-16929334 ] Liang-Chi Hsieh commented on SPARK-26205: - [~cloud_fan] I ran a simple test and no failure seems to happen: {code} val inSet = InSet(Literal(0), Set.empty) checkEvaluation(inSet, false, row1) {code} I verified that it is code-generated with genCodeWithSwitch. > Optimize InSet expression for bytes, shorts, ints, dates > > > Key: SPARK-26205 > URL: https://issues.apache.org/jira/browse/SPARK-26205 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Anton Okolnychyi >Assignee: Anton Okolnychyi >Priority: Major > Fix For: 3.0.0 > > > {{In}} expressions are compiled into a sequence of if-else statements, which > results in O\(n\) time complexity. {{InSet}} is an optimized version of > {{In}}, which is supposed to improve the performance if the number of > elements is big enough. However, {{InSet}} actually degrades the performance > in many cases due to various reasons (benchmarks were created in SPARK-26203 > and solutions to the boxing problem are discussed in SPARK-26204). > The main idea of this JIRA is to use Java {{switch}} statements to > significantly improve the performance of {{InSet}} expressions for bytes, > shorts, ints, dates. All {{switch}} statements are compiled into > {{tableswitch}} and {{lookupswitch}} bytecode instructions. We will have > O\(1\) time complexity if our case values are compact and {{tableswitch}} can > be used. Otherwise, {{lookupswitch}} will give us O\(log n\). Our local > benchmarks show that this logic is more than two times faster even on 500+ > elements than using primitive collections in {{InSet}} expressions. As Spark > is using Scala {{HashSet}} right now, the performance gain will be even > bigger. 
> See > [here|https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-3.html#jvms-3.10] > and > [here|https://stackoverflow.com/questions/10287700/difference-between-jvms-lookupswitch-and-tableswitch] > for more information. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
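The tableswitch/lookupswitch distinction described above can be mimicked in plain Python: compact case values allow a direct table lookup (O(1)), while sparse values fall back to binary search (O(log n)). A sketch of the strategy choice (illustrative only; the hypothetical make_inset_predicate and its density heuristic are not Spark APIs):

```python
from bisect import bisect_left

def make_inset_predicate(values):
    """Return a membership predicate for a non-empty set of ints,
    choosing the strategy the way the JVM chooses between
    tableswitch and lookupswitch."""
    lo, hi = min(values), max(values)
    if hi - lo + 1 <= 2 * len(values):
        # Compact range: direct table lookup, the "tableswitch" analog.
        table = [False] * (hi - lo + 1)
        for v in values:
            table[v - lo] = True
        return lambda x: lo <= x <= hi and table[x - lo]
    # Sparse range: binary search, the "lookupswitch" analog.
    sorted_vals = sorted(values)
    def pred(x):
        i = bisect_left(sorted_vals, x)
        return i < len(sorted_vals) and sorted_vals[i] == x
    return pred

dense = make_inset_predicate({1, 2, 3, 5})          # uses the table
sparse = make_inset_predicate({1, 1000, 1_000_000})  # uses bisect
assert dense(3) and not dense(4)
assert sparse(1000) and not sparse(2)
```

The JVM applies a similar density heuristic when compiling a {{switch}}; the JIRA's point is that either bytecode form beats a hash-set probe with boxing.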
[jira] [Commented] (SPARK-26205) Optimize InSet expression for bytes, shorts, ints, dates
[ https://issues.apache.org/jira/browse/SPARK-26205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16928649#comment-16928649 ] Liang-Chi Hsieh commented on SPARK-26205: - Yeah, I will look at it. 
-- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29042) Sampling-based RDD with unordered input should be INDETERMINATE
Liang-Chi Hsieh created SPARK-29042: --- Summary: Sampling-based RDD with unordered input should be INDETERMINATE Key: SPARK-29042 URL: https://issues.apache.org/jira/browse/SPARK-29042 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Liang-Chi Hsieh We have found and fixed the correctness issue when RDD output is INDETERMINATE. One missing part is sampling-based RDDs. This kind of RDD is order-sensitive to its input, so a sampling-based RDD with unordered input should be INDETERMINATE. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28927) ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances
[ https://issues.apache.org/jira/browse/SPARK-28927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926814#comment-16926814 ] Liang-Chi Hsieh commented on SPARK-28927: - Hi [~JerryHouse], do you use any non-deterministic operations when preparing your training dataset, like sample, filtering based on random number, etc.?
[jira] [Assigned] (SPARK-23265) Update multi-column error handling logic in QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-23265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh reassigned SPARK-23265: --- Assignee: Huaxin Gao > Update multi-column error handling logic in QuantileDiscretizer > --- > > Key: SPARK-23265 > URL: https://issues.apache.org/jira/browse/SPARK-23265 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Nick Pentreath >Assignee: Huaxin Gao >Priority: Major > > SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If > both single- and multi-column params are set (specifically {{inputCol}} / > {{inputCols}}), an error is thrown. > However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. > The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that > for this transformer, it is acceptable to set the single-column param > {{numBuckets}} when transforming multiple columns, since it is then > applied to all columns. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23265) Update multi-column error handling logic in QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-23265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh resolved SPARK-23265. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 20442 [https://github.com/apache/spark/pull/20442] > Update multi-column error handling logic in QuantileDiscretizer > --- > > Key: SPARK-23265 > URL: https://issues.apache.org/jira/browse/SPARK-23265 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Nick Pentreath >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.0.0 > > > SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If > both single- and multi-column params are set (specifically {{inputCol}} / > {{inputCols}}), an error is thrown. > However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. > The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that > for this transformer, it is acceptable to set the single-column param > {{numBuckets}} when transforming multiple columns, since it is then > applied to all columns. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29013) Structurally equivalent subexpression elimination
Liang-Chi Hsieh created SPARK-29013: --- Summary: Structurally equivalent subexpression elimination Key: SPARK-29013 URL: https://issues.apache.org/jira/browse/SPARK-29013 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.0.0 Reporter: Liang-Chi Hsieh We do semantically-equivalent subexpression elimination in Spark SQL. However, for some expressions that are not semantically equivalent but are structurally equivalent, the current subexpression elimination generates too many similar functions. These functions share the same computation structure and differ only in the input slots of the current processing row. For such expressions, we can generate just one function and pass in the input slots at runtime. This reduces the length of the generated code and saves compilation time. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
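The proposal can be sketched as follows: instead of emitting one generated function per structurally identical expression (each differing only in which input slots it reads), emit a single function that receives the slot indices at runtime. An illustrative Python model (the real feature operates on Spark's generated Java code, not Python):

```python
# Two structurally equivalent expressions over a row:
#   (row[0] + row[1]) * 2   and   (row[2] + row[3]) * 2
# Naive codegen would emit one function body per expression;
# structural elimination emits one shared body parameterized by slots.

def shared_fn(row, slot_a, slot_b):
    # single generated body; the input slots are passed in at runtime
    return (row[slot_a] + row[slot_b]) * 2

row = [1, 2, 10, 20]
# The plan records only the slot bindings, not duplicated code text.
bindings = [(0, 1), (2, 3)]
results = [shared_fn(row, a, b) for a, b in bindings]
assert results == [6, 60]
```

With many such expressions, the generated code text shrinks from N near-identical function bodies to one body plus N small binding records, which is where the compilation-time saving comes from.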
[jira] [Updated] (SPARK-28933) Reduce unnecessary shuffle in ALS when initializing factors
[ https://issues.apache.org/jira/browse/SPARK-28933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-28933: Fix Version/s: 3.0.0 > Reduce unnecessary shuffle in ALS when initializing factors > --- > > Key: SPARK-28933 > URL: https://issues.apache.org/jira/browse/SPARK-28933 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 3.0.0 > > > When initializing factors in ALS, we should use {{mapPartitions}} instead of > the current {{map}}, so we can preserve the existing partitioning of the > {{InBlock}} RDD. The {{InBlock}} RDD is already partitioned by src block id, > and we don't change the partitioning when initializing factors. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
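The reasoning can be illustrated without a cluster: hash partitioning depends only on the key, so a transformation that rewrites values while leaving keys untouched cannot move any record to a different partition, and no shuffle is needed. A small Python model of that invariant (illustrative; in Spark this corresponds to using {{mapPartitions}} with preservesPartitioning set to true):

```python
NUM_PARTITIONS = 4

def partition_of(key):
    # stand-in for Spark's HashPartitioner: depends only on the key
    return hash(key) % NUM_PARTITIONS

records = [(block_id, block_id * 1.0) for block_id in range(20)]
before = {k: partition_of(k) for k, _ in records}

# A value-only transformation, like initializing ALS factors per block:
# keys are untouched, so the partitioner's answer cannot change.
transformed = [(k, [v, v + 1.0]) for k, v in records]
after = {k: partition_of(k) for k, _ in transformed}

assert before == after  # no record needs to move: no shuffle required
```

Plain {{map}} discards the partitioner because Spark cannot prove the keys were preserved; declaring the partitioning preserved lets the subsequent ALS stages skip an unnecessary shuffle.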
[jira] [Commented] (SPARK-28933) Reduce unnecessary shuffle in ALS when initializing factors
[ https://issues.apache.org/jira/browse/SPARK-28933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16920584#comment-16920584 ] Liang-Chi Hsieh commented on SPARK-28933: - This issue was resolved by [https://github.com/apache/spark/pull/25639]. > Reduce unnecessary shuffle in ALS when initializing factors > --- > > Key: SPARK-28933 > URL: https://issues.apache.org/jira/browse/SPARK-28933 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Major > > When Initializing factors in ALS, we should use {{mapPartitions}} instead of > current {{map}}, so we can preserve existing partition of the RDD of > {{InBlock}}. The RDD of {{InBlock}} is already partitioned by src block id. > We don't change the partition when initializing factors. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28933) Reduce unnecessary shuffle in ALS when initializing factors
[ https://issues.apache.org/jira/browse/SPARK-28933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh resolved SPARK-28933. - Resolution: Resolved > Reduce unnecessary shuffle in ALS when initializing factors > --- > > Key: SPARK-28933 > URL: https://issues.apache.org/jira/browse/SPARK-28933 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Major > > When Initializing factors in ALS, we should use {{mapPartitions}} instead of > current {{map}}, so we can preserve existing partition of the RDD of > {{InBlock}}. The RDD of {{InBlock}} is already partitioned by src block id. > We don't change the partition when initializing factors. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28935) Document SQL metrics for Details for Query Plan
[ https://issues.apache.org/jira/browse/SPARK-28935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16920550#comment-16920550 ] Liang-Chi Hsieh commented on SPARK-28935: - Thanks! [~smilegator] It should be helpful. > Document SQL metrics for Details for Query Plan > --- > > Key: SPARK-28935 > URL: https://issues.apache.org/jira/browse/SPARK-28935 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > > [https://github.com/apache/spark/pull/25349] shows the query plans but it > does not describe the meaning of each metric in the plan. For end users, they > might not understand the meaning of the metrics we output. > > !https://user-images.githubusercontent.com/7322292/62421634-9d9c4980-b6d7-11e9-8e31-1e6ba9b402e8.png! -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28927) ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances
[ https://issues.apache.org/jira/browse/SPARK-28927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16920480#comment-16920480 ] Liang-Chi Hsieh commented on SPARK-28927: - Does this only happen on 2.2.1? How about current master branch? > ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets > with 12 billion instances > --- > > Key: SPARK-28927 > URL: https://issues.apache.org/jira/browse/SPARK-28927 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.1 >Reporter: Qiang Wang >Priority: Major > > The stack trace is below: > {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 > BlockManager: Block rdd_10916_493 could not be removed as it was not found on > disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for > task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) > java.lang.ArrayIndexOutOfBoundsException: 6741 at > org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460) > at > org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at > org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at > 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) > at scala.collection.immutable.List.foreach(List.scala:381) at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) > at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at > org.apache.spark.scheduler.Task.run(Task.scala:108) at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at 
java.lang.Thread.run(Thread.java:745) > {quote} > This exception occurred intermittently. We also found that the AUC metric was > not stable when evaluating the inner product of the user factors and the item > factors with the same dataset and configuration. AUC varied from 0.60 to 0.67, > which is not stable enough for a production environment. > Dataset capacity: ~12 billion ratings > Here is our code: > val trainData = predataUser.flatMap(x => x._1._2.map(y => (x._2.toInt, y._1, > y._2.toFloat))) > .setName(trainDataName).persist(StorageLevel.MEMORY_AND_DISK_SER)case class > ALSData(user:Int, item:Int, rating:Float) extends Serializable > val ratingData = trainData.map(x => ALSData(x._1, x._2, x._3)).toDF() > val als = new ALS > val paramMap = ParamMap(als.alpha
[jira] [Updated] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name
[ https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-23519: Component/s: (was: Spark Core) > Create View Commands Fails with The view output (col1,col1) contains > duplicate column name > --- > > Key: SPARK-23519 > URL: https://issues.apache.org/jira/browse/SPARK-23519 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Franck Tago >Assignee: hemanth meka >Priority: Major > Fix For: 3.0.0 > > Attachments: image-2018-05-10-10-48-57-259.png > > > 1- create and populate a hive table . I did this in a hive cli session .[ > not that this matters ] > create table atable (col1 int) ; > insert into atable values (10 ) , (100) ; > 2. create a view from the table. > [These actions were performed from a spark shell ] > spark.sql("create view default.aview (int1 , int2 ) as select col1 , col1 > from atable ") > java.lang.AssertionError: assertion failed: The view output (col1,col1) > contains duplicate column name. 
> at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361) > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632) -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name
[ https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16920302#comment-16920302 ] Liang-Chi Hsieh commented on SPARK-23519: - This was closed and then reopened and fixed. The label [bulk-closed|https://issues.apache.org/jira/issues/?jql=labels+%3D+bulk-closed] does not look correct, so I removed it. Feel free to add it back if I misunderstood. > Create View Commands Fails with The view output (col1,col1) contains > duplicate column name > --- > > Key: SPARK-23519 > URL: https://issues.apache.org/jira/browse/SPARK-23519 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: Franck Tago >Assignee: hemanth meka >Priority: Major > Labels: bulk-closed > Fix For: 3.0.0 > > Attachments: image-2018-05-10-10-48-57-259.png > > > 1- create and populate a hive table . I did this in a hive cli session .[ > not that this matters ] > create table atable (col1 int) ; > insert into atable values (10 ) , (100) ; > 2. create a view from the table. > [These actions were performed from a spark shell ] > spark.sql("create view default.aview (int1 , int2 ) as select col1 , col1 > from atable ") > java.lang.AssertionError: assertion failed: The view output (col1,col1) > contains duplicate column name. 
> at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361) > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632) -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name
[ https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-23519: Labels: (was: bulk-closed) > Create View Commands Fails with The view output (col1,col1) contains > duplicate column name > --- > > Key: SPARK-23519 > URL: https://issues.apache.org/jira/browse/SPARK-23519 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: Franck Tago >Assignee: hemanth meka >Priority: Major > Fix For: 3.0.0 > > Attachments: image-2018-05-10-10-48-57-259.png > > > 1- create and populate a hive table . I did this in a hive cli session .[ > not that this matters ] > create table atable (col1 int) ; > insert into atable values (10 ) , (100) ; > 2. create a view from the table. > [These actions were performed from a spark shell ] > spark.sql("create view default.aview (int1 , int2 ) as select col1 , col1 > from atable ") > java.lang.AssertionError: assertion failed: The view output (col1,col1) > contains duplicate column name. 
> at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361) > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632) -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28935) Document SQL metrics for Details for Query Plan
[ https://issues.apache.org/jira/browse/SPARK-28935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16919975#comment-16919975 ] Liang-Chi Hsieh commented on SPARK-28935: - Thanks for pinging me! I will look into this. > Document SQL metrics for Details for Query Plan > --- > > Key: SPARK-28935 > URL: https://issues.apache.org/jira/browse/SPARK-28935 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > > [https://github.com/apache/spark/pull/25349] shows the query plans but it > does not describe the meaning of each metric in the plan. For end users, they > might not understand the meaning of the metrics we output. > > !https://user-images.githubusercontent.com/7322292/62421634-9d9c4980-b6d7-11e9-8e31-1e6ba9b402e8.png! -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28926) CLONE - ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances
[ https://issues.apache.org/jira/browse/SPARK-28926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh resolved SPARK-28926. - Resolution: Duplicate I think this is duplicate to SPARK-28927. > CLONE - ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for > datasets with 12 billion instances > > > Key: SPARK-28926 > URL: https://issues.apache.org/jira/browse/SPARK-28926 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.2.1 >Reporter: Qiang Wang >Assignee: Xiangrui Meng >Priority: Major > > The stack trace is below: > {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 > BlockManager: Block rdd_10916_493 could not be removed as it was not found on > disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for > task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) > java.lang.ArrayIndexOutOfBoundsException: 6741 at > org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460) > at > org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at > org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at > 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) > at scala.collection.immutable.List.foreach(List.scala:381) at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) > at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at > org.apache.spark.scheduler.Task.run(Task.scala:108) at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at 
java.lang.Thread.run(Thread.java:745) > {quote} > This exception occurred intermittently. We also found that the AUC metric was > not stable when evaluating the inner product of the user factors and the item > factors with the same dataset and configuration. AUC varied from 0.60 to 0.67, > which is not stable enough for a production environment. > Dataset capacity: ~12 billion ratings > Here is our code: > {code:java} > val hivedata = sc.sql(sqltext).select(id,dpid,score).coalesce(numPartitions) > val predataItem = hivedata.rdd.map(r=>(r._1._1,(r._1._2,r._2.sum))) > .groupByKey().zipWithIndex() > .persist(StorageLevel.MEMORY_AND_DISK_SER) > val predataUser = > predataItem.flatMap(r=>r._1._2.map(y=>(y._1,(r._2.toInt,y._2 > .aggregateByKey(zeroValueArr,numPart
[jira] [Assigned] (SPARK-28933) Reduce unnecessary shuffle in ALS when initializing factors
[ https://issues.apache.org/jira/browse/SPARK-28933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh reassigned SPARK-28933: --- Assignee: Liang-Chi Hsieh > Reduce unnecessary shuffle in ALS when initializing factors > --- > > Key: SPARK-28933 > URL: https://issues.apache.org/jira/browse/SPARK-28933 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Major > > When Initializing factors in ALS, we should use {{mapPartitions}} instead of > current {{map}}, so we can preserve existing partition of the RDD of > {{InBlock}}. The RDD of {{InBlock}} is already partitioned by src block id. > We don't change the partition when initializing factors. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28933) Reduce unnecessary shuffle in ALS when initializing factors
Liang-Chi Hsieh created SPARK-28933: --- Summary: Reduce unnecessary shuffle in ALS when initializing factors Key: SPARK-28933 URL: https://issues.apache.org/jira/browse/SPARK-28933 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.0.0 Reporter: Liang-Chi Hsieh When initializing factors in ALS, we should use {{mapPartitions}} instead of the current {{map}}, so we can preserve the existing partitioning of the {{InBlock}} RDD. The RDD of {{InBlock}} is already partitioned by src block id; we don't change the partitioning when initializing factors. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
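The idea in the description can be sketched with a small, self-contained Scala analogy (illustrative only: the partition layout, per-partition RNG seeding, and factor rank below are made up, and the real change would use {{RDD.mapPartitions}} with {{preservesPartitioning = true}} rather than plain Scala collections):

```scala
// "Partitions" of InBlock src ids, standing in for an RDD that is
// already partitioned by src block id.
val partitions: Vector[Vector[Int]] = Vector(Vector(1, 2), Vector(3, 4, 5))

// Initialize a factor vector per id, one RNG per partition, without
// ever moving an id out of its partition.
val factors: Vector[Vector[(Int, Array[Float])]] =
  partitions.zipWithIndex.map { case (part, pid) =>
    val rng = new scala.util.Random(pid)
    part.map(id => id -> Array.fill(2)(rng.nextFloat()))
  }

// Ids stay in the partitions they started in, so no shuffle would be
// needed downstream to co-locate factors with their blocks.
assert(factors.map(_.map(_._1)) == partitions)
```

Because the transformation only walks each partition's contents in place, Spark can be told the partitioner is unchanged, which is what avoids the extra shuffle.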
[jira] [Created] (SPARK-28920) Set up java version for github workflow
Liang-Chi Hsieh created SPARK-28920: --- Summary: Set up java version for github workflow Key: SPARK-28920 URL: https://issues.apache.org/jira/browse/SPARK-28920 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 3.0.0 Reporter: Liang-Chi Hsieh We added a Java version matrix to the GitHub workflow. Since we want to build with both JDK 8 and JDK 11, we should set up the Java version for mvn. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name
[ https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16915809#comment-16915809 ] Liang-Chi Hsieh commented on SPARK-23519: - I tested with Hive 2.1. It doesn't support duplicate column names: {code:java} hive> create view test_view (c1, c2) as select c1, c1 from test; FAILED: SemanticException [Error 10036]: Duplicate column name: c1 {code} [~tafra...@gmail.com] you said Hive supports it; do newer versions of Hive support this? > Create View Commands Fails with The view output (col1,col1) contains > duplicate column name > --- > > Key: SPARK-23519 > URL: https://issues.apache.org/jira/browse/SPARK-23519 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: Franck Tago >Priority: Major > Labels: bulk-closed > Attachments: image-2018-05-10-10-48-57-259.png > > > 1- create and populate a hive table . I did this in a hive cli session .[ > not that this matters ] > create table atable (col1 int) ; > insert into atable values (10 ) , (100) ; > 2. create a view from the table. > [These actions were performed from a spark shell ] > spark.sql("create view default.aview (int1 , int2 ) as select col1 , col1 > from atable ") > java.lang.AssertionError: assertion failed: The view output (col1,col1) > contains duplicate column name. 
> at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361) > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632) -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25549) High level API to collect RDD statistics
[ https://issues.apache.org/jira/browse/SPARK-25549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh resolved SPARK-25549. - Resolution: Won't Fix > High level API to collect RDD statistics > > > Key: SPARK-25549 > URL: https://issues.apache.org/jira/browse/SPARK-25549 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Priority: Major > > We have low level API SparkContext.submitMapStage used for collecting > statistics of RDD. However it is too low level and is not so easy to use. We > need a high level API for that. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25549) High level API to collect RDD statistics
[ https://issues.apache.org/jira/browse/SPARK-25549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16915362#comment-16915362 ] Liang-Chi Hsieh commented on SPARK-25549: - Close this as it is not needed now. > High level API to collect RDD statistics > > > Key: SPARK-25549 > URL: https://issues.apache.org/jira/browse/SPARK-25549 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Priority: Major > > We have low level API SparkContext.submitMapStage used for collecting > statistics of RDD. However it is too low level and is not so easy to use. We > need a high level API for that. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28866) Persist item factors RDD when checkpointing in ALS
Liang-Chi Hsieh created SPARK-28866: --- Summary: Persist item factors RDD when checkpointing in ALS Key: SPARK-28866 URL: https://issues.apache.org/jira/browse/SPARK-28866 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.0.0 Reporter: Liang-Chi Hsieh In the ALS ML implementation, if `implicitPrefs` is false, we checkpoint the RDD of item factors at intervals. Before checkpointing materializes the RDD, it is not persisted, which causes recomputation. In an experiment, there is a clear performance difference between persisting and not persisting before checkpointing the RDD. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
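The recomputation described above can be mimicked with a plain Scala lazy view (a hypothetical analogy, not the Spark API: an unpersisted RDD, like a lazy view, is recomputed on every materialization, so persisting before the checkpoint traversal saves one full recomputation):

```scala
var computeCount = 0
def expensive(x: Int): Int = { computeCount += 1; x * 2 }

// Unpersisted: lazy, so both the "checkpoint" pass and the later use
// recompute every element.
val unpersisted = (1 to 3).view.map(expensive)
unpersisted.sum // first materialization (checkpoint)
unpersisted.sum // second materialization (later training step)
assert(computeCount == 6)

// "Persisted" first: computed once, then reused by both passes.
computeCount = 0
val persisted = (1 to 3).map(expensive) // strict collection, built once
persisted.sum
persisted.sum
assert(computeCount == 3)
```

In Spark terms, the analogous fix would be a persist on the item-factors RDD before the checkpoint forces it, then relying on the cached blocks for subsequent iterations.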
[jira] [Commented] (SPARK-24666) Word2Vec generate infinity vectors when numIterations are large
[ https://issues.apache.org/jira/browse/SPARK-24666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16914950#comment-16914950 ] Liang-Chi Hsieh commented on SPARK-24666: - I tried running word2vec on the Quora Question Pairs dataset with max iterations set to 20, but couldn't reproduce this. > Word2Vec generate infinity vectors when numIterations are large > --- > > Key: SPARK-24666 > URL: https://issues.apache.org/jira/browse/SPARK-24666 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.3.1 > Environment: 2.0.X, 2.1.X, 2.2.X, 2.3.X >Reporter: ZhongYu >Priority: Critical > > We found that Word2Vec generates large-absolute-value vectors when > numIterations is large, and if numIterations is large enough (>20), the > vector values may be *infinity (or -infinity)*, resulting in useless > vectors. > In normal situations, vector values are mainly around -1.0~1.0 when > numIterations = 1. > The bug shows up on Spark 2.0.X, 2.1.X, 2.2.X, 2.3.X. > There is already an issue reporting this bug: > https://issues.apache.org/jira/browse/SPARK-5261 , but the fix seems > to be missing. > Other people's reports: > [https://stackoverflow.com/questions/49741956/infinity-vectors-in-spark-mllib-word2vec] > [http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-outputs-Infinity-Infinity-vectors-with-increasing-iterations-td29020.html] > > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name
[ https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913915#comment-16913915 ] Liang-Chi Hsieh commented on SPARK-23519: - Thanks for pinging me. I am going on a flight soon. If this is not urgent, I can look into it after today. > Create View Commands Fails with The view output (col1,col1) contains > duplicate column name > --- > > Key: SPARK-23519 > URL: https://issues.apache.org/jira/browse/SPARK-23519 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: Franck Tago >Priority: Major > Labels: bulk-closed > Attachments: image-2018-05-10-10-48-57-259.png > > > 1- create and populate a hive table . I did this in a hive cli session .[ > not that this matters ] > create table atable (col1 int) ; > insert into atable values (10 ) , (100) ; > 2. create a view from the table. > [These actions were performed from a spark shell ] > spark.sql("create view default.aview (int1 , int2 ) as select col1 , col1 > from atable ") > java.lang.AssertionError: assertion failed: The view output (col1,col1) > contains duplicate column name. 
> at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361) > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632) -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28672) [UDF] Duplicate function creation should not allow
[ https://issues.apache.org/jira/browse/SPARK-28672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16911007#comment-16911007 ] Liang-Chi Hsieh commented on SPARK-28672: - Is there any rule in Hive regarding this? like disallow duplicate permanent/temporary functions, or resolving temporary/permanent function first when duplicating? > [UDF] Duplicate function creation should not allow > --- > > Key: SPARK-28672 > URL: https://issues.apache.org/jira/browse/SPARK-28672 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > > {code} > 0: jdbc:hive2://10.18.18.214:23040/default> create function addm_3 AS > 'com.huawei.bigdata.hive.example.udf.multiply' using jar > 'hdfs://hacluster/user/Multiply.jar'; > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.084 seconds) > {code} > {code} > 0: jdbc:hive2://10.18.18.214:23040/default> create temporary function addm_3 > AS 'com.huawei.bigdata.hive.example.udf.multiply' using jar > 'hdfs://hacluster/user/Multiply.jar'; > INFO : converting to local hdfs://hacluster/user/Multiply.jar > INFO : Added > [/tmp/8a396308-41f8-4335-9de4-8268ce5c70fe_resources/Multiply.jar] to class > path > INFO : Added resources: [hdfs://hacluster/user/Multiply.jar] > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.134 seconds) > {code} > {code} > 0: jdbc:hive2://10.18.18.214:23040/default> show functions like addm_3; > +-+--+ > |function | > +-+--+ > | addm_3 | > | default.addm_3 | > +-+--+ > 2 rows selected (0.047 seconds) > {code} > When show function executed it is listing both the function but what about > the db for permanent function when user has not specified. > Duplicate should not be allowed if user creating temporary one with the same > name. 
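The resolution-order question raised in the comment above can be sketched with a toy function catalog in which, under one possible rule, a temporary function shadows the database-qualified permanent one. This is an illustrative model only; Spark's and Hive's real catalogs are much richer, and all names below are hypothetical:

```python
class ToyFunctionCatalog:
    """Toy model of temporary-vs-permanent function resolution.

    Illustrative only: this shows one possible rule, "temporary shadows
    permanent", not Spark's actual FunctionRegistry behavior.
    """

    def __init__(self):
        self.permanent = {}   # qualified name -> impl, e.g. "default.addm_3"
        self.temporary = {}   # bare name -> impl

    def create_function(self, name, impl, temporary=False, db="default"):
        if temporary:
            self.temporary[name] = impl
        else:
            self.permanent[f"{db}.{name}"] = impl

    def lookup(self, name, db="default"):
        # Resolve temporary functions first, then fall back to the
        # database-qualified permanent function.
        if name in self.temporary:
            return self.temporary[name]
        return self.permanent[f"{db}.{name}"]


catalog = ToyFunctionCatalog()
catalog.create_function("addm_3", lambda x, y: x * y)                    # permanent
catalog.create_function("addm_3", lambda x, y: x * y * 2, temporary=True)

# Under the "temporary first" rule, the temporary function shadows
# default.addm_3: 3 * 4 * 2 = 24.
print(catalog.lookup("addm_3")(3, 4))  # 24
```

Whether duplicates should be rejected at creation time, or tolerated and disambiguated at lookup time as above, is exactly the policy question asked in the comment.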
[jira] [Commented] (SPARK-28761) spark.driver.maxResultSize only applies to compressed data
[ https://issues.apache.org/jira/browse/SPARK-28761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16909420#comment-16909420 ] Liang-Chi Hsieh commented on SPARK-28761: - If you do it at SparkPlan.scala#L344, isn't it just for SQL? {{spark.driver.maxResultSize}} covers RDD, right? > spark.driver.maxResultSize only applies to compressed data > -- > > Key: SPARK-28761 > URL: https://issues.apache.org/jira/browse/SPARK-28761 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: David Vogelbacher >Priority: Major > > Spark has a setting {{spark.driver.maxResultSize}}, see > https://spark.apache.org/docs/latest/configuration.html#application-properties > : > {noformat} > Limit of total size of serialized results of all partitions for each Spark > action (e.g. collect) in bytes. Should be at least 1M, or 0 for unlimited. > Jobs will be aborted if the total size is above this limit. Having a high > limit may cause out-of-memory errors in driver (depends on > spark.driver.memory and memory overhead of objects in JVM). > Setting a proper limit can protect the driver from out-of-memory errors. > {noformat} > This setting can be very useful in constraining the memory that the spark > driver needs for a specific spark action. However, this limit is checked > before decompressing data in > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L662 > Even if the compressed data is below the limit the uncompressed data can > still be far above. In order to protect the driver we should also impose a > limit on the uncompressed data. We could do this in > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L344 > I propose adding a new config option > {{spark.driver.maxUncompressedResultSize}}. 
> A simple repro of this with spark shell: > {noformat} > > printf 'a%.0s' {1..10} > test.csv # create a 100 MB file > > ./bin/spark-shell --conf "spark.driver.maxResultSize=1" > scala> val df = spark.read.format("csv").load("/Users/dvogelbacher/test.csv") > df: org.apache.spark.sql.DataFrame = [_c0: string] > scala> val results = df.collect() > results: Array[org.apache.spark.sql.Row] = > Array([a... > scala> results(0).getString(0).size > res0: Int = 10 > {noformat} > Even though we set maxResultSize to 10 MB, we collect a result that is 100MB > uncompressed. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
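The gap between compressed and uncompressed sizes that the issue describes is easy to reproduce outside Spark: highly repetitive data like the all-'a' CSV in the repro compresses by orders of magnitude, so a limit enforced on serialized bytes says little about driver memory. A standalone Python sketch (not Spark's actual TaskSetManager check; the limit value is made up):

```python
import zlib

# 10 MB of the repetitive data from the repro above ('a' repeated).
uncompressed = b"a" * (10 * 1024 * 1024)
compressed = zlib.compress(uncompressed)

# Pretend limit, analogous in spirit to spark.driver.maxResultSize.
max_result_size = 64 * 1024

# The kind of check done on serialized (compressed) task results: it passes
# even though the decompressed payload is far larger than the limit.
passes_serialized_check = len(compressed) <= max_result_size

print(len(uncompressed), len(compressed), passes_serialized_check)
```

Repetitive input is the worst case; for incompressible data the two sizes are close and the existing check is a reasonable proxy, which is why the proposal adds a second, uncompressed-size limit rather than replacing the first.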
[jira] [Commented] (SPARK-28732) org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java' when storing
[ https://issues.apache.org/jira/browse/SPARK-28732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16909409#comment-16909409 ] Liang-Chi Hsieh commented on SPARK-28732: - As {{count}}'s return type is LongType, I think it is reasonable that it can't fit into an Int column. The problem here might be that the error is not friendly. Normally, when we map a dataset to a specified type and the types are incompatible, an exception like this should be thrown: {code} org.apache.spark.sql.AnalysisException: Cannot up cast `b` from bigint to int. The type path of the target object is: - field (class: "scala.Int", name: "b") - root class: "Test" You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object; at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:2801) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$32$$anonfun$applyOrElse$143.applyOrElse(Analyzer.scala:2821) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$32$$anonfun$applyOrElse$143.applyOrElse(Analyzer.scala:2812) {code} > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - failed to > compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java' when storing the result of a count aggregation in an integer > --- > > Key: SPARK-28732 > URL: https://issues.apache.org/jira/browse/SPARK-28732 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0 >Reporter: Alix Métivier >Priority: Major > > I am using the agg function on a dataset, and I want to count the number of lines > upon grouping columns. 
I would like to store the result of this count in an > integer, but it fails with this output : > {code} > [ERROR]: org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - > failed to compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 89, Column 53: No applicable constructor/method found > for actual parameters "long"; candidates are: "java.lang.Integer(int)", > "java.lang.Integer(java.lang.String)" > Here is the line 89 and a few others to understand : > /* 085 */ long value13 = i.getLong(5); > /* 086 */ argValue4 = value13; > /* 087 */ > /* 088 */ > /* 089 */ final java.lang.Integer value12 = false ? null : new > java.lang.Integer(argValue4); > {code} > > As per Integer documentation, there is not constructor for the type Long, so > this is why the generated code fails. > Here is my code : > {code} > org.apache.spark.sql.Dataset ds_row2 = > ds_conntAggregateRow_1_Out_1 > .groupBy(org.apache.spark.sql.functions.col("n_name").as("n_nameN"), > org.apache.spark.sql.functions.col("o_year").as("o_yearN")) > .agg(org.apache.spark.sql.functions.count("n_name").as("countN"), > .as(org.apache.spark.sql.Encoders.bean(row2Struct.class)); > {code} > row2Struct class is composed of n_nameN: String, o_yearN: String, countN: Int > If countN is a Long, code above wont fail > If it is an Int, it works in 1.6 and 2.0, but fails on version 2.1+ > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
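The analyzer behavior the commenter expects — refusing to silently narrow a bigint (the LongType result of count) into an int field — can be modeled as a simple "safe up-cast" check over a numeric type-precedence list. This is an illustrative sketch, not Spark's actual ResolveUpCast rule; the type names merely echo Spark SQL's:

```python
# Numeric type-widening precedence, narrowest to widest (simplified from
# Spark SQL's numeric types; illustrative only).
NUMERIC_PRECEDENCE = ["ByteType", "ShortType", "IntegerType", "LongType",
                      "FloatType", "DoubleType"]


def can_up_cast(from_type: str, to_type: str) -> bool:
    """True if a value can be widened without loss, e.g. int -> bigint."""
    return NUMERIC_PRECEDENCE.index(from_type) <= NUMERIC_PRECEDENCE.index(to_type)


def check_field(from_type: str, to_type: str, field: str) -> None:
    # Fail fast at analysis time with a readable message, instead of
    # letting generated code blow up at compile time.
    if not can_up_cast(from_type, to_type):
        raise TypeError(
            f"Cannot up cast `{field}` from {from_type} to {to_type}. "
            "Add an explicit cast or use a wider field type.")


check_field("IntegerType", "LongType", "countN")      # fine: widening
try:
    check_field("LongType", "IntegerType", "countN")  # count() returns LongType
except TypeError as e:
    print(e)
```

With such a check, the user's `countN: Int` bean field would produce an analysis error pointing at the field, rather than the codegen `CompileException` reported here.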
[jira] [Comment Edited] (SPARK-28732) org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java' when st
[ https://issues.apache.org/jira/browse/SPARK-28732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16909409#comment-16909409 ] Liang-Chi Hsieh edited comment on SPARK-28732 at 8/16/19 9:19 PM: -- As {{count}} return type is LongType, I think it is reasonable that it can't be fit into an Int column. The problem here might be the error is not friendly. Normally, if we want to map dataset to specified type, an exception like this should be thrown, if it is incompatible: {code} org.apache.spark.sql.AnalysisException: Cannot up cast `b` from bigint to int. The type path of the target object is: - field (class: "scala.Int", name: "b") - root class: "Test" You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object; at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:2801) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$32$$anonfun$applyOrElse$143.applyOrElse(Analyzer.scala:2821) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$32$$anonfun$applyOrElse$143.applyOrElse(Analyzer.scala:2812) {code} was (Author: viirya): As {{count}} return type is LongType, I think it is reasonable that it can't be fit into an Int column. The problem here might be the error is not friendly. 
Normally, if we want to map dataset to specified type, an exception like this should be thrown, if it is incompatible: {code} You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object; at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:2801) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$32$$anonfun$applyOrElse$143.applyOrElse(Analyzer.scala:2821) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$32$$anonfun$applyOrElse$143.applyOrElse(Analyzer.scala:2812) {code} > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - failed to > compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java' when storing the result of a count aggregation in an integer > --- > > Key: SPARK-28732 > URL: https://issues.apache.org/jira/browse/SPARK-28732 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0 >Reporter: Alix Métivier >Priority: Major > > I am using agg function on a dataset, and i want to count the number of lines > upon grouping columns. I would like to store the result of this count in an > integer, but it fails with this output : > {code} > [ERROR]: org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - > failed to compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 89, Column 53: No applicable constructor/method found > for actual parameters "long"; candidates are: "java.lang.Integer(int)", > "java.lang.Integer(java.lang.String)" > Here is the line 89 and a few others to understand : > /* 085 */ long value13 = i.getLong(5); > /* 086 */ argValue4 = value13; > /* 087 */ > /* 088 */ > /* 089 */ final java.lang.Integer value12 = false ? 
null : new > java.lang.Integer(argValue4); > {code} > > As per Integer documentation, there is not constructor for the type Long, so > this is why the generated code fails. > Here is my code : > {code} > org.apache.spark.s
[jira] [Created] (SPARK-28722) Change sequential label sorting in StringIndexer fit to parallel
Liang-Chi Hsieh created SPARK-28722: --- Summary: Change sequential label sorting in StringIndexer fit to parallel Key: SPARK-28722 URL: https://issues.apache.org/jira/browse/SPARK-28722 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.0.0 Reporter: Liang-Chi Hsieh The fit method in StringIndexer sorts the given labels sequentially when there are multiple input columns. As the number of input columns increases, the label-sorting time increases dramatically, making it hard to use in practice with hundreds of input columns.
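The proposed improvement can be sketched as replacing a sequential per-column sort with a pool that sorts each input column's labels concurrently. This is a minimal model of the idea in plain Python, not the actual StringIndexer patch (which is Scala and uses Spark's thread utilities):

```python
from concurrent.futures import ThreadPoolExecutor


def sort_labels_sequential(label_lists):
    # Current behavior: sort one input column's labels at a time.
    return [sorted(labels) for labels in label_lists]


def sort_labels_parallel(label_lists, max_workers=4):
    # Proposed behavior: sort each input column's labels concurrently.
    # With hundreds of input columns, the per-column sorts are independent,
    # so they parallelize cleanly.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(sorted, label_lists))


columns = [["b", "a", "c"], ["two", "one"], ["z", "x", "y"]]
# Both strategies must produce identical results; only wall-clock time differs.
assert sort_labels_parallel(columns) == sort_labels_sequential(columns)
```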
[jira] [Updated] (SPARK-28652) spark.kubernetes.pyspark.pythonVersion is never passed to executors
[ https://issues.apache.org/jira/browse/SPARK-28652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-28652: Priority: Minor (was: Major) > spark.kubernetes.pyspark.pythonVersion is never passed to executors > --- > > Key: SPARK-28652 > URL: https://issues.apache.org/jira/browse/SPARK-28652 > Project: Spark > Issue Type: Test > Components: Kubernetes >Affects Versions: 2.4.3 >Reporter: nanav yorbiz >Priority: Minor > > I suppose this may not be a priority with Python2 on its way out, but given > that this setting is only ever sent to the driver and not the executors, no > actual work can be performed when the versions don't match, which will tend > to be *always* with the default setting for the driver being changed from 2 > to 3, and the executors using `python`, which defaults to v2, by default. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28652) spark.kubernetes.pyspark.pythonVersion is never passed to executors
[ https://issues.apache.org/jira/browse/SPARK-28652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-28652: Issue Type: Test (was: Bug) > spark.kubernetes.pyspark.pythonVersion is never passed to executors > --- > > Key: SPARK-28652 > URL: https://issues.apache.org/jira/browse/SPARK-28652 > Project: Spark > Issue Type: Test > Components: Kubernetes >Affects Versions: 2.4.3 >Reporter: nanav yorbiz >Priority: Major > > I suppose this may not be a priority with Python2 on its way out, but given > that this setting is only ever sent to the driver and not the executors, no > actual work can be performed when the versions don't match, which will tend > to be *always* with the default setting for the driver being changed from 2 > to 3, and the executors using `python`, which defaults to v2, by default. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28652) spark.kubernetes.pyspark.pythonVersion is never passed to executors
[ https://issues.apache.org/jira/browse/SPARK-28652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16904743#comment-16904743 ] Liang-Chi Hsieh commented on SPARK-28652: - As the existing tests don't explicitly check the Python version on the executor side, I patched an existing test to do so. > spark.kubernetes.pyspark.pythonVersion is never passed to executors > --- > > Key: SPARK-28652 > URL: https://issues.apache.org/jira/browse/SPARK-28652 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.3 >Reporter: nanav yorbiz >Priority: Major > > I suppose this may not be a priority with Python2 on its way out, but given > that this setting is only ever sent to the driver and not the executors, no > actual work can be performed when the versions don't match, which will tend > to be *always* with the default setting for the driver being changed from 2 > to 3, and the executors using `python`, which defaults to v2, by default.
[jira] [Commented] (SPARK-28652) spark.kubernetes.pyspark.pythonVersion is never passed to executors
[ https://issues.apache.org/jira/browse/SPARK-28652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16904729#comment-16904729 ] Liang-Chi Hsieh commented on SPARK-28652: - This looks interesting to me. I tried to look into the existing tests. It is true that {{spark.kubernetes.pyspark.pythonVersion}} isn't passed to executors, but this looks correct and I think we don't need to pass it. The Python version used by executors comes from the Python side at the driver, when a Python function is wrapped. PythonRunner later serializes this variable when it is going to invoke Python workers, and PythonWorkerFactory uses it to determine which Python executable to run. So on executors, which Python executable to run is not determined by PYSPARK_PYTHON. This means we don't need to pass spark.kubernetes.pyspark.pythonVersion to executors, as this config is only used to choose PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON. cc [~hyukjin.kwon] too, in case I missed something. > spark.kubernetes.pyspark.pythonVersion is never passed to executors > --- > > Key: SPARK-28652 > URL: https://issues.apache.org/jira/browse/SPARK-28652 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.3 >Reporter: nanav yorbiz >Priority: Major > > I suppose this may not be a priority with Python2 on its way out, but given > that this setting is only ever sent to the driver and not the executors, no > actual work can be performed when the versions don't match, which will tend > to be *always* with the default setting for the driver being changed from 2 > to 3, and the executors using `python`, which defaults to v2, by default.
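The mechanism described in the comment above — the executor's Python version being fixed on the driver when the function is wrapped and shipped alongside the serialized payload, rather than read from the executor's PYSPARK_PYTHON — can be mimicked in plain Python with pickle. This is a toy model; the real plumbing lives in PythonRunner and PythonWorkerFactory, and the dict keys below are illustrative:

```python
import pickle
import sys


def add_one(x):
    # Stand-in for a user-defined Python function wrapped at the driver.
    return x + 1


def wrap_python_function(func):
    # At "driver" time, capture which Python executable/version should run
    # this function, alongside the serialized function itself.
    return {
        "pythonExec": sys.executable,
        "pythonVer": "%d.%d" % sys.version_info[:2],
        "command": pickle.dumps(func),
    }


# What would be shipped to executors: function plus interpreter metadata.
payload = pickle.dumps(wrap_python_function(add_one))

# On the "executor" side, the worker factory reads the shipped metadata
# instead of consulting PYSPARK_PYTHON in its own environment.
received = pickle.loads(payload)
assert pickle.loads(received["command"])(41) == 42
```

Because the metadata travels with the function, driver and executors agree on the interpreter even when their environment variables differ, which is why the K8s config never needs to reach executors.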
[jira] [Commented] (SPARK-28422) GROUPED_AGG pandas_udf doesn't with spark.sql() without group by clause
[ https://issues.apache.org/jira/browse/SPARK-28422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899788#comment-16899788 ] Liang-Chi Hsieh commented on SPARK-28422: - Thanks [~dongjoon]! > GROUPED_AGG pandas_udf doesn't with spark.sql() without group by clause > --- > > Key: SPARK-28422 > URL: https://issues.apache.org/jira/browse/SPARK-28422 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3 >Reporter: Li Jin >Priority: Major > > > {code:python} > from pyspark.sql.functions import pandas_udf, PandasUDFType > @pandas_udf('double', PandasUDFType.GROUPED_AGG) > def max_udf(v): > return v.max() > df = spark.range(0, 100) > spark.udf.register('max_udf', max_udf) > df.createTempView('table') > # A. This works > df.agg(max_udf(df['id'])).show() > # B. This doesn't work > spark.sql("select max_udf(id) from table").show(){code} > > > Query plan: > A: > {code:java} > == Parsed Logical Plan == > 'Aggregate [max_udf('id) AS max_udf(id)#140] > +- Range (0, 1000, step=1, splits=Some(4)) > == Analyzed Logical Plan == > max_udf(id): double > Aggregate [max_udf(id#64L) AS max_udf(id)#140] > +- Range (0, 1000, step=1, splits=Some(4)) > == Optimized Logical Plan == > Aggregate [max_udf(id#64L) AS max_udf(id)#140] > +- Range (0, 1000, step=1, splits=Some(4)) > == Physical Plan == > !AggregateInPandas [max_udf(id#64L)], [max_udf(id)#138 AS max_udf(id)#140] > +- Exchange SinglePartition > +- *(1) Range (0, 1000, step=1, splits=4) > {code} > B: > {code:java} > == Parsed Logical Plan == > 'Project [unresolvedalias('max_udf('id), None)] > +- 'UnresolvedRelation [table] > == Analyzed Logical Plan == > max_udf(id): double > Project [max_udf(id#0L) AS max_udf(id)#136] > +- SubqueryAlias `table` > +- Range (0, 100, step=1, splits=Some(4)) > == Optimized Logical Plan == > Project [max_udf(id#0L) AS max_udf(id)#136] > +- Range (0, 100, step=1, splits=Some(4)) > == Physical Plan == > *(1) Project 
[max_udf(id#0L) AS max_udf(id)#136] > +- *(1) Range (0, 100, step=1, splits=4) > {code} > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24152) SparkR CRAN feasibility check server problem
[ https://issues.apache.org/jira/browse/SPARK-24152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16889742#comment-16889742 ] Liang-Chi Hsieh commented on SPARK-24152: - Ok. I think it was fixed. > SparkR CRAN feasibility check server problem > > > Key: SPARK-24152 > URL: https://issues.apache.org/jira/browse/SPARK-24152 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Liang-Chi Hsieh >Priority: Critical > > PR builder and master branch test fails with the following SparkR error with > unknown reason. The following is an error message from that. > {code} > * this is package 'SparkR' version '2.4.0' > * checking CRAN incoming feasibility ...Error in > .check_package_CRAN_incoming(pkgdir) : > dims [product 24] do not match the length of object [0] > Execution halted > {code} > *PR BUILDER* > - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90039/ > - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89983/ > - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89998/ > *MASTER BRANCH* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4458/ > (Fail with no failures) > This is critical because we already start to merge the PR by ignoring this > **known unkonwn** SparkR failure. > - https://github.com/apache/spark/pull/21175 -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24152) SparkR CRAN feasibility check server problem
[ https://issues.apache.org/jira/browse/SPARK-24152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16889671#comment-16889671 ] Liang-Chi Hsieh commented on SPARK-24152: - This CRAN issue is happening now, again. Emailed to CRAN admins for help. Will update after they reply. > SparkR CRAN feasibility check server problem > > > Key: SPARK-24152 > URL: https://issues.apache.org/jira/browse/SPARK-24152 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Liang-Chi Hsieh >Priority: Critical > > PR builder and master branch test fails with the following SparkR error with > unknown reason. The following is an error message from that. > {code} > * this is package 'SparkR' version '2.4.0' > * checking CRAN incoming feasibility ...Error in > .check_package_CRAN_incoming(pkgdir) : > dims [product 24] do not match the length of object [0] > Execution halted > {code} > *PR BUILDER* > - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90039/ > - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89983/ > - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89998/ > *MASTER BRANCH* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4458/ > (Fail with no failures) > This is critical because we already start to merge the PR by ignoring this > **known unkonwn** SparkR failure. > - https://github.com/apache/spark/pull/21175 -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28441) PythonUDF used in correlated scalar subquery causes
[ https://issues.apache.org/jira/browse/SPARK-28441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-28441: Summary: PythonUDF used in correlated scalar subquery causes (was: udf(max(udf(column))) throws java.lang.UnsupportedOperationException: Cannot evaluate expression: udf(null)) > PythonUDF used in correlated scalar subquery causes > > > Key: SPARK-28441 > URL: https://issues.apache.org/jira/browse/SPARK-28441 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Priority: Minor > > I found this when doing https://issues.apache.org/jira/browse/SPARK-28277 > > {code:java} > >>> @pandas_udf("string", PandasUDFType.SCALAR) > ... def noop(x): > ... return x.apply(str) > ... > >>> spark.udf.register("udf", noop) > > >>> spark.sql("CREATE OR REPLACE TEMPORARY VIEW t1 as select * from values > >>> (\"one\", 1), (\"two\", 2),(\"three\", 3),(\"one\", NULL) as t1(k, v)") > DataFrame[] > >>> spark.sql("CREATE OR REPLACE TEMPORARY VIEW t2 as select * from values > >>> (\"one\", 1), (\"two\", 22),(\"one\", 5),(\"one\", NULL), (NULL, 5) as > >>> t2(k, v)") > DataFrame[] > >>> spark.sql("SELECT t1.k FROM t1 WHERE t1.v <= (SELECT > >>> udf(max(udf(t2.v))) FROM t2 WHERE udf(t2.k) = udf(t1.k))").show() > py4j.protocol.Py4JJavaError: An error occurred while calling o65.showString. > : java.lang.UnsupportedOperationException: Cannot evaluate expression: > udf(null) > at > org.apache.spark.sql.catalyst.expressions.Unevaluable.eval(Expression.scala:296) > at > org.apache.spark.sql.catalyst.expressions.Unevaluable.eval$(Expression.scala:295) > at > org.apache.spark.sql.catalyst.expressions.PythonUDF.eval(PythonUDF.scala:52) > {code} > > > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28441) PythonUDF used in correlated scalar subquery causes UnsupportedOperationException
[ https://issues.apache.org/jira/browse/SPARK-28441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-28441: Summary: PythonUDF used in correlated scalar subquery causes UnsupportedOperationException (was: PythonUDF used in correlated scalar subquery causes ) > PythonUDF used in correlated scalar subquery causes > UnsupportedOperationException > -- > > Key: SPARK-28441 > URL: https://issues.apache.org/jira/browse/SPARK-28441 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Priority: Minor > > I found this when doing https://issues.apache.org/jira/browse/SPARK-28277 > > {code:java} > >>> @pandas_udf("string", PandasUDFType.SCALAR) > ... def noop(x): > ... return x.apply(str) > ... > >>> spark.udf.register("udf", noop) > > >>> spark.sql("CREATE OR REPLACE TEMPORARY VIEW t1 as select * from values > >>> (\"one\", 1), (\"two\", 2),(\"three\", 3),(\"one\", NULL) as t1(k, v)") > DataFrame[] > >>> spark.sql("CREATE OR REPLACE TEMPORARY VIEW t2 as select * from values > >>> (\"one\", 1), (\"two\", 22),(\"one\", 5),(\"one\", NULL), (NULL, 5) as > >>> t2(k, v)") > DataFrame[] > >>> spark.sql("SELECT t1.k FROM t1 WHERE t1.v <= (SELECT > >>> udf(max(udf(t2.v))) FROM t2 WHERE udf(t2.k) = udf(t1.k))").show() > py4j.protocol.Py4JJavaError: An error occurred while calling o65.showString. > : java.lang.UnsupportedOperationException: Cannot evaluate expression: > udf(null) > at > org.apache.spark.sql.catalyst.expressions.Unevaluable.eval(Expression.scala:296) > at > org.apache.spark.sql.catalyst.expressions.Unevaluable.eval$(Expression.scala:295) > at > org.apache.spark.sql.catalyst.expressions.PythonUDF.eval(PythonUDF.scala:52) > {code} > > > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28441) PythonUDF used in correlated scalar subquery causes UnsupportedOperationException
[ https://issues.apache.org/jira/browse/SPARK-28441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-28441: Priority: Major (was: Minor) > PythonUDF used in correlated scalar subquery causes > UnsupportedOperationException > -- > > Key: SPARK-28441 > URL: https://issues.apache.org/jira/browse/SPARK-28441 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Priority: Major > > I found this when doing https://issues.apache.org/jira/browse/SPARK-28277 > > {code:java} > >>> @pandas_udf("string", PandasUDFType.SCALAR) > ... def noop(x): > ... return x.apply(str) > ... > >>> spark.udf.register("udf", noop) > > >>> spark.sql("CREATE OR REPLACE TEMPORARY VIEW t1 as select * from values > >>> (\"one\", 1), (\"two\", 2),(\"three\", 3),(\"one\", NULL) as t1(k, v)") > DataFrame[] > >>> spark.sql("CREATE OR REPLACE TEMPORARY VIEW t2 as select * from values > >>> (\"one\", 1), (\"two\", 22),(\"one\", 5),(\"one\", NULL), (NULL, 5) as > >>> t2(k, v)") > DataFrame[] > >>> spark.sql("SELECT t1.k FROM t1 WHERE t1.v <= (SELECT > >>> udf(max(udf(t2.v))) FROM t2 WHERE udf(t2.k) = udf(t1.k))").show() > py4j.protocol.Py4JJavaError: An error occurred while calling o65.showString. > : java.lang.UnsupportedOperationException: Cannot evaluate expression: > udf(null) > at > org.apache.spark.sql.catalyst.expressions.Unevaluable.eval(Expression.scala:296) > at > org.apache.spark.sql.catalyst.expressions.Unevaluable.eval$(Expression.scala:295) > at > org.apache.spark.sql.catalyst.expressions.PythonUDF.eval(PythonUDF.scala:52) > {code} > > > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28288) Convert and port 'window.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16887707#comment-16887707 ] Liang-Chi Hsieh commented on SPARK-28288: - Those errors can be found in original window.sql. Seems fine. > Convert and port 'window.sql' into UDF test base > > > Key: SPARK-28288 > URL: https://issues.apache.org/jira/browse/SPARK-28288 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28365) Fallback locale to en_US in StopWordsRemover if system default locale isn't in available locales in JVM
[ https://issues.apache.org/jira/browse/SPARK-28365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-28365: Summary: Fallback locale to en_US in StopWordsRemover if system default locale isn't in available locales in JVM (was: Set default locale param for StopWordsRemover to en_US if system default locale isn't in available locales in JVM) > Fallback locale to en_US in StopWordsRemover if system default locale isn't > in available locales in JVM > --- > > Key: SPARK-28365 > URL: https://issues.apache.org/jira/browse/SPARK-28365 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Priority: Major > > Because the local default locale isn't in available locales at {{Locale}}, > when I did some tests locally with python code, {{StopWordsRemover}} related > python test hits some errors, like: > {code} > Traceback (most recent call last): > File "/spark-1/python/pyspark/ml/tests/test_feature.py", line 87, in > test_stopwordsremover > stopWordRemover = StopWordsRemover(inputCol="input", outputCol="output") > File "/spark-1/python/pyspark/__init__.py", line 111, in wrapper > return func(self, **kwargs) > File "/spark-1/python/pyspark/ml/feature.py", line 2646, in __init__ > self.uid) > File "/spark-1/python/pyspark/ml/wrapper.py", line 67, in _new_java_obj > return java_obj(*java_args) > File /spark-1/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line > 1554, in __call__ > answer, self._gateway_client, None, self._fqn) > File "/spark-1/python/pyspark/sql/utils.py", line 93, in deco > raise converted > pyspark.sql.utils.IllegalArgumentException: 'StopWordsRemover_4598673ee802 > parameter locale given invalid value en_TW.' > {code} > As per [~hyukjin.kwon]'s advice, instead of setting up locale to pass test, > it is better to have a workable locale if system default locale can't be > found in available locales in JVM. 
Otherwise, users have to manually change > the system locale or access a private property _jvm in PySpark.
[jira] [Updated] (SPARK-28365) Set default locale param for StopWordsRemover to en_US if system default locale isn't in available locales in JVM
[ https://issues.apache.org/jira/browse/SPARK-28365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-28365: Summary: Set default locale param for StopWordsRemover to en_US if system default locale isn't in available locales in JVM (was: Set default locale for StopWordsRemover tests to prevent invalid locale error during test) > Set default locale param for StopWordsRemover to en_US if system default > locale isn't in available locales in JVM > - > > Key: SPARK-28365 > URL: https://issues.apache.org/jira/browse/SPARK-28365 > Project: Spark > Issue Type: Test > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Priority: Minor > > Because the local default locale isn't in available locales at {{Locale}}, > when I did some tests locally with python code, {{StopWordsRemover}} related > python test hits some errors, like: > {code} > Traceback (most recent call last): > File "/spark-1/python/pyspark/ml/tests/test_feature.py", line 87, in > test_stopwordsremover > stopWordRemover = StopWordsRemover(inputCol="input", outputCol="output") > File "/spark-1/python/pyspark/__init__.py", line 111, in wrapper > return func(self, **kwargs) > File "/spark-1/python/pyspark/ml/feature.py", line 2646, in __init__ > self.uid) > File "/spark-1/python/pyspark/ml/wrapper.py", line 67, in _new_java_obj > return java_obj(*java_args) > File /spark-1/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line > 1554, in __call__ > answer, self._gateway_client, None, self._fqn) > File "/spark-1/python/pyspark/sql/utils.py", line 93, in deco > raise converted > pyspark.sql.utils.IllegalArgumentException: 'StopWordsRemover_4598673ee802 > parameter locale given invalid value en_TW.' > {code} > As per [~hyukjin.kwon]'s advice, instead of setting up locale to pass test, > it is better to have a workable locale if system default locale can't be > found in available locales in JVM. 
Otherwise, users have to manually change the > system locale or access the private property {{_jvm}} in PySpark. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
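The proposed fix amounts to a simple fallback. A minimal sketch, assuming a plain set stands in for the JVM's available-locale list (`pick_locale` is an illustrative name, not Spark's actual API):

```python
def pick_locale(system_default, available):
    """Return the system default locale only if the JVM recognizes it.

    Hypothetical simplification of the proposed StopWordsRemover default:
    `available` stands in for Java's Locale.getAvailableLocales(); any
    unrecognized default falls back to en_US so the locale param always
    passes validation.
    """
    return system_default if system_default in available else "en_US"


# en_TW is not in the JVM's available locales, so the param becomes en_US.
print(pick_locale("en_TW", {"en_US", "fr_FR"}))  # en_US
print(pick_locale("fr_FR", {"en_US", "fr_FR"}))  # fr_FR
```

With this default, users on systems like en_TW no longer need to change the system locale or reach into `_jvm` by hand.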
[jira] [Updated] (SPARK-28365) Set default locale param for StopWordsRemover to en_US if system default locale isn't in available locales in JVM
[ https://issues.apache.org/jira/browse/SPARK-28365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-28365: Priority: Major (was: Minor)
[jira] [Updated] (SPARK-28365) Set default locale param for StopWordsRemover to en_US if system default locale isn't in available locales in JVM
[ https://issues.apache.org/jira/browse/SPARK-28365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-28365: Component/s: (was: PySpark)
[jira] [Updated] (SPARK-28365) Set default locale param for StopWordsRemover to en_US if system default locale isn't in available locales in JVM
[ https://issues.apache.org/jira/browse/SPARK-28365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-28365: Issue Type: Bug (was: Test)
[jira] [Updated] (SPARK-28365) Set default locale for StopWordsRemover tests to prevent invalid locale error during test
[ https://issues.apache.org/jira/browse/SPARK-28365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-28365: Component/s: (was: Tests) ML
[jira] [Updated] (SPARK-28365) Set default locale for StopWordsRemover tests to prevent invalid locale error during test
[ https://issues.apache.org/jira/browse/SPARK-28365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-28365: Description: Because the system default locale isn't among the available locales in {{Locale}}, {{StopWordsRemover}}-related Python tests hit errors such as {{pyspark.sql.utils.IllegalArgumentException: 'StopWordsRemover_4598673ee802 parameter locale given invalid value en_TW.'}} As per [~hyukjin.kwon]'s advice, instead of setting a locale just to pass the test, it is better to fall back to a workable locale when the system default locale can't be found among the available locales in the JVM. Otherwise, users have to manually change the system locale or access the private property {{_jvm}} in PySpark.
[jira] [Created] (SPARK-28381) Upgraded version of Pyrolite to 4.30
Liang-Chi Hsieh created SPARK-28381: --- Summary: Upgraded version of Pyrolite to 4.30 Key: SPARK-28381 URL: https://issues.apache.org/jira/browse/SPARK-28381 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.0.0 Reporter: Liang-Chi Hsieh This upgrades Pyrolite to a newer version. Most updates in the newer version are for .NET. For Java, it includes a bug fix to {{Unpickler}} regarding cleaning up the Unpickler memo, and support for pickle protocol 5. After upgrading, we can remove the workaround added in SPARK-27629 for the bug in {{Unpickler}}.
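Pickle protocol 5 (PEP 574) is the Python side of that new support. A quick local round-trip shows what the upgraded unpickler must now accept (plain CPython here, no Pyrolite or Spark involved):

```python
import pickle

# Protocol 5 (PEP 574, Python 3.8+) adds out-of-band buffer support; the
# upgraded Pyrolite Unpickler can read payloads pickled this way on the JVM.
data = {"values": list(range(5)), "name": "example"}
payload = pickle.dumps(data, protocol=5)

# The stream starts with the PROTO opcode (0x80) and protocol number 5.
print(payload[:2])  # b'\x80\x05'

# Content round-trips unchanged.
assert pickle.loads(payload) == data
```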
[jira] [Created] (SPARK-28378) Remove usage of cgi.escape
Liang-Chi Hsieh created SPARK-28378: --- Summary: Remove usage of cgi.escape Key: SPARK-28378 URL: https://issues.apache.org/jira/browse/SPARK-28378 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.0.0 Reporter: Liang-Chi Hsieh {{cgi.escape}} is deprecated [1] and removed in Python 3.8 [2]. We'd better replace it. [1] [https://docs.python.org/3/library/cgi.html#cgi.escape] [2] [https://docs.python.org/3.8/whatsnew/3.8.html#api-and-feature-removals]
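The standard-library replacement is `html.escape`. One behavioral difference is worth preserving when swapping it in (this is general Python knowledge, not the actual Spark patch; `escape_compat` is an illustrative name):

```python
import html

def escape_compat(s: str) -> str:
    # cgi.escape(s) did not escape quote characters by default, while
    # html.escape(s) does; pass quote=False to match the old output exactly.
    return html.escape(s, quote=False)

# '&' -> '&amp;', '<' -> '&lt;', '>' -> '&gt;'; double quotes stay as-is.
print(escape_compat('<tag attr="x & y">'))  # &lt;tag attr="x &amp; y"&gt;
```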
[jira] [Created] (SPARK-28365) Set default locale for StopWordsRemover tests to prevent invalid locale error during test
Liang-Chi Hsieh created SPARK-28365: --- Summary: Set default locale for StopWordsRemover tests to prevent invalid locale error during test Key: SPARK-28365 URL: https://issues.apache.org/jira/browse/SPARK-28365 Project: Spark Issue Type: Test Components: PySpark, Tests Affects Versions: 3.0.0 Reporter: Liang-Chi Hsieh
[jira] [Commented] (SPARK-28345) PythonUDF predicate should be able to pushdown to join
[ https://issues.apache.org/jira/browse/SPARK-28345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882712#comment-16882712 ] Liang-Chi Hsieh commented on SPARK-28345: - I found this when doing SPARK-28276. > PythonUDF predicate should be able to pushdown to join > -- > > Key: SPARK-28345 > URL: https://issues.apache.org/jira/browse/SPARK-28345 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL > Affects Versions: 3.0.0 > Reporter: Liang-Chi Hsieh > Priority: Major > > Currently, a Filter predicate using a PythonUDF can't be pushed down into a join condition. Such a predicate should be able to be pushed down into the join condition. For PythonUDFs that can't be evaluated in the join condition, {{PullOutPythonUDFInJoinCondition}} will pull them out later.
[jira] [Created] (SPARK-28345) PythonUDF predicate should be able to pushdown to join
Liang-Chi Hsieh created SPARK-28345: --- Summary: PythonUDF predicate should be able to pushdown to join Key: SPARK-28345 URL: https://issues.apache.org/jira/browse/SPARK-28345 Project: Spark Issue Type: Improvement Components: PySpark, SQL Affects Versions: 3.0.0 Reporter: Liang-Chi Hsieh
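The pushdown decision can be sketched as a reference-set check. This is a hypothetical simplification of what Catalyst's pushdown rules do (attribute-name sets stand in for expression references; `classify_join_predicate` is not a real Spark function):

```python
def classify_join_predicate(predicate_refs, left_output, right_output):
    # A predicate can be pushed to one side of a join when every attribute it
    # references comes from that side alone. A Python-evaluable predicate
    # (e.g. one wrapping a PythonUDF) that spans both sides cannot run inside
    # the join condition and must be pulled out afterwards, which is roughly
    # the job of PullOutPythonUDFInJoinCondition in Spark.
    refs = set(predicate_refs)
    if refs <= set(left_output):
        return "push_left"
    if refs <= set(right_output):
        return "push_right"
    return "pull_out"


print(classify_join_predicate({"a"}, {"a", "b"}, {"c"}))       # push_left
print(classify_join_predicate({"a", "c"}, {"a", "b"}, {"c"}))  # pull_out
```

The bug was that this classification was never attempted for PythonUDF predicates at all, even when all referenced attributes came from a single side.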
[jira] [Commented] (SPARK-28323) PythonUDF should be able to use in join condition
[ https://issues.apache.org/jira/browse/SPARK-28323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881717#comment-16881717 ] Liang-Chi Hsieh commented on SPARK-28323: - I found this bug when doing SPARK-28276. > PythonUDF should be able to use in join condition > - > > Key: SPARK-28323 > URL: https://issues.apache.org/jira/browse/SPARK-28323 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Priority: Major > > There is a bug in {{ExtractPythonUDFs}} that produces wrong result > attributes. It causes a failure when using PythonUDFs among multiple child > plans, e.g., join. An example is using PythonUDFs in join condition. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28323) PythonUDF should be able to use in join condition
Liang-Chi Hsieh created SPARK-28323: --- Summary: PythonUDF should be able to use in join condition Key: SPARK-28323 URL: https://issues.apache.org/jira/browse/SPARK-28323 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 3.0.0 Reporter: Liang-Chi Hsieh
[jira] [Commented] (SPARK-28276) Convert and port 'cross-join.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881033#comment-16881033 ] Liang-Chi Hsieh commented on SPARK-28276: - Will look into this. > Convert and port 'cross-join.sql' into UDF test base > > > Key: SPARK-28276 > URL: https://issues.apache.org/jira/browse/SPARK-28276 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24152) SparkR CRAN feasibility check server problem
[ https://issues.apache.org/jira/browse/SPARK-24152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16877497#comment-16877497 ] Liang-Chi Hsieh edited comment on SPARK-24152 at 7/3/19 5:19 AM: - Received a reply that it is cleaned up. So I think it is fine now. was (Author: viirya): Received a reply that it is cleaned up. > SparkR CRAN feasibility check server problem > > > Key: SPARK-24152 > URL: https://issues.apache.org/jira/browse/SPARK-24152 > Project: Spark > Issue Type: Bug > Components: SparkR > Affects Versions: 2.4.0 > Reporter: Dongjoon Hyun > Assignee: Liang-Chi Hsieh > Priority: Critical > > PR builder and master branch tests fail with the following SparkR error for an unknown reason. The following is the error message: > {code} > * this is package 'SparkR' version '2.4.0' > * checking CRAN incoming feasibility ...Error in .check_package_CRAN_incoming(pkgdir) : > dims [product 24] do not match the length of object [0] > Execution halted > {code} > *PR BUILDER* > - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90039/ > - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89983/ > - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89998/ > *MASTER BRANCH* > - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4458/ (Fail with no failures) > This is critical because we already started to merge PRs by ignoring this > **known unknown** SparkR failure. > - https://github.com/apache/spark/pull/21175
[jira] [Commented] (SPARK-24152) SparkR CRAN feasibility check server problem
[ https://issues.apache.org/jira/browse/SPARK-24152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16877497#comment-16877497 ] Liang-Chi Hsieh commented on SPARK-24152: - Received a reply that it is cleaned up.
[jira] [Commented] (SPARK-24152) SparkR CRAN feasibility check server problem
[ https://issues.apache.org/jira/browse/SPARK-24152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16877401#comment-16877401 ] Liang-Chi Hsieh commented on SPARK-24152: - I noticed that this issue happens now again. Contacted CRAN admin and asked for help. Will update when they reply.
[jira] [Created] (SPARK-28215) as_tibble was removed from Arrow R API
Liang-Chi Hsieh created SPARK-28215: --- Summary: as_tibble was removed from Arrow R API Key: SPARK-28215 URL: https://issues.apache.org/jira/browse/SPARK-28215 Project: Spark Issue Type: Bug Components: R Affects Versions: 3.0.0 Reporter: Liang-Chi Hsieh The new R API of Arrow has removed {{as_tibble}}. The Arrow-optimized collect in R no longer works due to this change.
[jira] [Commented] (SPARK-22340) pyspark setJobGroup doesn't match java threads
[ https://issues.apache.org/jira/browse/SPARK-22340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16872852#comment-16872852 ] Liang-Chi Hsieh commented on SPARK-22340: - [~hyukjin.kwon] Should we reopen this as you opened a PR for it now? > pyspark setJobGroup doesn't match java threads > -- > > Key: SPARK-22340 > URL: https://issues.apache.org/jira/browse/SPARK-22340 > Project: Spark > Issue Type: Bug > Components: PySpark > Affects Versions: 2.0.2 > Reporter: Leif Walsh > Priority: Major > Labels: bulk-closed > > With pyspark, {{sc.setJobGroup}}'s documentation says > {quote} > Assigns a group ID to all the jobs started by this thread until the group ID > is set to a different value or cleared. > {quote} > However, this doesn't appear to be associated with Python threads, only with > Java threads. As such, a Python thread which calls this and then submits > multiple jobs doesn't necessarily get its jobs associated with any particular > spark job group. For example: > {code} > def run_jobs(): > sc.setJobGroup('hello', 'hello jobs') > x = sc.range(100).sum() > y = sc.range(1000).sum() > return x, y > import concurrent.futures > with concurrent.futures.ThreadPoolExecutor() as executor: > future = executor.submit(run_jobs) > sc.cancelJobGroup('hello') > future.result() > {code} > In this example, depending on how the action calls on the Python side are > allocated to Java threads, the jobs for {{x}} and {{y}} won't necessarily be > assigned the job group {{hello}}. > First, we should clarify the docs if this truly is the case. > Second, it would be really helpful if we could make the job group assignment > reliable for a Python thread, though I'm not sure the best way to do this. > As it stands, job groups are pretty useless from the pyspark side, if we > can't rely on this fact.
> My only idea so far is to mimic the TLS behavior on the Python side and then > patch every point where job submission may take place to pass that in, but > this feels pretty brittle. In my experience with py4j, controlling threading > there is a challenge.
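The "mimic TLS on the Python side" idea can be sketched with `threading.local` (hypothetical helper names, not PySpark's actual API; the value is visible only to the thread that set it):

```python
import threading

# Hypothetical Python-side thread-local mimic of SparkContext.setJobGroup:
# each Python thread records its own group, so submission helpers could
# re-apply it to whichever JVM thread ends up serving the py4j call.
_job_group = threading.local()

def set_job_group(group_id, description):
    _job_group.value = (group_id, description)

def current_job_group():
    # Threads that never called set_job_group see no group at all.
    return getattr(_job_group, "value", None)
```

Every job-submission path would then have to read `current_job_group()` and forward it across py4j, which is exactly the brittleness the comment above warns about.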
[jira] [Commented] (SPARK-28079) CSV fails to detect corrupt record unless "columnNameOfCorruptRecord" is manually added to the schema
[ https://issues.apache.org/jira/browse/SPARK-28079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16868577#comment-16868577 ] Liang-Chi Hsieh commented on SPARK-28079: - {{columnNameOfCorruptRecord}} is currently applied only when users explicitly specify it in the user-defined schema, as documented. I think you are proposing to let Spark SQL add the {{columnNameOfCorruptRecord}} column automatically, without an explicit user-defined schema? > CSV fails to detect corrupt record unless "columnNameOfCorruptRecord" is > manually added to the schema > - > > Key: SPARK-28079 > URL: https://issues.apache.org/jira/browse/SPARK-28079 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 2.3.2, 2.4.3 > Reporter: F Jimenez > Priority: Major > > When reading a CSV with mode = "PERMISSIVE", corrupt records are not flagged as such and are read in. The only way to get them flagged is to manually set > "columnNameOfCorruptRecord" AND manually set the schema including this > column.
Example: > {code:java} > // Second row has a 4th column that is not declared in the header/schema > val csvText = s""" > | FieldA, FieldB, FieldC > | a1,b1,c1 > | a2,b2,c2,d*""".stripMargin > val csvFile = new File("/tmp/file.csv") > FileUtils.write(csvFile, csvText) > val reader = sqlContext.read > .format("csv") > .option("header", "true") > .option("mode", "PERMISSIVE") > .option("columnNameOfCorruptRecord", "corrupt") > .schema("corrupt STRING, fieldA STRING, fieldB STRING, fieldC STRING") > reader.load(csvFile.getAbsolutePath).show(truncate = false) > {code} > This produces the correct result: > {code:java} > ++--+--+--+ > |corrupt |fieldA|fieldB|fieldC| > ++--+--+--+ > |null | a1 |b1 |c1 | > | a2,b2,c2,d*| a2 |b2 |c2 | > ++--+--+--+ > {code} > However removing the "schema" option and going: > {code:java} > val reader = sqlContext.read > .format("csv") > .option("header", "true") > .option("mode", "PERMISSIVE") > .option("columnNameOfCorruptRecord", "corrupt") > reader.load(csvFile.getAbsolutePath).show(truncate = false) > {code} > Yields: > {code:java} > +---+---+---+ > | FieldA| FieldB| FieldC| > +---+---+---+ > | a1 |b1 |c1 | > | a2 |b2 |c2 | > +---+---+---+ > {code} > The fourth value "d*" in the second row has been removed and the row not > marked as corrupt > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27946) Hive DDL to Spark DDL conversion USING "show create table"
[ https://issues.apache.org/jira/browse/SPARK-27946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16867771#comment-16867771 ] Liang-Chi Hsieh commented on SPARK-27946: - [~smilegator] Thanks for pinging me. I'd like to, although I'm a little busy these days. I will try to work on it this weekend. If others are interested in working on this, please take it. > Hive DDL to Spark DDL conversion USING "show create table" > -- > > Key: SPARK-27946 > URL: https://issues.apache.org/jira/browse/SPARK-27946 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > > Many users migrate tables created with Hive DDL to Spark. Defining the table > with Spark DDL brings performance benefits. We need to add a feature to Show > Create Table that allows you to generate Spark DDL for a table. For example: > `SHOW CREATE TABLE customers AS SPARK`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28079) CSV fails to detect corrupt record unless "columnNameOfCorruptRecord" is manually added to the schema
[ https://issues.apache.org/jira/browse/SPARK-28079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16866777#comment-16866777 ] Liang-Chi Hsieh commented on SPARK-28079: - Isn't it the expected behavior as documented in {{PERMISSIVE}} mode of CSV? > CSV fails to detect corrupt record unless "columnNameOfCorruptRecord" is > manually added to the schema > - > > Key: SPARK-28079 > URL: https://issues.apache.org/jira/browse/SPARK-28079 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2, 2.4.3 >Reporter: F Jimenez >Priority: Major > > When reading a CSV with mode = "PERMISSIVE", corrupt records are not flagged > as such and read in. Only way to get them flagged is to manually set > "columnNameOfCorruptRecord" AND manually setting the schema including this > column. Example: > {code:java} > // Second row has a 4th column that is not declared in the header/schema > val csvText = s""" > | FieldA, FieldB, FieldC > | a1,b1,c1 > | a2,b2,c2,d*""".stripMargin > val csvFile = new File("/tmp/file.csv") > FileUtils.write(csvFile, csvText) > val reader = sqlContext.read > .format("csv") > .option("header", "true") > .option("mode", "PERMISSIVE") > .option("columnNameOfCorruptRecord", "corrupt") > .schema("corrupt STRING, fieldA STRING, fieldB STRING, fieldC STRING") > reader.load(csvFile.getAbsolutePath).show(truncate = false) > {code} > This produces the correct result: > {code:java} > ++--+--+--+ > |corrupt |fieldA|fieldB|fieldC| > ++--+--+--+ > |null | a1 |b1 |c1 | > | a2,b2,c2,d*| a2 |b2 |c2 | > ++--+--+--+ > {code} > However removing the "schema" option and going: > {code:java} > val reader = sqlContext.read > .format("csv") > .option("header", "true") > .option("mode", "PERMISSIVE") > .option("columnNameOfCorruptRecord", "corrupt") > reader.load(csvFile.getAbsolutePath).show(truncate = false) > {code} > Yields: > {code:java} > +---+---+---+ > | FieldA| FieldB| FieldC| > +---+---+---+ > | a1 |b1 |c1 | > | 
a2 |b2 |c2 | > +---+---+---+ > {code} > The fourth value "d*" in the second row has been removed and the row not > marked as corrupt > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-28058) Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
[ https://issues.apache.org/jira/browse/SPARK-28058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865714#comment-16865714 ] Liang-Chi Hsieh edited comment on SPARK-28058 at 6/17/19 3:59 PM: -- [~hyukjin.kwon] Do you mean this is suspected to be a bug: {code} scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false) +--+--+ |fruit |color | +--+--+ |apple |red | |banana|yellow| |orange|orange| |xxx |null | +--+--+ {code} In this case, the reader should read two columns, but the corrupted record has only one column. Reasonably, it should be dropped as a malformed one, yet we see the missing column filled with null. This seems to be inherited from the Univocity parser, when we use {{CsvParserSettings.selectIndexes}} to do field selection. In the above case, the parser returns two tokens where the second token is just null. I'm not sure whether this is known behavior of the Univocity parser or a bug in it. was (Author: viirya): [~hyukjin.kwon] Do you mean this is suspected to be a bug: {code} scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false) +--+--+ |fruit |color | +--+--+ |apple |red | |banana|yellow| |orange|orange| |xxx |null | +--+--+ {code} In this case, the reader should read two columns, but the corrupted record has only one column. Reasonably, it should be dropped as a malformed one, yet we see the missing column filled with null. > Reading csv with DROPMALFORMED sometimes doesn't drop malformed records > --- > > Key: SPARK-28058 > URL: https://issues.apache.org/jira/browse/SPARK-28058 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.1, 2.4.3 >Reporter: Stuart White >Priority: Minor > Labels: CSV, csv, csvparser > > The spark sql csv reader is not dropping malformed records as expected. > Consider this file (fruit.csv). 
Notice it contains a header record, 3 valid > records, and one malformed record. > {noformat} > fruit,color,price,quantity > apple,red,1,3 > banana,yellow,2,4 > orange,orange,3,5 > xxx > {noformat} > If I read this file using the spark sql csv reader as follows, everything > looks good. The malformed record is dropped. > {noformat} > scala> spark.read.option("header", "true").option("mode", > "DROPMALFORMED").csv("fruit.csv").show(truncate=false) > +--+--+-++ > > |fruit |color |price|quantity| > +--+--+-++ > |apple |red |1|3 | > |banana|yellow|2|4 | > |orange|orange|3|5 | > +--+--+-++ > {noformat} > However, if I select a subset of the columns, the malformed record is not > dropped. The malformed data is placed in the first column, and the remaining > column(s) are filled with nulls. > {noformat} > scala> spark.read.option("header", "true").option("mode", > "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false) > +--+ > |fruit | > +--+ > |apple | > |banana| > |orange| > |xxx | > +--+ > scala> spark.read.option("header", "true").option("mode", > "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false) > +--+--+ > |fruit |color | > +--+--+ > |apple |red | > |banana|yellow| > |orange|orange| > |xxx |null | > +--+--+ > scala> spark.read.option("header", "true").option("mode", > "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, > 'price).show(truncate=false) > +--+--+-+ > |fruit |color |price| > +--+--+-+ > |apple |red |1| > |banana|yellow|2| > |orange|orange|3| > |xxx |null |null | > +--+--+-+ > {noformat} > And finally, if I manually select all of the columns, the malformed record is > once again dropped. 
> {noformat} > scala> spark.read.option("header", "true").option("mode", > "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price, > 'quantity).show(truncate=false) > +--+--+-++ > |fruit |color |price|quantity| > +--+--+-++ > |apple |red |1|3 | > |banana|yellow|2|4 | > |orange|orange|3|5 | > +--+--+-++ > {noformat} > I would expect the malformed record(s) to be dropped regardless of which > columns are being selected from the file. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mai
[jira] [Commented] (SPARK-28058) Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
[ https://issues.apache.org/jira/browse/SPARK-28058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865714#comment-16865714 ] Liang-Chi Hsieh commented on SPARK-28058: - [~hyukjin.kwon] Do you mean this is suspected to be a bug: {code} scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false) +--+--+ |fruit |color | +--+--+ |apple |red | |banana|yellow| |orange|orange| |xxx |null | +--+--+ {code} In this case, the reader should read two columns, but the corrupted record has only one column. Reasonably, it should be dropped as a malformed one, yet we see the missing column filled with null. > Reading csv with DROPMALFORMED sometimes doesn't drop malformed records > --- > > Key: SPARK-28058 > URL: https://issues.apache.org/jira/browse/SPARK-28058 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.1, 2.4.3 >Reporter: Stuart White >Priority: Minor > Labels: CSV, csv, csvparser > > The spark sql csv reader is not dropping malformed records as expected. > Consider this file (fruit.csv). Notice it contains a header record, 3 valid > records, and one malformed record. > {noformat} > fruit,color,price,quantity > apple,red,1,3 > banana,yellow,2,4 > orange,orange,3,5 > xxx > {noformat} > If I read this file using the spark sql csv reader as follows, everything > looks good. The malformed record is dropped. > {noformat} > scala> spark.read.option("header", "true").option("mode", > "DROPMALFORMED").csv("fruit.csv").show(truncate=false) > +--+--+-++ > > |fruit |color |price|quantity| > +--+--+-++ > |apple |red |1|3 | > |banana|yellow|2|4 | > |orange|orange|3|5 | > +--+--+-++ > {noformat} > However, if I select a subset of the columns, the malformed record is not > dropped. The malformed data is placed in the first column, and the remaining > column(s) are filled with nulls. 
> {noformat} > scala> spark.read.option("header", "true").option("mode", > "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false) > +--+ > |fruit | > +--+ > |apple | > |banana| > |orange| > |xxx | > +--+ > scala> spark.read.option("header", "true").option("mode", > "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false) > +--+--+ > |fruit |color | > +--+--+ > |apple |red | > |banana|yellow| > |orange|orange| > |xxx |null | > +--+--+ > scala> spark.read.option("header", "true").option("mode", > "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, > 'price).show(truncate=false) > +--+--+-+ > |fruit |color |price| > +--+--+-+ > |apple |red |1| > |banana|yellow|2| > |orange|orange|3| > |xxx |null |null | > +--+--+-+ > {noformat} > And finally, if I manually select all of the columns, the malformed record is > once again dropped. > {noformat} > scala> spark.read.option("header", "true").option("mode", > "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price, > 'quantity).show(truncate=false) > +--+--+-++ > |fruit |color |price|quantity| > +--+--+-++ > |apple |red |1|3 | > |banana|yellow|2|4 | > |orange|orange|3|5 | > +--+--+-++ > {noformat} > I would expect the malformed record(s) to be dropped regardless of which > columns are being selected from the file. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28058) Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
[ https://issues.apache.org/jira/browse/SPARK-28058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865695#comment-16865695 ] Liang-Chi Hsieh commented on SPARK-28058: - [~stwhit] Thanks for letting us know that! Although it is in the migration guide, for new users it should be good to have the note in {{DROPMALFORMED}} doc. I filed SPARK-28082 to track the doc improvement. > Reading csv with DROPMALFORMED sometimes doesn't drop malformed records > --- > > Key: SPARK-28058 > URL: https://issues.apache.org/jira/browse/SPARK-28058 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.1, 2.4.3 >Reporter: Stuart White >Priority: Minor > Labels: CSV, csv, csvparser > > The spark sql csv reader is not dropping malformed records as expected. > Consider this file (fruit.csv). Notice it contains a header record, 3 valid > records, and one malformed record. > {noformat} > fruit,color,price,quantity > apple,red,1,3 > banana,yellow,2,4 > orange,orange,3,5 > xxx > {noformat} > If I read this file using the spark sql csv reader as follows, everything > looks good. The malformed record is dropped. > {noformat} > scala> spark.read.option("header", "true").option("mode", > "DROPMALFORMED").csv("fruit.csv").show(truncate=false) > +--+--+-++ > > |fruit |color |price|quantity| > +--+--+-++ > |apple |red |1|3 | > |banana|yellow|2|4 | > |orange|orange|3|5 | > +--+--+-++ > {noformat} > However, if I select a subset of the columns, the malformed record is not > dropped. The malformed data is placed in the first column, and the remaining > column(s) are filled with nulls. 
> {noformat} > scala> spark.read.option("header", "true").option("mode", > "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false) > +--+ > |fruit | > +--+ > |apple | > |banana| > |orange| > |xxx | > +--+ > scala> spark.read.option("header", "true").option("mode", > "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false) > +--+--+ > |fruit |color | > +--+--+ > |apple |red | > |banana|yellow| > |orange|orange| > |xxx |null | > +--+--+ > scala> spark.read.option("header", "true").option("mode", > "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, > 'price).show(truncate=false) > +--+--+-+ > |fruit |color |price| > +--+--+-+ > |apple |red |1| > |banana|yellow|2| > |orange|orange|3| > |xxx |null |null | > +--+--+-+ > {noformat} > And finally, if I manually select all of the columns, the malformed record is > once again dropped. > {noformat} > scala> spark.read.option("header", "true").option("mode", > "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price, > 'quantity).show(truncate=false) > +--+--+-++ > |fruit |color |price|quantity| > +--+--+-++ > |apple |red |1|3 | > |banana|yellow|2|4 | > |orange|orange|3|5 | > +--+--+-++ > {noformat} > I would expect the malformed record(s) to be dropped regardless of which > columns are being selected from the file. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28082) Add a note to DROPMALFORMED mode of CSV for column pruning
Liang-Chi Hsieh created SPARK-28082: --- Summary: Add a note to DROPMALFORMED mode of CSV for column pruning Key: SPARK-28082 URL: https://issues.apache.org/jira/browse/SPARK-28082 Project: Spark Issue Type: Documentation Components: PySpark, SQL Affects Versions: 3.0.0 Reporter: Liang-Chi Hsieh This is inspired by SPARK-28058. When using {{DROPMALFORMED}} mode, corrupted records aren't dropped if malformed columns aren't read. This behavior is due to CSV parser column pruning. Current doc of {{DROPMALFORMED}} doesn't mention the effect of column pruning. Users will be confused by the fact that {{DROPMALFORMED}} mode doesn't work as expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28058) Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
[ https://issues.apache.org/jira/browse/SPARK-28058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865664#comment-16865664 ] Liang-Chi Hsieh commented on SPARK-28058: - Although this isn't a bug, I think it might be worth adding a note to current doc to explain/clarify it. > Reading csv with DROPMALFORMED sometimes doesn't drop malformed records > --- > > Key: SPARK-28058 > URL: https://issues.apache.org/jira/browse/SPARK-28058 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.1, 2.4.3 >Reporter: Stuart White >Priority: Minor > Labels: CSV, csv, csvparser > > The spark sql csv reader is not dropping malformed records as expected. > Consider this file (fruit.csv). Notice it contains a header record, 3 valid > records, and one malformed record. > {noformat} > fruit,color,price,quantity > apple,red,1,3 > banana,yellow,2,4 > orange,orange,3,5 > xxx > {noformat} > If I read this file using the spark sql csv reader as follows, everything > looks good. The malformed record is dropped. > {noformat} > scala> spark.read.option("header", "true").option("mode", > "DROPMALFORMED").csv("fruit.csv").show(truncate=false) > +--+--+-++ > > |fruit |color |price|quantity| > +--+--+-++ > |apple |red |1|3 | > |banana|yellow|2|4 | > |orange|orange|3|5 | > +--+--+-++ > {noformat} > However, if I select a subset of the columns, the malformed record is not > dropped. The malformed data is placed in the first column, and the remaining > column(s) are filled with nulls. 
> {noformat} > scala> spark.read.option("header", "true").option("mode", > "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false) > +--+ > |fruit | > +--+ > |apple | > |banana| > |orange| > |xxx | > +--+ > scala> spark.read.option("header", "true").option("mode", > "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false) > +--+--+ > |fruit |color | > +--+--+ > |apple |red | > |banana|yellow| > |orange|orange| > |xxx |null | > +--+--+ > scala> spark.read.option("header", "true").option("mode", > "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, > 'price).show(truncate=false) > +--+--+-+ > |fruit |color |price| > +--+--+-+ > |apple |red |1| > |banana|yellow|2| > |orange|orange|3| > |xxx |null |null | > +--+--+-+ > {noformat} > And finally, if I manually select all of the columns, the malformed record is > once again dropped. > {noformat} > scala> spark.read.option("header", "true").option("mode", > "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price, > 'quantity).show(truncate=false) > +--+--+-++ > |fruit |color |price|quantity| > +--+--+-++ > |apple |red |1|3 | > |banana|yellow|2|4 | > |orange|orange|3|5 | > +--+--+-++ > {noformat} > I would expect the malformed record(s) to be dropped regardless of which > columns are being selected from the file. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28058) Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
[ https://issues.apache.org/jira/browse/SPARK-28058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865650#comment-16865650 ] Liang-Chi Hsieh commented on SPARK-28058: - This is due to CSV parser column pruning. You can disable it, and the behavior will be what you expect: {code:java} scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false) +--+ |fruit | +--+ |apple | |banana| |orange| |xxx | +--+ scala> spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", false) scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false) +--+ |fruit | +--+ |apple | |banana| |orange| +--+ {code} > Reading csv with DROPMALFORMED sometimes doesn't drop malformed records > --- > > Key: SPARK-28058 > URL: https://issues.apache.org/jira/browse/SPARK-28058 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.1, 2.4.3 >Reporter: Stuart White >Priority: Minor > Labels: CSV, csv, csvparser > > The spark sql csv reader is not dropping malformed records as expected. > Consider this file (fruit.csv). Notice it contains a header record, 3 valid > records, and one malformed record. > {noformat} > fruit,color,price,quantity > apple,red,1,3 > banana,yellow,2,4 > orange,orange,3,5 > xxx > {noformat} > If I read this file using the spark sql csv reader as follows, everything > looks good. The malformed record is dropped. > {noformat} > scala> spark.read.option("header", "true").option("mode", > "DROPMALFORMED").csv("fruit.csv").show(truncate=false) > +--+--+-++ > > |fruit |color |price|quantity| > +--+--+-++ > |apple |red |1|3 | > |banana|yellow|2|4 | > |orange|orange|3|5 | > +--+--+-++ > {noformat} > However, if I select a subset of the columns, the malformed record is not > dropped. The malformed data is placed in the first column, and the remaining > column(s) are filled with nulls. 
> {noformat} > scala> spark.read.option("header", "true").option("mode", > "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false) > +--+ > |fruit | > +--+ > |apple | > |banana| > |orange| > |xxx | > +--+ > scala> spark.read.option("header", "true").option("mode", > "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false) > +--+--+ > |fruit |color | > +--+--+ > |apple |red | > |banana|yellow| > |orange|orange| > |xxx |null | > +--+--+ > scala> spark.read.option("header", "true").option("mode", > "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, > 'price).show(truncate=false) > +--+--+-+ > |fruit |color |price| > +--+--+-+ > |apple |red |1| > |banana|yellow|2| > |orange|orange|3| > |xxx |null |null | > +--+--+-+ > {noformat} > And finally, if I manually select all of the columns, the malformed record is > once again dropped. > {noformat} > scala> spark.read.option("header", "true").option("mode", > "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price, > 'quantity).show(truncate=false) > +--+--+-++ > |fruit |color |price|quantity| > +--+--+-++ > |apple |red |1|3 | > |banana|yellow|2|4 | > |orange|orange|3|5 | > +--+--+-++ > {noformat} > I would expect the malformed record(s) to be dropped regardless of which > columns are being selected from the file. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
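The column-pruning effect discussed in this thread can be modeled in a few lines of plain Python (a sketch only; the real behavior lives in the Univocity parser Spark uses, and the function names here are illustrative): once only the selected indexes are materialized and missing fields are padded with null, a short row is indistinguishable from a complete one.

```python
# Toy model of CSV parsing with and without column pruning; not the
# actual Univocity parser.

def parse_full(line):
    # Without pruning, every token comes back, so a short row is
    # detectable by comparing the token count to the schema width.
    return line.split(",")

def parse_pruned(line, selected_indexes):
    # With pruning (cf. CsvParserSettings.selectIndexes), only the
    # requested indexes come back, padded with None when absent --
    # the result then always matches the selection width.
    tokens = line.split(",")
    return [tokens[i] if i < len(tokens) else None for i in selected_indexes]

SCHEMA_WIDTH = 4
malformed = "xxx"  # only 1 of the 4 schema fields

assert len(parse_full(malformed)) != SCHEMA_WIDTH        # detected, dropped
assert parse_pruned(malformed, [0, 1]) == ["xxx", None]  # slips through
```

This is why selecting a subset of columns yields ("xxx", null) instead of dropping the record: the malformed-row check never sees a token-count mismatch.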
[jira] [Commented] (SPARK-28054) Unable to insert partitioned table dynamically when partition name is upper case
[ https://issues.apache.org/jira/browse/SPARK-28054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865006#comment-16865006 ] Liang-Chi Hsieh commented on SPARK-28054: - I tested on Hive; the query works. Btw, the issue is also reproducible on current master. > Unable to insert partitioned table dynamically when partition name is upper > case > > > Key: SPARK-28054 > URL: https://issues.apache.org/jira/browse/SPARK-28054 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: ChenKai >Priority: Major > > {code:java} > -- create sql and column name is upper case > CREATE TABLE src (KEY STRING, VALUE STRING) PARTITIONED BY (DS STRING) > -- insert sql > INSERT INTO TABLE src PARTITION(ds) SELECT 'k' key, 'v' value, '1' ds > {code} > The error is: > {code:java} > Error in query: > org.apache.hadoop.hive.ql.metadata.Table.ValidationFailureSemanticException: > Partition spec {ds=, DS=1} contains non-partition columns; > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28054) Unable to insert partitioned table dynamically when partition name is upper case
[ https://issues.apache.org/jira/browse/SPARK-28054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16864189#comment-16864189 ] Liang-Chi Hsieh commented on SPARK-28054: - Is this query working on Hive? > Unable to insert partitioned table dynamically when partition name is upper > case > > > Key: SPARK-28054 > URL: https://issues.apache.org/jira/browse/SPARK-28054 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: ChenKai >Priority: Major > > {code:java} > -- create sql and column name is upper case > CREATE TABLE src (KEY INT, VALUE STRING) PARTITIONED BY (DS STRING) > -- insert sql > INSERT INTO TABLE src PARTITION(ds) SELECT 'k' key, 'v' value, '1' ds > {code} > The error is: > {code:java} > Error in query: > org.apache.hadoop.hive.ql.metadata.Table.ValidationFailureSemanticException: > Partition spec {ds=, DS=1} contains non-partition columns; > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28043) Reading json with duplicate columns drops the first column value
[ https://issues.apache.org/jira/browse/SPARK-28043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16864107#comment-16864107 ] Liang-Chi Hsieh commented on SPARK-28043: - To make duplicate JSON keys work, I thought about it and looked at our current implementation. One concern is how we know which key maps to which Spark SQL field. Suppose we have two duplicate keys "a" as above. We infer the Spark SQL schema as "a string, a string". Does the order of keys in the JSON string imply the order of fields? In our current implementation, such a mapping doesn't exist. It means the order of keys can be different in each JSON string. Isn't it prone to unnoticed errors when reading JSON? Another option is to forbid duplicate JSON keys. Maybe add a legacy config for falling back to the current behavior, if we don't want to break existing code? > Reading json with duplicate columns drops the first column value > > > Key: SPARK-28043 > URL: https://issues.apache.org/jira/browse/SPARK-28043 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Mukul Murthy >Priority: Major > > When reading a JSON blob with duplicate fields, Spark appears to ignore the > value of the first one. JSON recommends unique names but does not require it; > since JSON and Spark SQL both allow duplicate field names, we should fix the > bug where the first column value is getting dropped. > > I'm guessing somewhere when parsing JSON, we're turning it into a Map which > is causing the first value to be overridden. 
> > Repro (Python, 2.4): > {code} > scala> val jsonRDD = spark.sparkContext.parallelize(Seq("[{ \"a\": \"blah\", > \"a\": \"blah2\"} ]")) > jsonRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[1] at > parallelize at :23 > scala> val df = spark.read.json(jsonRDD) > df: org.apache.spark.sql.DataFrame = [a: string, a: string] > > scala> df.show > ++-+ > | a|a| > ++-+ > |null|blah2| > ++-+ > {code} > > The expected response would be: > {code} > ++-+ > | a|a| > ++-+ > |blah|blah2| > ++-+ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28043) Reading json with duplicate columns drops the first column value
[ https://issues.apache.org/jira/browse/SPARK-28043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16863680#comment-16863680 ] Liang-Chi Hsieh commented on SPARK-28043: - I tried to look around that, like https://stackoverflow.com/questions/21832701/does-json-syntax-allow-duplicate-keys-in-an-object. So JSON doesn't disallow duplicate keys. Spark SQL doesn't disallow duplicate field names either, although it can impose some difficulties when using a DataFrame with duplicate field names. To clarify, just because Spark SQL allows duplicate field names doesn't mean that we should use such a feature. But I think that, to some extent, the current behavior isn't consistent. {code} scala> val jsonRDD = spark.sparkContext.parallelize(Seq("[{ \"a\": \"blah\", \"a\": \"blah2\"} ]")) jsonRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[1] at parallelize at :23 scala> val df = spark.read.json(jsonRDD) df: org.apache.spark.sql.DataFrame = [a: string, a: string] scala> df.show ++-+ | a|a| ++-+ |null|blah2| ++-+ {code} > Reading json with duplicate columns drops the first column value > > > Key: SPARK-28043 > URL: https://issues.apache.org/jira/browse/SPARK-28043 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Mukul Murthy >Priority: Major > > When reading a JSON blob with duplicate fields, Spark appears to ignore the > value of the first one. JSON recommends unique names but does not require it; > since JSON and Spark SQL both allow duplicate field names, we should fix the > bug where the first column value is getting dropped. > > I'm guessing somewhere when parsing JSON, we're turning it into a Map which > is causing the first value to be overridden. 
> > Repro (Python, 2.4): > >>> jsonRDD = spark.sparkContext.parallelize(["\\{ \"a\": \"blah\", \"a\": > >>> \"blah2\"}"]) > >>> df = spark.read.json(jsonRDD) > >>> df.show() > +-++ > |a|a| > +-++ > |null|blah2| > +-++ > > The expected response would be: > +-++ > |a|a| > +-++ > |blah|blah2| > +-++ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
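For comparison, Python's stdlib `json` module (not Spark's Jackson-based parser) shows the same last-key-wins behavior, and its `object_pairs_hook` is one way to surface or forbid duplicates, roughly the two options discussed in this thread:

```python
import json

doc = '{ "a": "blah", "a": "blah2" }'

# Default object decoding keeps only the last value for a duplicate key,
# the same "first value dropped" symptom reported here.
assert json.loads(doc) == {"a": "blah2"}

# object_pairs_hook receives every (key, value) pair in document order,
# so duplicates can be preserved positionally...
pairs = json.loads(doc, object_pairs_hook=list)
assert pairs == [("a", "blah"), ("a", "blah2")]

# ...or rejected outright, mirroring the "forbid duplicate keys" option.
def strict_object(pairs):
    keys = [k for k, _ in pairs]
    if len(keys) != len(set(keys)):
        raise ValueError("duplicate JSON keys: %s" % sorted(set(keys)))
    return dict(pairs)
```

Either policy requires the decoder to see the raw pair sequence before it collapses objects into a map, which is exactly the step where the first value is lost today.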
[jira] [Commented] (SPARK-28006) User-defined grouped transform pandas_udf for window operations
[ https://issues.apache.org/jira/browse/SPARK-28006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16863180#comment-16863180 ] Liang-Chi Hsieh commented on SPARK-28006: - I'm curious about two questions: Can we use pandas agg udfs as window functions? And since the proposed GROUPED_XFORM udf calculates output values for all rows in the group, it looks like it can only use the window frame (UnboundedPreceding, UnboundedFollowing)? > User-defined grouped transform pandas_udf for window operations > --- > > Key: SPARK-28006 > URL: https://issues.apache.org/jira/browse/SPARK-28006 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 2.4.3 >Reporter: Li Jin >Priority: Major > > Currently, pandas_udf supports "grouped aggregate" type that can be used with > unbounded and unbounded windows. There is another set of use cases that can > benefit from a "grouped transform" type pandas_udf. > Grouped transform is defined as a N -> N mapping over a group. For example, > "compute zscore for values in the group using the grouped mean and grouped > stdev", or "rank the values in the group". > Currently, in order to do this, user needs to use "grouped apply", for > example: > {code:java} > @pandas_udf(schema, GROUPED_MAP) > def subtract_mean(pdf) > v = pdf['v'] > pdf['v'] = v - v.mean() > return pdf > df.groupby('id').apply(subtract_mean) > # +---++ > # | id| v| > # +---++ > # | 1|-0.5| > # | 1| 0.5| > # | 2|-3.0| > # | 2|-1.0| > # | 2| 4.0| > # +---++{code} > This approach has a few downside: > * Specifying the full return schema is complicated for the user although the > function only changes one column. > * The column name 'v' inside as part of the udf, makes the udf less reusable. > * The entire dataframe is serialized to pass to Python although only one > column is needed. 
> Here we propose a new type of pandas_udf to work with these types of use > cases: > {code:java} > df = spark.createDataFrame( > [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], > ("id", "v")) > @pandas_udf('double', GROUPED_XFORM) > def subtract_mean(v): > return v - v.mean() > w = Window.partitionBy('id') > df = df.withColumn('v', subtract_mean(df['v']).over(w)) > # +---++ > # | id| v| > # +---++ > # | 1|-0.5| > # | 1| 0.5| > # | 2|-3.0| > # | 2|-1.0| > # | 2| 4.0| > # +---++{code} > Which addresses the above downsides. > * The user only needs to specify the output type of a single column. > * The column being zscored is decoupled from the udf implementation > * We only need to send one column to Python worker and concat the result > with the original dataframe (this is what grouped aggregate is doing already) > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
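The N -> N "grouped transform" semantics proposed in this issue can be sketched without Spark at all (plain Python; `subtract_group_mean` is a hypothetical helper, not part of any API): each value is replaced by its deviation from the group mean, and the output has exactly as many rows as the input.

```python
from collections import defaultdict

# Plain-Python sketch of a grouped transform (subtract the group mean);
# this mirrors the GROUPED_XFORM example above, not Spark's implementation.
def subtract_group_mean(rows):
    groups = defaultdict(list)
    for gid, v in rows:
        groups[gid].append(v)
    means = {g: sum(vs) / len(vs) for g, vs in groups.items()}
    # N -> N: one output row per input row, in the original order.
    return [(gid, v - means[gid]) for gid, v in rows]

data = [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)]
print(subtract_group_mean(data))
# [(1, -0.5), (1, 0.5), (2, -3.0), (2, -1.0), (2, 4.0)]
```

Note the two-pass structure: the whole group must be seen before any output row can be produced, which is why the proposal is naturally tied to an unbounded window frame over the partition.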
[jira] [Commented] (SPARK-27966) input_file_name empty when listing files in parallel
[ https://issues.apache.org/jira/browse/SPARK-27966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16863030#comment-16863030 ] Liang-Chi Hsieh commented on SPARK-27966: - I can't see where input_file_name is from the truncated output. If it is just in the Project, I can't tell why it doesn't work. If there is no good reproducer, I agree with [~hyukjin.kwon] that we may resolve this JIRA.
> input_file_name empty when listing files in parallel
> ----------------------------------------------------
>
>                 Key: SPARK-27966
>                 URL: https://issues.apache.org/jira/browse/SPARK-27966
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.4.0
>         Environment: Databricks 5.3 (includes Apache Spark 2.4.0, Scala 2.11)
>
> Worker Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
> Workers: 3
> Driver Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
>            Reporter: Christian Homberg
>            Priority: Minor
>         Attachments: input_file_name_bug
>
> I ran into an issue similar to, and probably related to, SPARK-26128. The
> _org.apache.spark.sql.functions.input_file_name_ is sometimes empty.
>
> {code:java}
> df.select(input_file_name()).show(5, false)
> {code}
>
> {code:java}
> +-----------------+
> |input_file_name()|
> +-----------------+
> |                 |
> |                 |
> |                 |
> |                 |
> |                 |
> +-----------------+
> {code}
> My environment is Databricks, and debugging the Log4j output showed me that
> the issue occurred when the files were being listed in parallel, e.g. when
> {code:java}
> 19/06/06 11:50:47 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 127; threshold: 32
> 19/06/06 11:50:47 INFO InMemoryFileIndex: Listing leaf files and directories in parallel under:{code}
>
> Everything's fine as long as
> {code:java}
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 6; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 0; threshold: 32
> {code}
>
> Setting spark.sql.sources.parallelPartitionDiscovery.threshold to
> resolves the issue for me.
>
> *edit: the problem is not exclusively linked to listing files in parallel.
> I've set up a larger cluster for which, after parallel file listing, the
> input_file_name did return the correct filename. After inspecting the log4j
> output again, I assume that it's linked to some kind of MetaStore being full.
> I've attached a section of the log4j output that I think should indicate why
> it's failing. If you need more, please let me know.*
>
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
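The logs above show InMemoryFileIndex switching to parallel listing when the number of paths (127) exceeds the configured threshold (32), and staying sequential for small path counts (6 or 0). A simplified Python model of that decision rule (a sketch of the behavior the logs suggest, not Spark's actual implementation; `uses_parallel_listing` is a name invented here):

```python
def uses_parallel_listing(num_paths: int, threshold: int = 32) -> bool:
    """Simplified model: leaf files are listed in parallel only when the
    number of paths exceeds the threshold controlled by
    spark.sql.sources.parallelPartitionDiscovery.threshold.
    The exact boundary comparison is an assumption inferred from the logs."""
    return num_paths > threshold

# From the reporter's logs: 127 paths with threshold 32 triggers parallel
# listing; raising the threshold above the path count keeps it sequential,
# which is the workaround the reporter describes.
```

This mirrors why raising the threshold config value worked around the reporter's symptom: with a large enough threshold, listing never goes down the parallel code path.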