[jira] [Commented] (SPARK-22517) NullPointerException in ShuffleExternalSorter.spill()
[ https://issues.apache.org/jira/browse/SPARK-22517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16319897#comment-16319897 ] Andreas Maier commented on SPARK-22517: --- [~willshen] I couldn't solve the issue. I circumvented it by splitting my Spark job into several smaller jobs. I did this to avoid spilling from memory to disk, so the exception is not triggered. But in the meantime Spark 2.2.1 has been released with a fix for SPARK-21907. Maybe it also fixes this bug. So if you can, try the newest version of Spark. > NullPointerException in ShuffleExternalSorter.spill() > - > > Key: SPARK-22517 > URL: https://issues.apache.org/jira/browse/SPARK-22517 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Andreas Maier > > I see a NullPointerException during sorting with the following stacktrace: > {code} > 17/11/13 15:02:56 ERROR Executor: Exception in task 138.0 in stage 9.0 (TID > 13497) > java.lang.NullPointerException > at > org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:383) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:193) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.spill(ShuffleExternalSorter.java:254) > at > org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203) > at > org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281) > at > org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90) > at > org.apache.spark.shuffle.sort.ShuffleInMemorySorter.reset(ShuffleInMemorySorter.java:100) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.spill(ShuffleExternalSorter.java:256) > at > org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203) > at > org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281) > at > org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.growPointerArrayIfNecessary(ShuffleExternalSorter.java:328) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:379) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:246) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:167) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22627) Fix formatting of headers in configuration.html page
[ https://issues.apache.org/jira/browse/SPARK-22627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16268807#comment-16268807 ] Andreas Maier edited comment on SPARK-22627 at 11/28/17 2:26 PM: - That is strange. Commits of SPARK-19106 are in branch master and in the tag v2.2.0, but the version deployed online https://spark.apache.org/docs/2.2.0/configuration.html still has the formatting problems. Maybe not all necessary commits for SPARK-19106 made it into the tag v2.2.0? was (Author: asmaier): That is strange. The commits of SPARK-19106 are in branch master and in the tag v2.2.0, but the version deployed online https://spark.apache.org/docs/2.2.0/configuration.html still has the formatting problems. > Fix formatting of headers in configuration.html page > > > Key: SPARK-22627 > URL: https://issues.apache.org/jira/browse/SPARK-22627 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.2.0 >Reporter: Andreas Maier >Priority: Minor > > On the page https://spark.apache.org/docs/latest/configuration.html one can > see headers in the HTML which look like left overs from the conversion from > Markdown: > {code} > ### Execution Behavior > ... > ### Networking > ... > ### Scheduling > ... > etc... > {code} > The most problems with formatting has the paragraph > {code} > ### Cluster Managers Each cluster manager in Spark has additional > configuration options. Configurations can be found on the pages for each > mode: [YARN](running-on-yarn.html#configuration) > [Mesos](running-on-mesos.html#configuration) [Standalone > Mode](spark-standalone.html#cluster-launch-scripts) # Environment Variables > ... > {code} > As a reader of the documentation I want the headers in the text to be > formatted correctly and not showing Markdown syntax. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22627) Fix formatting of headers in configuration.html page
[ https://issues.apache.org/jira/browse/SPARK-22627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16268807#comment-16268807 ] Andreas Maier commented on SPARK-22627: --- That is strange. The commits of SPARK-19106 are in branch master and in the tag v2.2.0, but the version deployed online https://spark.apache.org/docs/2.2.0/configuration.html still has the formatting problems. > Fix formatting of headers in configuration.html page > > > Key: SPARK-22627 > URL: https://issues.apache.org/jira/browse/SPARK-22627 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.2.0 >Reporter: Andreas Maier >Priority: Minor > > On the page https://spark.apache.org/docs/latest/configuration.html one can > see headers in the HTML which look like left overs from the conversion from > Markdown: > {code} > ### Execution Behavior > ... > ### Networking > ... > ### Scheduling > ... > etc... > {code} > The most problems with formatting has the paragraph > {code} > ### Cluster Managers Each cluster manager in Spark has additional > configuration options. Configurations can be found on the pages for each > mode: [YARN](running-on-yarn.html#configuration) > [Mesos](running-on-mesos.html#configuration) [Standalone > Mode](spark-standalone.html#cluster-launch-scripts) # Environment Variables > ... > {code} > As a reader of the documentation I want the headers in the text to be > formatted correctly and not showing Markdown syntax. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22631) Consolidate all configuration properties into one page
Andreas Maier created SPARK-22631: - Summary: Consolidate all configuration properties into one page Key: SPARK-22631 URL: https://issues.apache.org/jira/browse/SPARK-22631 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 2.2.0 Reporter: Andreas Maier The page https://spark.apache.org/docs/2.2.0/configuration.html gives the impression that all of Spark's configuration properties are described on that page. Unfortunately, this is not true. The description of important properties is spread across the documentation. The following pages list properties that are not described on the configuration page: https://spark.apache.org/docs/2.2.0/sql-programming-guide.html#performance-tuning https://spark.apache.org/docs/2.2.0/monitoring.html#spark-configuration-options https://spark.apache.org/docs/2.2.0/security.html#ssl-configuration https://spark.apache.org/docs/2.2.0/sparkr.html#starting-up-from-rstudio https://spark.apache.org/docs/2.2.0/running-on-yarn.html#spark-properties https://spark.apache.org/docs/2.2.0/running-on-mesos.html#configuration https://spark.apache.org/docs/2.2.0/spark-standalone.html#cluster-launch-scripts As a reader of the documentation I would like a single central web page describing all Spark configuration properties. Alternatively, it would be nice to at least add links from the configuration page to the other documentation pages where configuration properties are described. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22627) Fix formatting of headers in configuration.html page
Andreas Maier created SPARK-22627: - Summary: Fix formatting of headers in configuration.html page Key: SPARK-22627 URL: https://issues.apache.org/jira/browse/SPARK-22627 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 2.2.0 Reporter: Andreas Maier Priority: Minor On the page https://spark.apache.org/docs/latest/configuration.html one can see headers in the HTML that look like leftovers from the Markdown conversion: {code} ### Execution Behavior ... ### Networking ... ### Scheduling ... etc... {code} The paragraph with the most formatting problems is {code} ### Cluster Managers Each cluster manager in Spark has additional configuration options. Configurations can be found on the pages for each mode: [YARN](running-on-yarn.html#configuration) [Mesos](running-on-mesos.html#configuration) [Standalone Mode](spark-standalone.html#cluster-launch-scripts) # Environment Variables ... {code} As a reader of the documentation I want the headers in the text to be formatted correctly instead of showing raw Markdown syntax. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22616) df.cache() / df.persist() should have an option blocking like df.unpersist()
[ https://issues.apache.org/jira/browse/SPARK-22616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16268530#comment-16268530 ] Andreas Maier commented on SPARK-22616: --- Ok, I understand your point now. You were thinking in terms of bytecode compatibility and I was just thinking in terms of source code compatibility. > df.cache() / df.persist() should have an option blocking like df.unpersist() > > > Key: SPARK-22616 > URL: https://issues.apache.org/jira/browse/SPARK-22616 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 2.2.0 >Reporter: Andreas Maier >Priority: Minor > > The method dataframe.unpersist() has an option blocking, which allows for > eager unpersisting of a dataframe. On the other side the method > dataframe.cache() and dataframe.persist() don't have a comparable option. A > (undocumented) workaround for this is to call dataframe.count() directly > after cache() or persist(). But for API consistency and convenience it would > make sense to give cache() and persist() also the option blocking. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22616) df.cache() / df.persist() should have an option blocking like df.unpersist()
[ https://issues.apache.org/jira/browse/SPARK-22616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16266995#comment-16266995 ] Andreas Maier commented on SPARK-22616: --- I don't see how simply adding an option "blocking" with default value "false" is a breaking API change. All the old code would behave as before, only new code could set e.g. df.cache(blocking=True) and see a different behaviour. Or am I wrong? > df.cache() / df.persist() should have an option blocking like df.unpersist() > > > Key: SPARK-22616 > URL: https://issues.apache.org/jira/browse/SPARK-22616 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 2.2.0 >Reporter: Andreas Maier >Priority: Minor > > The method dataframe.unpersist() has an option blocking, which allows for > eager unpersisting of a dataframe. On the other side the method > dataframe.cache() and dataframe.persist() don't have a comparable option. A > (undocumented) workaround for this is to call dataframe.count() directly > after cache() or persist(). But for API consistency and convenience it would > make sense to give cache() and persist() also the option blocking. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22616) df.cache() / df.persist() should have an option blocking like df.unpersist()
Andreas Maier created SPARK-22616: - Summary: df.cache() / df.persist() should have an option blocking like df.unpersist() Key: SPARK-22616 URL: https://issues.apache.org/jira/browse/SPARK-22616 Project: Spark Issue Type: Improvement Components: PySpark, Spark Core Affects Versions: 2.2.0 Reporter: Andreas Maier Priority: Minor The method dataframe.unpersist() has an option blocking, which allows for eager unpersisting of a dataframe. On the other hand, the methods dataframe.cache() and dataframe.persist() don't have a comparable option. An (undocumented) workaround for this is to call dataframe.count() directly after cache() or persist(). But for API consistency and convenience it would make sense to give cache() and persist() the option blocking as well. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
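A minimal PySpark sketch of the behaviour discussed in SPARK-22616: the count()-after-cache() workaround comes from the ticket, while the cache(blocking=True) call is only the proposed, hypothetical API and does not exist in Spark 2.2.0.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("eager-cache-sketch").getOrCreate()
df = spark.range(1000)

# Undocumented workaround from the ticket: cache() only marks the dataframe,
# so a full action such as count() is needed to materialize the cache eagerly.
df.cache()
df.count()

# Proposed API (hypothetical, not available in Spark 2.2.0):
# df.cache(blocking=True)  # would block until the cache is materialized,
#                          # mirroring df.unpersist(blocking=True)

df.unpersist(blocking=True)  # unpersist() already offers the blocking option
{code}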
[jira] [Created] (SPARK-22613) Make UNCACHE TABLE behaviour consistent with CACHE TABLE
Andreas Maier created SPARK-22613: - Summary: Make UNCACHE TABLE behaviour consistent with CACHE TABLE Key: SPARK-22613 URL: https://issues.apache.org/jira/browse/SPARK-22613 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Affects Versions: 2.2.0 Reporter: Andreas Maier Priority: Minor The Spark SQL function CACHE TABLE is eager by default. Therefore it offers an optional keyword LAZY in case you do not want to cache the complete table immediately (See https://docs.databricks.com/spark/latest/spark-sql/language-manual/cache-table.html). But the corresponding Spark SQL function UNCACHE TABLE is lazy by default and doesn't offer an option EAGER (See https://docs.databricks.com/spark/latest/spark-sql/language-manual/uncache-table.html, https://stackoverflow.com/questions/47226494/is-uncache-table-a-lazy-operation-in-spark-sql). So one cannot cache and uncache a table in an eager way using Spark SQL. As a user I want an option EAGER for UNCACHE TABLE. An alternative could be to change the behaviour of UNCACHE TABLE to be eager by default (consistent with CACHE TABLE) and then offer an option LAZY also for UNCACHE TABLE. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
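A short sketch of the asymmetry described in SPARK-22613, driven from PySpark. The CACHE TABLE, CACHE LAZY TABLE and UNCACHE TABLE statements exist in Spark 2.2.0; the EAGER keyword at the end is only the proposed, hypothetical syntax.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("uncache-sketch").getOrCreate()
spark.range(100).createOrReplaceTempView("t")

spark.sql("CACHE LAZY TABLE t")              # optional LAZY keyword defers caching
spark.sql("SELECT COUNT(*) FROM t").show()   # first action materializes the cache
spark.sql("UNCACHE TABLE t")                 # no EAGER keyword is accepted here

# Proposed (hypothetical) syntax for symmetry with CACHE TABLE:
# spark.sql("UNCACHE TABLE t EAGER")
{code}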
[jira] [Commented] (SPARK-22517) NullPointerException in ShuffleExternalSorter.spill()
[ https://issues.apache.org/jira/browse/SPARK-22517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16257660#comment-16257660 ] Andreas Maier commented on SPARK-22517: --- Unfortunately I don't have minimal code to reproduce the problem. It occurred after several hours in a computation with hundreds of gigabytes of data, after some data was spilled from memory onto disk. But the problem looks similar to SPARK-21907. > NullPointerException in ShuffleExternalSorter.spill() > - > > Key: SPARK-22517 > URL: https://issues.apache.org/jira/browse/SPARK-22517 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Andreas Maier > > I see a NullPointerException during sorting with the following stacktrace: > {code} > 17/11/13 15:02:56 ERROR Executor: Exception in task 138.0 in stage 9.0 (TID > 13497) > java.lang.NullPointerException > at > org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:383) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:193) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.spill(ShuffleExternalSorter.java:254) > at > org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203) > at > org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281) > at > org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90) > at > org.apache.spark.shuffle.sort.ShuffleInMemorySorter.reset(ShuffleInMemorySorter.java:100) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.spill(ShuffleExternalSorter.java:256) > at > org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203) > at > org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281) > at > org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.growPointerArrayIfNecessary(ShuffleExternalSorter.java:328) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:379) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:246) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:167) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22517) NullPointerException in ShuffleExternalSorter.spill()
Andreas Maier created SPARK-22517: - Summary: NullPointerException in ShuffleExternalSorter.spill() Key: SPARK-22517 URL: https://issues.apache.org/jira/browse/SPARK-22517 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0 Reporter: Andreas Maier I see a NullPointerException during sorting with the following stacktrace: {code} 17/11/13 15:02:56 ERROR Executor: Exception in task 138.0 in stage 9.0 (TID 13497) java.lang.NullPointerException at org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:383) at org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:193) at org.apache.spark.shuffle.sort.ShuffleExternalSorter.spill(ShuffleExternalSorter.java:254) at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203) at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281) at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90) at org.apache.spark.shuffle.sort.ShuffleInMemorySorter.reset(ShuffleInMemorySorter.java:100) at org.apache.spark.shuffle.sort.ShuffleExternalSorter.spill(ShuffleExternalSorter.java:256) at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203) at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281) at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90) at org.apache.spark.shuffle.sort.ShuffleExternalSorter.growPointerArrayIfNecessary(ShuffleExternalSorter.java:328) at org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:379) at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:246) at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:167) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22436) New function strip() to remove all whitespace from string
[ https://issues.apache.org/jira/browse/SPARK-22436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16237446#comment-16237446 ] Andreas Maier commented on SPARK-22436: --- Python UDFs are very slow, aren't they? I believe a Spark native function would be much faster. And in fact it was already available with trim() before SPARK-17299 . > New function strip() to remove all whitespace from string > - > > Key: SPARK-22436 > URL: https://issues.apache.org/jira/browse/SPARK-22436 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 2.2.0 >Reporter: Andreas Maier >Priority: Minor > Labels: features > > Since ticket SPARK-17299 the [trim() > function|https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.trim] > will not remove any whitespace characters from beginning and end of a string > but only spaces. This is correct in regard to the SQL standard, but it opens > a gap in functionality. > My suggestion is to add to the Spark API in analogy to pythons standard > library the functions l/r/strip(), which should remove all whitespace > characters from a string from beginning and/or end of a string respectively. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22436) New function strip() to remove all whitespace from string
Andreas Maier created SPARK-22436: - Summary: New function strip() to remove all whitespace from string Key: SPARK-22436 URL: https://issues.apache.org/jira/browse/SPARK-22436 Project: Spark Issue Type: Improvement Components: PySpark, Spark Core Affects Versions: 2.2.0 Reporter: Andreas Maier Priority: Minor Since ticket SPARK-17299 the [trim() function|https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.trim] no longer removes all whitespace characters from the beginning and end of a string, but only spaces. This is correct with regard to the SQL standard, but it opens a gap in functionality. My suggestion is to add the functions l/r/strip() to the Spark API, in analogy to Python's standard library; they should remove all whitespace characters from the beginning and/or end of a string, respectively. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
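Until a native strip() exists, a workaround in PySpark is to strip all leading and trailing whitespace with regexp_replace(). The sketch below shows this; the strip()/lstrip()/rstrip() names at the end are only the proposed, hypothetical functions from the ticket.

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("strip-sketch").getOrCreate()
df = spark.createDataFrame([("\t hello world \n",)], ["s"])

# Workaround: remove all leading/trailing whitespace (tabs, newlines, ...)
# with a regular expression, since trim() only removes spaces.
stripped = df.select(F.regexp_replace("s", r"^\s+|\s+$", "").alias("s"))
stripped.show()

# Proposed (hypothetical) API, analogous to Python's str.strip()/lstrip()/rstrip():
# F.strip("s"), F.lstrip("s"), F.rstrip("s")
{code}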
[jira] [Created] (SPARK-22428) Document spark properties for configuring the ContextCleaner
Andreas Maier created SPARK-22428: - Summary: Document spark properties for configuring the ContextCleaner Key: SPARK-22428 URL: https://issues.apache.org/jira/browse/SPARK-22428 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 2.2.0 Reporter: Andreas Maier Priority: Minor The Spark properties for configuring the ContextCleaner, as described on https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-service-contextcleaner.html, are not documented in the official documentation at https://spark.apache.org/docs/latest/configuration.html#available-properties. As a user I would like to have the following Spark properties documented in the official documentation: {code:java} spark.cleaner.periodicGC.interval spark.cleaner.referenceTracking spark.cleaner.referenceTracking.blocking spark.cleaner.referenceTracking.blocking.shuffle spark.cleaner.referenceTracking.cleanCheckpoints {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
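For reference, a sketch of how the ContextCleaner properties listed above would typically be set from PySpark. The property names come from the ticket; the values are purely illustrative, not recommended or default values.

{code:python}
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Illustrative values only; see the ticket for the property names.
conf = (SparkConf()
        .set("spark.cleaner.periodicGC.interval", "30min")
        .set("spark.cleaner.referenceTracking", "true")
        .set("spark.cleaner.referenceTracking.blocking", "true")
        .set("spark.cleaner.referenceTracking.blocking.shuffle", "false")
        .set("spark.cleaner.referenceTracking.cleanCheckpoints", "false"))

spark = SparkSession.builder.config(conf=conf).getOrCreate()
{code}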
[jira] [Created] (SPARK-22369) PySpark: Document methods of spark.catalog interface
Andreas Maier created SPARK-22369: - Summary: PySpark: Document methods of spark.catalog interface Key: SPARK-22369 URL: https://issues.apache.org/jira/browse/SPARK-22369 Project: Spark Issue Type: Documentation Components: PySpark Affects Versions: 2.2.0 Reporter: Andreas Maier The following methods from the {{spark.catalog}} interface are not documented: {code:java} $ pyspark >>> dir(spark.catalog) ['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_jcatalog', '_jsparkSession', '_reset', '_sparkSession', 'cacheTable', 'clearCache', 'createExternalTable', 'createTable', 'currentDatabase', 'dropGlobalTempView', 'dropTempView', 'isCached', 'listColumns', 'listDatabases', 'listFunctions', 'listTables', 'recoverPartitions', 'refreshByPath', 'refreshTable', 'registerFunction', 'setCurrentDatabase', 'uncacheTable'] {code} As a user I would like to have these methods documented on http://spark.apache.org/docs/latest/api/python/pyspark.sql.html. The documentation of the old SQLContext methods (e.g. {{pyspark.sql.SQLContext.cacheTable()}} vs. {{pyspark.sql.SparkSession.catalog.cacheTable()}}, or {{pyspark.sql.HiveContext.refreshTable()}} vs. {{pyspark.sql.SparkSession.catalog.refreshTable()}}) should point to the new methods. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
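A small PySpark sketch contrasting a few of the undocumented spark.catalog methods listed above with their older SQLContext-style counterparts; the table name is made up for illustration.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("catalog-sketch").getOrCreate()
spark.range(10).createOrReplaceTempView("my_table")

# Newer catalog interface (methods from dir(spark.catalog) above):
spark.catalog.cacheTable("my_table")
print(spark.catalog.isCached("my_table"))
print([t.name for t in spark.catalog.listTables()])
spark.catalog.uncacheTable("my_table")

# Older SQLContext-style calls that the documentation could cross-reference:
# sqlContext.cacheTable("my_table")
# sqlContext.uncacheTable("my_table")
{code}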
[jira] [Commented] (SPARK-22249) UnsupportedOperationException: empty.reduceLeft when caching a dataframe
[ https://issues.apache.org/jira/browse/SPARK-22249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16216634#comment-16216634 ] Andreas Maier commented on SPARK-22249: --- Thank you for solving this issue so quickly. Not every open source project is reacting so fast. > UnsupportedOperationException: empty.reduceLeft when caching a dataframe > > > Key: SPARK-22249 > URL: https://issues.apache.org/jira/browse/SPARK-22249 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 > Environment: $ uname -a > Darwin MAC-UM-024.local 16.7.0 Darwin Kernel Version 16.7.0: Thu Jun 15 > 17:36:27 PDT 2017; root:xnu-3789.70.16~2/RELEASE_X86_64 x86_64 > $ pyspark --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.2.0 > /_/ > > Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_92 > Branch > Compiled by user jenkins on 2017-06-30T22:58:04Z > Revision > Url >Reporter: Andreas Maier >Assignee: Marco Gaido > Fix For: 2.2.1, 2.3.0 > > > It seems that the {{isin()}} method with an empty list as argument only > works, if the dataframe is not cached. If it is cached, it results in an > exception. To reproduce > {code:java} > $ pyspark > >>> df = spark.createDataFrame([pyspark.Row(KEY="value")]) > >>> df.where(df["KEY"].isin([])).show() > +---+ > |KEY| > +---+ > +---+ > >>> df.cache() > DataFrame[KEY: string] > >>> df.where(df["KEY"].isin([])).show() > Traceback (most recent call last): > File "", line 1, in > File > "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/sql/dataframe.py", > line 336, in show > print(self._jdf.showString(n, 20)) > File > "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", > line 1133, in __call__ > File > "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/sql/utils.py", > line 63, in deco > return f(*a, **kw) > File > "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", > line 319, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o302.showString. 
> : java.lang.UnsupportedOperationException: empty.reduceLeft > at > scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:180) > at > scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$reduceLeft(ArrayBuffer.scala:48) > at > scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:74) > at scala.collection.mutable.ArrayBuffer.reduceLeft(ArrayBuffer.scala:48) > at > scala.collection.TraversableOnce$class.reduce(TraversableOnce.scala:208) > at scala.collection.AbstractTraversable.reduce(Traversable.scala:104) > at > org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$1.applyOrElse(InMemoryTableScanExec.scala:107) > at > org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$1.applyOrElse(InMemoryTableScanExec.scala:71) > at scala.PartialFunction$Lifted.apply(PartialFunction.scala:223) > at scala.PartialFunction$Lifted.apply(PartialFunction.scala:219) > at > org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$2.apply(InMemoryTableScanExec.scala:112) > at > org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$2.apply(InMemoryTableScanExec.scala:111) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:381) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.immutable.List.flatMap(List.scala:344) > at > org.apache.spark.sql.execution.columnar.InMemoryTableScanExec.(InMemoryTableScanExec.scala:111) > at > org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$$anonfun$3.apply(SparkStrategies.scala:307) > at > org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$$anonfun$3.apply(SparkStrategies.scala:307) > at > org.apache.spark.sql.execution.SparkPlanner.pruneFilterProject(SparkPlanner.scala:99) > at > org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$.apply(SparkStrategies.scala:303) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonf
[jira] [Created] (SPARK-22249) UnsupportedOperationException: empty.reduceLeft when caching a dataframe
Andreas Maier created SPARK-22249: - Summary: UnsupportedOperationException: empty.reduceLeft when caching a dataframe Key: SPARK-22249 URL: https://issues.apache.org/jira/browse/SPARK-22249 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.2.0 Environment: $ uname -a Darwin MAC-UM-024.local 16.7.0 Darwin Kernel Version 16.7.0: Thu Jun 15 17:36:27 PDT 2017; root:xnu-3789.70.16~2/RELEASE_X86_64 x86_64 $ pyspark --version Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.2.0 /_/ Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_92 Branch Compiled by user jenkins on 2017-06-30T22:58:04Z Revision Url Reporter: Andreas Maier It seems that the {{isin()}} method with an empty list as argument only works, if the dataframe is not cached. If it is cached, it results in an exception. To reproduce {code:java} $ pyspark >>> df = spark.createDataFrame([pyspark.Row(KEY="value")]) >>> df.where(df["KEY"].isin([])).show() +---+ |KEY| +---+ +---+ >>> df.cache() DataFrame[KEY: string] >>> df.where(df["KEY"].isin([])).show() Traceback (most recent call last): File "", line 1, in File "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/sql/dataframe.py", line 336, in show print(self._jdf.showString(n, 20)) File "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__ File "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, in deco return f(*a, **kw) File "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o302.showString. 
: java.lang.UnsupportedOperationException: empty.reduceLeft at scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:180) at scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$reduceLeft(ArrayBuffer.scala:48) at scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:74) at scala.collection.mutable.ArrayBuffer.reduceLeft(ArrayBuffer.scala:48) at scala.collection.TraversableOnce$class.reduce(TraversableOnce.scala:208) at scala.collection.AbstractTraversable.reduce(Traversable.scala:104) at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$1.applyOrElse(InMemoryTableScanExec.scala:107) at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$1.applyOrElse(InMemoryTableScanExec.scala:71) at scala.PartialFunction$Lifted.apply(PartialFunction.scala:223) at scala.PartialFunction$Lifted.apply(PartialFunction.scala:219) at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$2.apply(InMemoryTableScanExec.scala:112) at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$2.apply(InMemoryTableScanExec.scala:111) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) at scala.collection.immutable.List.flatMap(List.scala:344) at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec.(InMemoryTableScanExec.scala:111) at org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$$anonfun$3.apply(SparkStrategies.scala:307) at org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$$anonfun$3.apply(SparkStrategies.scala:307) at org.apache.spark.sql.execution.SparkPlanner.pruneFilterProject(SparkPlanner.scala:99) at org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$.apply(SparkStrategies.scala:303) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonf
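A hedged workaround sketch for the bug reproduced above on Spark 2.2.0: guard against an empty isin() list on a cached dataframe by returning an always-false filter instead. The helper function name is made up for illustration.

{code:python}
from pyspark.sql import Row, SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("isin-sketch").getOrCreate()
df = spark.createDataFrame([Row(KEY="value")])
df.cache()

def where_key_in(frame, values):
    # Workaround: isin([]) on a cached dataframe triggers the exception above,
    # so treat an empty list explicitly as "match nothing".
    if not values:
        return frame.where(F.lit(False))
    return frame.where(frame["KEY"].isin(values))

where_key_in(df, []).show()         # empty result, no empty.reduceLeft error
where_key_in(df, ["value"]).show()  # normal isin() behaviour
{code}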
[jira] [Commented] (STORM-2077) KafkaSpout doesn't retry failed tuples
[ https://issues.apache.org/jira/browse/STORM-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15712237#comment-15712237 ] Andreas Maier commented on STORM-2077: -- Is this ticket related to STORM-2087 ? > KafkaSpout doesn't retry failed tuples > -- > > Key: STORM-2077 > URL: https://issues.apache.org/jira/browse/STORM-2077 > Project: Apache Storm > Issue Type: Bug > Components: storm-kafka >Affects Versions: 1.0.2 >Reporter: Tobias Maier > > KafkaSpout does not retry all failed tuples. > We used following Configuration: > Map props = new HashMap<>(); > props.put(KafkaSpoutConfig.Consumer.GROUP_ID, "c1"); > props.put(KafkaSpoutConfig.Consumer.KEY_DESERIALIZER, > ByteArrayDeserializer.class.getName()); > props.put(KafkaSpoutConfig.Consumer.VALUE_DESERIALIZER, > ByteArrayDeserializer.class.getName()); > props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, > broker.bootstrapServer()); > KafkaSpoutStreams kafkaSpoutStreams = new > KafkaSpoutStreams.Builder(FIELDS_KAFKA_EVENT, new > String[]{"test-topic"}).build(); > KafkaSpoutTuplesBuilder kafkaSpoutTuplesBuilder = new > KafkaSpoutTuplesBuilder.Builder<>(new > KeyValueKafkaSpoutTupleBuilder("test-topic")).build(); > KafkaSpoutRetryService retryService = new > KafkaSpoutLoggedRetryExponentialBackoff(KafkaSpoutLoggedRetryExponentialBackoff.TimeInterval.milliSeconds(1), > KafkaSpoutLoggedRetryExponentialBackoff.TimeInterval.milliSeconds(1), 3, > KafkaSpoutLoggedRetryExponentialBackoff.TimeInterval.seconds(1)); > KafkaSpoutConfig config = new > KafkaSpoutConfig.Builder<>(props, kafkaSpoutStreams, kafkaSpoutTuplesBuilder, > retryService) > .setFirstPollOffsetStrategy(UNCOMMITTED_LATEST) > .setMaxUncommittedOffsets(30) > .setOffsetCommitPeriodMs(10) > .setMaxRetries(3) > .build(); > kafkaSpout = new org.apache.storm.kafka.spout.KafkaSpout<>(config); > The downstream bolt fails every tuple and we expect, that those tuple will > all be replayed. But that's not the case for every tuple. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (STORM-2087) Storm-kafka-client: Failed tuples are not always replayed
[ https://issues.apache.org/jira/browse/STORM-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15708199#comment-15708199 ] Andreas Maier commented on STORM-2087: -- Is this a duplicate of STORM-2077 ? > Storm-kafka-client: Failed tuples are not always replayed > -- > > Key: STORM-2087 > URL: https://issues.apache.org/jira/browse/STORM-2087 > Project: Apache Storm > Issue Type: Bug >Reporter: Jeff Fenchel > Time Spent: 9h > Remaining Estimate: 0h > > I am working with kafka 10 and the storm-kafka-client from master. It appears > that tuples are not always being replayed when they are failed. > With a topology that randomly fails tuples a small percentage of the time I > found that the committed kafka offset would get stuck and eventually > processing would stop even though the committed offset was no where near the > end of the topic. > I have also replicated the issue in unit tests with this PR: > https://github.com/apache/storm/pull/1679 > It seems that increasing the number of times I call nextTuple for the in > order case will make it work, but it doesn't seem to help the case where > tuples are failed out of order from which they were emitted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AVRO-1927) If a default value is set, Avro allows null values in non-nullable fields.
[ https://issues.apache.org/jira/browse/AVRO-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15568879#comment-15568879 ] Andreas Maier commented on AVRO-1927: - So I had a look at the generated Java code. Strangely enough, this code only throws an exception if no default value is set: {code} protected void validate(Field field, Object value) { if(!isValidValue(field, value)) { if(field.defaultValue() == null) { // why this check? throw new AvroRuntimeException("Field " + field + " does not accept null values"); } } } {code} I don't understand why Avro checks {{field.defaultValue() == null}} before throwing an exception. In my opinion it should always throw an exception if the field value is invalid. > If a default value is set, Avro allows null values in non-nullable fields. > -- > > Key: AVRO-1927 > URL: https://issues.apache.org/jira/browse/AVRO-1927 > Project: Avro > Issue Type: Bug > Components: java >Affects Versions: 1.8.1 >Reporter: Andreas Maier > Labels: newbie > > With an avro schema like > {code} > { > "name": "myfield", > "type": "string", > "default": "" > } > {code} > the following code should throw an exception > {code} > MyObject myObject = MyObject.newBuilder().setMyfield(null).build(); > {code} > But instead the value of myfield is set to null, which causes an exception > later when serializing myObject, because null is not a valid value for > myfield. > I believe in this case setMyfield(null) should throw an exception, > independent of the value of default. > See also > https://stackoverflow.com/questions/38509279/generated-avro-builder-set-null-doesnt-overwrite-with-default -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AVRO-1927) If a default value is set, Avro allows null values in non-nullable fields.
[ https://issues.apache.org/jira/browse/AVRO-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15568867#comment-15568867 ] Andreas Maier commented on AVRO-1927: - Sorry for the long delay. I was on vacation. I tried the code you suggested. You are right, it does throw a NullPointerException when I try to write the object: {code} java.lang.NullPointerException: null of string of de.am.MyObject at org.apache.avro.generic.GenericDatumWriter.npe(GenericDatumWriter.java:132) at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:126) at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:73) at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:60) at de.am.MyObjectTest.myObjectTest2(MyObjectTest.java:27) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) at org.junit.runners.ParentRunner.run(ParentRunner.java:363) at org.junit.runner.JUnitCore.run(JUnitCore.java:137) at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:119) at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:42) at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:234) at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:74) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144) Caused by: java.lang.NullPointerException at org.apache.avro.specific.SpecificDatumWriter.writeString(SpecificDatumWriter.java:67) at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:115) at org.apache.avro.specific.SpecificDatumWriter.writeField(SpecificDatumWriter.java:87) at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:143) at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:105) ... 30 more {code} > If a default value is set, Avro allows null values in non-nullable fields. 
> -- > > Key: AVRO-1927 > URL: https://issues.apache.org/jira/browse/AVRO-1927 > Project: Avro > Issue Type: Bug > Components: java >Affects Versions: 1.8.1 >Reporter: Andreas Maier > Labels: newbie > > With an avro schema like > {code} > { > "name": "myfield", > "type": "string", > "default": "" > } > {code} > the following code should throw an exception > {code} > MyObject myObject = MyObject.newBuilder().setMyfield(null).build(); > {code} > But instead the value of myfield is set to null, which causes an exception > later when serializing myObject, because null is not a valid value for > myfield. > I believe in this case setMyfield(null) should throw an exception, > independent of the value of default. > See also > https://stackoverflow.com/questions/38509279/generated-avro-builder-set-null-doesnt-overwrite-with-default -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (STORM-2123) Adding a ByteKeyValueScheme class
Andreas Maier created STORM-2123: Summary: Adding a ByteKeyValueScheme class Key: STORM-2123 URL: https://issues.apache.org/jira/browse/STORM-2123 Project: Apache Storm Issue Type: Improvement Components: storm-kafka Reporter: Andreas Maier It is a common use case to send and receive byte keys and values via Kafka. Unfortunately the storm-kafka package lacks support for this. The pull request https://github.com/apache/storm/pull/1711 adds such a ByteKeyValueScheme class. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (AVRO-1927) If a default value is set, Avro allows null values in non-nullable fields.
Andreas Maier created AVRO-1927: --- Summary: If a default value is set, Avro allows null values in non-nullable fields. Key: AVRO-1927 URL: https://issues.apache.org/jira/browse/AVRO-1927 Project: Avro Issue Type: Bug Components: java Affects Versions: 1.8.1 Reporter: Andreas Maier With an avro schema like {code} { "name": "myfield", "type": "string", "default": "" } {code} the following code should throw an exception {code} MyObject myObject = MyObject.newBuilder().setMyfield(null).build(); {code} But instead the value of myfield is set to null, which causes an exception later when serializing myObject, because null is not a valid value for myfield. I believe in this case setMyfield(null) should throw an exception, independent of the value of default. See also https://stackoverflow.com/questions/38509279/generated-avro-builder-set-null-doesnt-overwrite-with-default -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (STORM-2121) StringKeyValueScheme doesn't override getOutputFields()
Andreas Maier created STORM-2121: Summary: StringKeyValueScheme doesn't override getOutputFields() Key: STORM-2121 URL: https://issues.apache.org/jira/browse/STORM-2121 Project: Apache Storm Issue Type: Bug Components: storm-kafka Reporter: Andreas Maier In org.apache.storm.kafka, StringKeyValueScheme extends StringScheme. However, it doesn't override the method getOutputFields() inherited from StringScheme: {code} public Fields getOutputFields() { return new Fields(STRING_SCHEME_KEY); } {code} This method returns only one field instead of two (one for the key and one for the value), which causes problems. It would be better to override getOutputFields() in StringKeyValueScheme with, e.g., {code} @Override public Fields getOutputFields() { return new Fields(FieldNameBasedTupleToKafkaMapper.BOLT_KEY, FieldNameBasedTupleToKafkaMapper.BOLT_MESSAGE); } {code} The important thing is that getOutputFields() should return two fields, not one. -- This message was sent by Atlassian JIRA (v6.3.4#6332)