[jira] [Commented] (SPARK-22517) NullPointerException in ShuffleExternalSorter.spill()
[ https://issues.apache.org/jira/browse/SPARK-22517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16319897#comment-16319897 ] Andreas Maier commented on SPARK-22517: --- [~willshen] I couldn't solve the issue. I circumvented it by splitting my Spark job into several smaller jobs. I did this to avoid spilling from memory to disk, so the exception is not triggered. But in the meantime Spark 2.2.1 has been released with a fix for SPARK-21907. Maybe it also fixes this bug. So if you can, try the newest version of Spark. > NullPointerException in ShuffleExternalSorter.spill() > - > > Key: SPARK-22517 > URL: https://issues.apache.org/jira/browse/SPARK-22517 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Andreas Maier > > I see a NullPointerException during sorting with the following stacktrace: > {code} > 17/11/13 15:02:56 ERROR Executor: Exception in task 138.0 in stage 9.0 (TID > 13497) > java.lang.NullPointerException > at > org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:383) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:193) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.spill(ShuffleExternalSorter.java:254) > at > org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203) > at > org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281) > at > org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90) > at > org.apache.spark.shuffle.sort.ShuffleInMemorySorter.reset(ShuffleInMemorySorter.java:100) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.spill(ShuffleExternalSorter.java:256) > at > org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203) > at > org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281) > at > org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.growPointerArrayIfNecessary(ShuffleExternalSorter.java:328) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:379) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:246) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:167) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22627) Fix formatting of headers in configuration.html page
[ https://issues.apache.org/jira/browse/SPARK-22627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16268807#comment-16268807 ] Andreas Maier edited comment on SPARK-22627 at 11/28/17 2:26 PM: - That is strange. Commits of SPARK-19106 are in branch master and in the tag v2.2.0, but the version deployed online https://spark.apache.org/docs/2.2.0/configuration.html still has the formatting problems. Maybe not all necessary commits for SPARK-19106 made it into the tag v2.2.0? was (Author: asmaier): That is strange. The commits of SPARK-19106 are in branch master and in the tag v2.2.0, but the version deployed online https://spark.apache.org/docs/2.2.0/configuration.html still has the formatting problems. > Fix formatting of headers in configuration.html page > > > Key: SPARK-22627 > URL: https://issues.apache.org/jira/browse/SPARK-22627 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.2.0 >Reporter: Andreas Maier >Priority: Minor > > On the page https://spark.apache.org/docs/latest/configuration.html one can > see headers in the HTML which look like left overs from the conversion from > Markdown: > {code} > ### Execution Behavior > ... > ### Networking > ... > ### Scheduling > ... > etc... > {code} > The most problems with formatting has the paragraph > {code} > ### Cluster Managers Each cluster manager in Spark has additional > configuration options. Configurations can be found on the pages for each > mode: [YARN](running-on-yarn.html#configuration) > [Mesos](running-on-mesos.html#configuration) [Standalone > Mode](spark-standalone.html#cluster-launch-scripts) # Environment Variables > ... > {code} > As a reader of the documentation I want the headers in the text to be > formatted correctly and not showing Markdown syntax. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22627) Fix formatting of headers in configuration.html page
[ https://issues.apache.org/jira/browse/SPARK-22627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16268807#comment-16268807 ] Andreas Maier commented on SPARK-22627: --- That is strange. The commits of SPARK-19106 are in branch master and in the tag v2.2.0, but the version deployed online https://spark.apache.org/docs/2.2.0/configuration.html still has the formatting problems. > Fix formatting of headers in configuration.html page > > > Key: SPARK-22627 > URL: https://issues.apache.org/jira/browse/SPARK-22627 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.2.0 >Reporter: Andreas Maier >Priority: Minor > > On the page https://spark.apache.org/docs/latest/configuration.html one can > see headers in the HTML which look like left overs from the conversion from > Markdown: > {code} > ### Execution Behavior > ... > ### Networking > ... > ### Scheduling > ... > etc... > {code} > The most problems with formatting has the paragraph > {code} > ### Cluster Managers Each cluster manager in Spark has additional > configuration options. Configurations can be found on the pages for each > mode: [YARN](running-on-yarn.html#configuration) > [Mesos](running-on-mesos.html#configuration) [Standalone > Mode](spark-standalone.html#cluster-launch-scripts) # Environment Variables > ... > {code} > As a reader of the documentation I want the headers in the text to be > formatted correctly and not showing Markdown syntax. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22631) Consolidate all configuration properties into one page
Andreas Maier created SPARK-22631: - Summary: Consolidate all configuration properties into one page Key: SPARK-22631 URL: https://issues.apache.org/jira/browse/SPARK-22631 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 2.2.0 Reporter: Andreas Maier The page https://spark.apache.org/docs/2.2.0/configuration.html gives the impression that all of Spark's configuration properties are described on that page. Unfortunately, this is not true. The description of important properties is spread across the documentation. The following pages list properties that are not described on the configuration page: https://spark.apache.org/docs/2.2.0/sql-programming-guide.html#performance-tuning https://spark.apache.org/docs/2.2.0/monitoring.html#spark-configuration-options https://spark.apache.org/docs/2.2.0/security.html#ssl-configuration https://spark.apache.org/docs/2.2.0/sparkr.html#starting-up-from-rstudio https://spark.apache.org/docs/2.2.0/running-on-yarn.html#spark-properties https://spark.apache.org/docs/2.2.0/running-on-mesos.html#configuration https://spark.apache.org/docs/2.2.0/spark-standalone.html#cluster-launch-scripts As a reader of the documentation I would like a single central web page describing all Spark configuration properties. Alternatively, it would be nice to at least add links from the configuration page to the other documentation pages where configuration properties are described. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22627) Fix formatting of headers in configuration.html page
Andreas Maier created SPARK-22627: - Summary: Fix formatting of headers in configuration.html page Key: SPARK-22627 URL: https://issues.apache.org/jira/browse/SPARK-22627 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 2.2.0 Reporter: Andreas Maier Priority: Minor On the page https://spark.apache.org/docs/latest/configuration.html one can see headers in the HTML that look like leftovers from the Markdown conversion: {code} ### Execution Behavior ... ### Networking ... ### Scheduling ... etc... {code} The paragraph with the most formatting problems is {code} ### Cluster Managers Each cluster manager in Spark has additional configuration options. Configurations can be found on the pages for each mode: [YARN](running-on-yarn.html#configuration) [Mesos](running-on-mesos.html#configuration) [Standalone Mode](spark-standalone.html#cluster-launch-scripts) # Environment Variables ... {code} As a reader of the documentation I want the headers in the text to be formatted correctly instead of showing raw Markdown syntax. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22616) df.cache() / df.persist() should have an option blocking like df.unpersist()
[ https://issues.apache.org/jira/browse/SPARK-22616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16268530#comment-16268530 ] Andreas Maier commented on SPARK-22616: --- Ok, I understand your point now. You were thinking in terms of bytecode compatibility and I was just thinking in terms of source code compatibility. > df.cache() / df.persist() should have an option blocking like df.unpersist() > > > Key: SPARK-22616 > URL: https://issues.apache.org/jira/browse/SPARK-22616 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 2.2.0 >Reporter: Andreas Maier >Priority: Minor > > The method dataframe.unpersist() has an option blocking, which allows for > eager unpersisting of a dataframe. On the other side the method > dataframe.cache() and dataframe.persist() don't have a comparable option. A > (undocumented) workaround for this is to call dataframe.count() directly > after cache() or persist(). But for API consistency and convenience it would > make sense to give cache() and persist() also the option blocking. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22616) df.cache() / df.persist() should have an option blocking like df.unpersist()
[ https://issues.apache.org/jira/browse/SPARK-22616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16266995#comment-16266995 ] Andreas Maier commented on SPARK-22616: --- I don't see how simply adding an option "blocking" with default value "false" is a breaking API change. All the old code would behave as before, only new code could set e.g. df.cache(blocking=True) and see a different behaviour. Or am I wrong? > df.cache() / df.persist() should have an option blocking like df.unpersist() > > > Key: SPARK-22616 > URL: https://issues.apache.org/jira/browse/SPARK-22616 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 2.2.0 >Reporter: Andreas Maier >Priority: Minor > > The method dataframe.unpersist() has an option blocking, which allows for > eager unpersisting of a dataframe. On the other side the method > dataframe.cache() and dataframe.persist() don't have a comparable option. A > (undocumented) workaround for this is to call dataframe.count() directly > after cache() or persist(). But for API consistency and convenience it would > make sense to give cache() and persist() also the option blocking. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22616) df.cache() / df.persist() should have an option blocking like df.unpersist()
Andreas Maier created SPARK-22616: - Summary: df.cache() / df.persist() should have an option blocking like df.unpersist() Key: SPARK-22616 URL: https://issues.apache.org/jira/browse/SPARK-22616 Project: Spark Issue Type: Improvement Components: PySpark, Spark Core Affects Versions: 2.2.0 Reporter: Andreas Maier Priority: Minor The method dataframe.unpersist() has an option blocking, which allows for eager unpersisting of a dataframe. On the other hand, the methods dataframe.cache() and dataframe.persist() don't have a comparable option. An (undocumented) workaround for this is to call dataframe.count() directly after cache() or persist(). But for API consistency and convenience it would make sense to give cache() and persist() the option blocking as well. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
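A minimal PySpark sketch of the behaviour discussed in SPARK-22616: the count()-after-cache() workaround comes from the ticket, while the cache(blocking=True) call is only the proposed, hypothetical API and does not exist in Spark 2.2.0.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("eager-cache-sketch").getOrCreate()
df = spark.range(1000)

# Undocumented workaround from the ticket: cache() only marks the dataframe,
# so a full action such as count() is needed to materialize the cache eagerly.
df.cache()
df.count()

# Proposed API (hypothetical, not available in Spark 2.2.0):
# df.cache(blocking=True)  # would block until the cache is materialized,
#                          # mirroring df.unpersist(blocking=True)

df.unpersist(blocking=True)  # unpersist() already offers the blocking option
{code}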
[jira] [Created] (SPARK-22613) Make UNCACHE TABLE behaviour consistent with CACHE TABLE
Andreas Maier created SPARK-22613: - Summary: Make UNCACHE TABLE behaviour consistent with CACHE TABLE Key: SPARK-22613 URL: https://issues.apache.org/jira/browse/SPARK-22613 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Affects Versions: 2.2.0 Reporter: Andreas Maier Priority: Minor The Spark SQL function CACHE TABLE is eager by default. Therefore it offers an optional keyword LAZY in case you do not want to cache the complete table immediately (See https://docs.databricks.com/spark/latest/spark-sql/language-manual/cache-table.html). But the corresponding Spark SQL function UNCACHE TABLE is lazy by default and doesn't offer an option EAGER (See https://docs.databricks.com/spark/latest/spark-sql/language-manual/uncache-table.html, https://stackoverflow.com/questions/47226494/is-uncache-table-a-lazy-operation-in-spark-sql). So one cannot cache and uncache a table in an eager way using Spark SQL. As a user I want an option EAGER for UNCACHE TABLE. An alternative could be to change the behaviour of UNCACHE TABLE to be eager by default (consistent with CACHE TABLE) and then offer an option LAZY also for UNCACHE TABLE. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
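A short sketch of the asymmetry described in SPARK-22613, driven from PySpark. The CACHE TABLE, CACHE LAZY TABLE and UNCACHE TABLE statements exist in Spark 2.2.0; the EAGER keyword at the end is only the proposed, hypothetical syntax.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("uncache-sketch").getOrCreate()
spark.range(100).createOrReplaceTempView("t")

spark.sql("CACHE LAZY TABLE t")              # optional LAZY keyword defers caching
spark.sql("SELECT COUNT(*) FROM t").show()   # first action materializes the cache
spark.sql("UNCACHE TABLE t")                 # no EAGER keyword is accepted here

# Proposed (hypothetical) syntax for symmetry with CACHE TABLE:
# spark.sql("UNCACHE TABLE t EAGER")
{code}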
[jira] [Commented] (SPARK-22517) NullPointerException in ShuffleExternalSorter.spill()
[ https://issues.apache.org/jira/browse/SPARK-22517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16257660#comment-16257660 ] Andreas Maier commented on SPARK-22517: --- Unfortunately I don't have minimal code to reproduce the problem. It occurred after several hours in a computation with hundreds of gigabytes of data, after some data was spilled from memory onto disk. But the problem looks similar to SPARK-21907. > NullPointerException in ShuffleExternalSorter.spill() > - > > Key: SPARK-22517 > URL: https://issues.apache.org/jira/browse/SPARK-22517 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Andreas Maier > > I see a NullPointerException during sorting with the following stacktrace: > {code} > 17/11/13 15:02:56 ERROR Executor: Exception in task 138.0 in stage 9.0 (TID > 13497) > java.lang.NullPointerException > at > org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:383) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:193) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.spill(ShuffleExternalSorter.java:254) > at > org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203) > at > org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281) > at > org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90) > at > org.apache.spark.shuffle.sort.ShuffleInMemorySorter.reset(ShuffleInMemorySorter.java:100) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.spill(ShuffleExternalSorter.java:256) > at > org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203) > at > org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281) > at > org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.growPointerArrayIfNecessary(ShuffleExternalSorter.java:328) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:379) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:246) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:167) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22517) NullPointerException in ShuffleExternalSorter.spill()
Andreas Maier created SPARK-22517: - Summary: NullPointerException in ShuffleExternalSorter.spill() Key: SPARK-22517 URL: https://issues.apache.org/jira/browse/SPARK-22517 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0 Reporter: Andreas Maier I see a NullPointerException during sorting with the following stacktrace: {code} 17/11/13 15:02:56 ERROR Executor: Exception in task 138.0 in stage 9.0 (TID 13497) java.lang.NullPointerException at org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:383) at org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:193) at org.apache.spark.shuffle.sort.ShuffleExternalSorter.spill(ShuffleExternalSorter.java:254) at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203) at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281) at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90) at org.apache.spark.shuffle.sort.ShuffleInMemorySorter.reset(ShuffleInMemorySorter.java:100) at org.apache.spark.shuffle.sort.ShuffleExternalSorter.spill(ShuffleExternalSorter.java:256) at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203) at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281) at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90) at org.apache.spark.shuffle.sort.ShuffleExternalSorter.growPointerArrayIfNecessary(ShuffleExternalSorter.java:328) at org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:379) at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:246) at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:167) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22436) New function strip() to remove all whitespace from string
[ https://issues.apache.org/jira/browse/SPARK-22436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16237446#comment-16237446 ] Andreas Maier commented on SPARK-22436: --- Python UDFs are very slow, aren't they? I believe a Spark native function would be much faster. And in fact it was already available with trim() before SPARK-17299 . > New function strip() to remove all whitespace from string > - > > Key: SPARK-22436 > URL: https://issues.apache.org/jira/browse/SPARK-22436 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 2.2.0 >Reporter: Andreas Maier >Priority: Minor > Labels: features > > Since ticket SPARK-17299 the [trim() > function|https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.trim] > will not remove any whitespace characters from beginning and end of a string > but only spaces. This is correct in regard to the SQL standard, but it opens > a gap in functionality. > My suggestion is to add to the Spark API in analogy to pythons standard > library the functions l/r/strip(), which should remove all whitespace > characters from a string from beginning and/or end of a string respectively. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22436) New function strip() to remove all whitespace from string
Andreas Maier created SPARK-22436: - Summary: New function strip() to remove all whitespace from string Key: SPARK-22436 URL: https://issues.apache.org/jira/browse/SPARK-22436 Project: Spark Issue Type: Improvement Components: PySpark, Spark Core Affects Versions: 2.2.0 Reporter: Andreas Maier Priority: Minor Since ticket SPARK-17299 the [trim() function|https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.trim] no longer removes all whitespace characters from the beginning and end of a string, but only spaces. This is correct with regard to the SQL standard, but it opens a gap in functionality. My suggestion is to add the functions l/r/strip() to the Spark API, in analogy to Python's standard library; they should remove all whitespace characters from the beginning and/or end of a string, respectively. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
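Until a native strip() exists, a workaround in PySpark is to strip all leading and trailing whitespace with regexp_replace(). The sketch below shows this; the strip()/lstrip()/rstrip() names at the end are only the proposed, hypothetical functions from the ticket.

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("strip-sketch").getOrCreate()
df = spark.createDataFrame([("\t hello world \n",)], ["s"])

# Workaround: remove all leading/trailing whitespace (tabs, newlines, ...)
# with a regular expression, since trim() only removes spaces.
stripped = df.select(F.regexp_replace("s", r"^\s+|\s+$", "").alias("s"))
stripped.show()

# Proposed (hypothetical) API, analogous to Python's str.strip()/lstrip()/rstrip():
# F.strip("s"), F.lstrip("s"), F.rstrip("s")
{code}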
[jira] [Created] (SPARK-22428) Document spark properties for configuring the ContextCleaner
Andreas Maier created SPARK-22428: - Summary: Document spark properties for configuring the ContextCleaner Key: SPARK-22428 URL: https://issues.apache.org/jira/browse/SPARK-22428 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 2.2.0 Reporter: Andreas Maier Priority: Minor The Spark properties for configuring the ContextCleaner, as described on https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-service-contextcleaner.html, are not documented in the official documentation at https://spark.apache.org/docs/latest/configuration.html#available-properties. As a user I would like to have the following Spark properties documented in the official documentation: {code:java} spark.cleaner.periodicGC.interval spark.cleaner.referenceTracking spark.cleaner.referenceTracking.blocking spark.cleaner.referenceTracking.blocking.shuffle spark.cleaner.referenceTracking.cleanCheckpoints {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
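For reference, a sketch of how the ContextCleaner properties listed above would typically be set from PySpark. The property names come from the ticket; the values are purely illustrative, not recommended or default values.

{code:python}
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Illustrative values only; see the ticket for the property names.
conf = (SparkConf()
        .set("spark.cleaner.periodicGC.interval", "30min")
        .set("spark.cleaner.referenceTracking", "true")
        .set("spark.cleaner.referenceTracking.blocking", "true")
        .set("spark.cleaner.referenceTracking.blocking.shuffle", "false")
        .set("spark.cleaner.referenceTracking.cleanCheckpoints", "false"))

spark = SparkSession.builder.config(conf=conf).getOrCreate()
{code}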
[jira] [Created] (SPARK-22369) PySpark: Document methods of spark.catalog interface
Andreas Maier created SPARK-22369: - Summary: PySpark: Document methods of spark.catalog interface Key: SPARK-22369 URL: https://issues.apache.org/jira/browse/SPARK-22369 Project: Spark Issue Type: Documentation Components: PySpark Affects Versions: 2.2.0 Reporter: Andreas Maier The following methods from the {{spark.catalog}} interface are not documented: {code:java} $ pyspark >>> dir(spark.catalog) ['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_jcatalog', '_jsparkSession', '_reset', '_sparkSession', 'cacheTable', 'clearCache', 'createExternalTable', 'createTable', 'currentDatabase', 'dropGlobalTempView', 'dropTempView', 'isCached', 'listColumns', 'listDatabases', 'listFunctions', 'listTables', 'recoverPartitions', 'refreshByPath', 'refreshTable', 'registerFunction', 'setCurrentDatabase', 'uncacheTable'] {code} As a user I would like to have these methods documented on http://spark.apache.org/docs/latest/api/python/pyspark.sql.html. The documentation of the old SQLContext methods (e.g. {{pyspark.sql.SQLContext.cacheTable()}} vs. {{pyspark.sql.SparkSession.catalog.cacheTable()}}, or {{pyspark.sql.HiveContext.refreshTable()}} vs. {{pyspark.sql.SparkSession.catalog.refreshTable()}}) should point to the new methods. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
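A small PySpark sketch contrasting a few of the undocumented spark.catalog methods listed above with their older SQLContext-style counterparts; the table name is made up for illustration.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("catalog-sketch").getOrCreate()
spark.range(10).createOrReplaceTempView("my_table")

# Newer catalog interface (methods from dir(spark.catalog) above):
spark.catalog.cacheTable("my_table")
print(spark.catalog.isCached("my_table"))
print([t.name for t in spark.catalog.listTables()])
spark.catalog.uncacheTable("my_table")

# Older SQLContext-style calls that the documentation could cross-reference:
# sqlContext.cacheTable("my_table")
# sqlContext.uncacheTable("my_table")
{code}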
[jira] [Commented] (SPARK-22249) UnsupportedOperationException: empty.reduceLeft when caching a dataframe
[ https://issues.apache.org/jira/browse/SPARK-22249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16216634#comment-16216634 ] Andreas Maier commented on SPARK-22249: --- Thank you for solving this issue so quickly. Not every open source project is reacting so fast. > UnsupportedOperationException: empty.reduceLeft when caching a dataframe > > > Key: SPARK-22249 > URL: https://issues.apache.org/jira/browse/SPARK-22249 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 > Environment: $ uname -a > Darwin MAC-UM-024.local 16.7.0 Darwin Kernel Version 16.7.0: Thu Jun 15 > 17:36:27 PDT 2017; root:xnu-3789.70.16~2/RELEASE_X86_64 x86_64 > $ pyspark --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.2.0 > /_/ > > Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_92 > Branch > Compiled by user jenkins on 2017-06-30T22:58:04Z > Revision > Url >Reporter: Andreas Maier >Assignee: Marco Gaido > Fix For: 2.2.1, 2.3.0 > > > It seems that the {{isin()}} method with an empty list as argument only > works, if the dataframe is not cached. If it is cached, it results in an > exception. To reproduce > {code:java} > $ pyspark > >>> df = spark.createDataFrame([pyspark.Row(KEY="value")]) > >>> df.where(df["KEY"].isin([])).show() > +---+ > |KEY| > +---+ > +---+ > >>> df.cache() > DataFrame[KEY: string] > >>> df.where(df["KEY"].isin([])).show() > Traceback (most recent call last): > File "", line 1, in > File > "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/sql/dataframe.py", > line 336, in show > print(self._jdf.showString(n, 20)) > File > "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", > line 1133, in __call__ > File > "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/sql/utils.py", > line 63, in deco > return f(*a, **kw) > File > "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", > line 319, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o302.showString. 
> : java.lang.UnsupportedOperationException: empty.reduceLeft > at > scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:180) > at > scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$reduceLeft(ArrayBuffer.scala:48) > at > scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:74) > at scala.collection.mutable.ArrayBuffer.reduceLeft(ArrayBuffer.scala:48) > at > scala.collection.TraversableOnce$class.reduce(TraversableOnce.scala:208) > at scala.collection.AbstractTraversable.reduce(Traversable.scala:104) > at > org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$1.applyOrElse(InMemoryTableScanExec.scala:107) > at > org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$1.applyOrElse(InMemoryTableScanExec.scala:71) > at scala.PartialFunction$Lifted.apply(PartialFunction.scala:223) > at scala.PartialFunction$Lifted.apply(PartialFunction.scala:219) > at > org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$2.apply(InMemoryTableScanExec.scala:112) > at > org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$2.apply(InMemoryTableScanExec.scala:111) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:381) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.immutable.List.flatMap(List.scala:344) > at > org.apache.spark.sql.execution.columnar.InMemoryTableScanExec.(InMemoryTableScanExec.scala:111) > at > org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$$anonfun$3.apply(SparkStrategies.scala:307) > at > org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$$anonfun$3.apply(SparkStrategies.scala:307) > at > org.apache.spark.sql.execution.SparkPlanner.pruneFilterProject(SparkPlanner.scala:99) > at > org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$.apply(SparkStrategies.scala:303) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonf
[jira] [Created] (SPARK-22249) UnsupportedOperationException: empty.reduceLeft when caching a dataframe
Andreas Maier created SPARK-22249: - Summary: UnsupportedOperationException: empty.reduceLeft when caching a dataframe Key: SPARK-22249 URL: https://issues.apache.org/jira/browse/SPARK-22249 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.2.0 Environment: $ uname -a Darwin MAC-UM-024.local 16.7.0 Darwin Kernel Version 16.7.0: Thu Jun 15 17:36:27 PDT 2017; root:xnu-3789.70.16~2/RELEASE_X86_64 x86_64 $ pyspark --version Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.2.0 /_/ Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_92 Branch Compiled by user jenkins on 2017-06-30T22:58:04Z Revision Url Reporter: Andreas Maier It seems that the {{isin()}} method with an empty list as argument only works, if the dataframe is not cached. If it is cached, it results in an exception. To reproduce {code:java} $ pyspark >>> df = spark.createDataFrame([pyspark.Row(KEY="value")]) >>> df.where(df["KEY"].isin([])).show() +---+ |KEY| +---+ +---+ >>> df.cache() DataFrame[KEY: string] >>> df.where(df["KEY"].isin([])).show() Traceback (most recent call last): File "", line 1, in File "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/sql/dataframe.py", line 336, in show print(self._jdf.showString(n, 20)) File "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__ File "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, in deco return f(*a, **kw) File "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o302.showString. 
: java.lang.UnsupportedOperationException: empty.reduceLeft at scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:180) at scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$reduceLeft(ArrayBuffer.scala:48) at scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:74) at scala.collection.mutable.ArrayBuffer.reduceLeft(ArrayBuffer.scala:48) at scala.collection.TraversableOnce$class.reduce(TraversableOnce.scala:208) at scala.collection.AbstractTraversable.reduce(Traversable.scala:104) at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$1.applyOrElse(InMemoryTableScanExec.scala:107) at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$1.applyOrElse(InMemoryTableScanExec.scala:71) at scala.PartialFunction$Lifted.apply(PartialFunction.scala:223) at scala.PartialFunction$Lifted.apply(PartialFunction.scala:219) at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$2.apply(InMemoryTableScanExec.scala:112) at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$2.apply(InMemoryTableScanExec.scala:111) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) at scala.collection.immutable.List.flatMap(List.scala:344) at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec.(InMemoryTableScanExec.scala:111) at org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$$anonfun$3.apply(SparkStrategies.scala:307) at org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$$anonfun$3.apply(SparkStrategies.scala:307) at org.apache.spark.sql.execution.SparkPlanner.pruneFilterProject(SparkPlanner.scala:99) at org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$.apply(SparkStrategies.scala:303) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonf
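A hedged workaround sketch for the bug reproduced above on Spark 2.2.0: guard against an empty isin() list on a cached dataframe by returning an always-false filter instead. The helper function name is made up for illustration.

{code:python}
from pyspark.sql import Row, SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("isin-sketch").getOrCreate()
df = spark.createDataFrame([Row(KEY="value")])
df.cache()

def where_key_in(frame, values):
    # Workaround: isin([]) on a cached dataframe triggers the exception above,
    # so treat an empty list explicitly as "match nothing".
    if not values:
        return frame.where(F.lit(False))
    return frame.where(frame["KEY"].isin(values))

where_key_in(df, []).show()         # empty result, no empty.reduceLeft error
where_key_in(df, ["value"]).show()  # normal isin() behaviour
{code}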
[jira] [Commented] (STORM-2077) KafkaSpout doesn't retry failed tuples
[ https://issues.apache.org/jira/browse/STORM-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15712237#comment-15712237 ] Andreas Maier commented on STORM-2077: -- Is this ticket related to STORM-2087 ? > KafkaSpout doesn't retry failed tuples > -- > > Key: STORM-2077 > URL: https://issues.apache.org/jira/browse/STORM-2077 > Project: Apache Storm > Issue Type: Bug > Components: storm-kafka >Affects Versions: 1.0.2 >Reporter: Tobias Maier > > KafkaSpout does not retry all failed tuples. > We used following Configuration: > Map props = new HashMap<>(); > props.put(KafkaSpoutConfig.Consumer.GROUP_ID, "c1"); > props.put(KafkaSpoutConfig.Consumer.KEY_DESERIALIZER, > ByteArrayDeserializer.class.getName()); > props.put(KafkaSpoutConfig.Consumer.VALUE_DESERIALIZER, > ByteArrayDeserializer.class.getName()); > props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, > broker.bootstrapServer()); > KafkaSpoutStreams kafkaSpoutStreams = new > KafkaSpoutStreams.Builder(FIELDS_KAFKA_EVENT, new > String[]{"test-topic"}).build(); > KafkaSpoutTuplesBuilder kafkaSpoutTuplesBuilder = new > KafkaSpoutTuplesBuilder.Builder<>(new > KeyValueKafkaSpoutTupleBuilder("test-topic")).build(); > KafkaSpoutRetryService retryService = new > KafkaSpoutLoggedRetryExponentialBackoff(KafkaSpoutLoggedRetryExponentialBackoff.TimeInterval.milliSeconds(1), > KafkaSpoutLoggedRetryExponentialBackoff.TimeInterval.milliSeconds(1), 3, > KafkaSpoutLoggedRetryExponentialBackoff.TimeInterval.seconds(1)); > KafkaSpoutConfig config = new > KafkaSpoutConfig.Builder<>(props, kafkaSpoutStreams, kafkaSpoutTuplesBuilder, > retryService) > .setFirstPollOffsetStrategy(UNCOMMITTED_LATEST) > .setMaxUncommittedOffsets(30) > .setOffsetCommitPeriodMs(10) > .setMaxRetries(3) > .build(); > kafkaSpout = new org.apache.storm.kafka.spout.KafkaSpout<>(config); > The downstream bolt fails every tuple and we expect, that those tuple will > all be replayed. But that's not the case for every tuple. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (STORM-2087) Storm-kafka-client: Failed tuples are not always replayed
[ https://issues.apache.org/jira/browse/STORM-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15708199#comment-15708199 ] Andreas Maier commented on STORM-2087: -- Is this a duplicate of STORM-2077 ? > Storm-kafka-client: Failed tuples are not always replayed > -- > > Key: STORM-2087 > URL: https://issues.apache.org/jira/browse/STORM-2087 > Project: Apache Storm > Issue Type: Bug >Reporter: Jeff Fenchel > Time Spent: 9h > Remaining Estimate: 0h > > I am working with kafka 10 and the storm-kafka-client from master. It appears > that tuples are not always being replayed when they are failed. > With a topology that randomly fails tuples a small percentage of the time I > found that the committed kafka offset would get stuck and eventually > processing would stop even though the committed offset was no where near the > end of the topic. > I have also replicated the issue in unit tests with this PR: > https://github.com/apache/storm/pull/1679 > It seems that increasing the number of times I call nextTuple for the in > order case will make it work, but it doesn't seem to help the case where > tuples are failed out of order from which they were emitted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AVRO-1927) If a default value is set, Avro allows null values in non-nullable fields.
[ https://issues.apache.org/jira/browse/AVRO-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15568879#comment-15568879 ] Andreas Maier commented on AVRO-1927: - So I had a look at the generated Java code. Strangely enough, this code only throws an exception if no default value is set: {code} protected void validate(Field field, Object value) { if(!isValidValue(field, value)) { if(field.defaultValue() == null) { // why this check? throw new AvroRuntimeException("Field " + field + " does not accept null values"); } } } {code} I don't understand why Avro checks {{field.defaultValue() == null}} before throwing an exception. In my opinion it should always throw an exception if the field value is invalid. > If a default value is set, Avro allows null values in non-nullable fields. > -- > > Key: AVRO-1927 > URL: https://issues.apache.org/jira/browse/AVRO-1927 > Project: Avro > Issue Type: Bug > Components: java >Affects Versions: 1.8.1 >Reporter: Andreas Maier > Labels: newbie > > With an avro schema like > {code} > { > "name": "myfield", > "type": "string", > "default": "" > } > {code} > the following code should throw an exception > {code} > MyObject myObject = MyObject.newBuilder().setMyfield(null).build(); > {code} > But instead the value of myfield is set to null, which causes an exception > later when serializing myObject, because null is not a valid value for > myfield. > I believe in this case setMyfield(null) should throw an exception, > independent of the value of default. > See also > https://stackoverflow.com/questions/38509279/generated-avro-builder-set-null-doesnt-overwrite-with-default -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AVRO-1927) If a default value is set, Avro allows null values in non-nullable fields.
[ https://issues.apache.org/jira/browse/AVRO-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15568867#comment-15568867 ] Andreas Maier commented on AVRO-1927: - Sorry for the long delay. I was on vacation. I tried the code you suggested. You are right, it does throw a NullPointerException when I try to write the object: {code} java.lang.NullPointerException: null of string of de.am.MyObject at org.apache.avro.generic.GenericDatumWriter.npe(GenericDatumWriter.java:132) at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:126) at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:73) at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:60) at de.am.MyObjectTest.myObjectTest2(MyObjectTest.java:27) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) at org.junit.runners.ParentRunner.run(ParentRunner.java:363) at org.junit.runner.JUnitCore.run(JUnitCore.java:137) at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:119) at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:42) at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:234) at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:74) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144) Caused by: java.lang.NullPointerException at org.apache.avro.specific.SpecificDatumWriter.writeString(SpecificDatumWriter.java:67) at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:115) at org.apache.avro.specific.SpecificDatumWriter.writeField(SpecificDatumWriter.java:87) at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:143) at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:105) ... 30 more {code} > If a default value is set, Avro allows null values in non-nullable fields. 
> -- > > Key: AVRO-1927 > URL: https://issues.apache.org/jira/browse/AVRO-1927 > Project: Avro > Issue Type: Bug > Components: java >Affects Versions: 1.8.1 >Reporter: Andreas Maier > Labels: newbie > > With an avro schema like > {code} > { > "name": "myfield", > "type": "string", > "default": "" > } > {code} > the following code should throw an exception > {code} > MyObject myObject = MyObject.newBuilder().setMyfield(null).build(); > {code} > But instead the value of myfield is set to null, which causes an exception > later when serializing myObject, because null is not a valid value for > myfield. > I believe in this case setMyfield(null) should throw an exception, > independent of the value of default. > See also > https://stackoverflow.com/questions/38509279/generated-avro-builder-set-null-doesnt-overwrite-with-default -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (STORM-2123) Adding a ByteKeyValueScheme class
Andreas Maier created STORM-2123: Summary: Adding a ByteKeyValueScheme class Key: STORM-2123 URL: https://issues.apache.org/jira/browse/STORM-2123 Project: Apache Storm Issue Type: Improvement Components: storm-kafka Reporter: Andreas Maier It is a common use case to send and receive byte keys and values via Kafka. Unfortunately the storm-kafka package lacks support for this. The pull request https://github.com/apache/storm/pull/1711 adds such a ByteKeyValueScheme class. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (AVRO-1927) If a default value is set, Avro allows null values in non-nullable fields.
Andreas Maier created AVRO-1927: --- Summary: If a default value is set, Avro allows null values in non-nullable fields. Key: AVRO-1927 URL: https://issues.apache.org/jira/browse/AVRO-1927 Project: Avro Issue Type: Bug Components: java Affects Versions: 1.8.1 Reporter: Andreas Maier With an avro schema like {code} { "name": "myfield", "type": "string", "default": "" } {code} the following code should throw an exception {code} MyObject myObject = MyObject.newBuilder().setMyfield(null).build(); {code} But instead the value of myfield is set to null, which causes an exception later when serializing myObject, because null is not a valid value for myfield. I believe in this case setMyfield(null) should throw an exception, independent of the value of default. See also https://stackoverflow.com/questions/38509279/generated-avro-builder-set-null-doesnt-overwrite-with-default -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (STORM-2121) StringKeyValueScheme doesn't override getOutputFields()
Andreas Maier created STORM-2121: Summary: StringKeyValueScheme doesn't override getOutputFields() Key: STORM-2121 URL: https://issues.apache.org/jira/browse/STORM-2121 Project: Apache Storm Issue Type: Bug Components: storm-kafka Reporter: Andreas Maier In org.apache.storm.kafka, StringKeyValueScheme extends StringScheme. However, it doesn't override the method getOutputFields() inherited from StringScheme: {code} public Fields getOutputFields() { return new Fields(STRING_SCHEME_KEY); } {code} This method returns only one field instead of two (one for the key and one for the value), which causes problems. It would be better to override getOutputFields() in StringKeyValueScheme with, e.g., {code} @Override public Fields getOutputFields() { return new Fields(FieldNameBasedTupleToKafkaMapper.BOLT_KEY, FieldNameBasedTupleToKafkaMapper.BOLT_MESSAGE); } {code} The important thing is that getOutputFields() should return two fields, not one. -- This message was sent by Atlassian JIRA (v6.3.4#6332)