Re: [StructuredStreaming] HDFSBackedStateStoreProvider is leaking .crc files.

2019-06-12 Thread Gerard Maas
Ooops - linked the wrong JIRA ticket: (that other one is related) https://issues.apache.org/jira/browse/SPARK-28025 On Wed, Jun 12, 2019 at 1:21 PM Gerard Maas wrote: > Hi! > I would like to socialize this issue we are currently facing: > The Structured Streaming default CheckpointFi

[StructuredStreaming] HDFSBackedStateStoreProvider is leaking .crc files.

2019-06-12 Thread Gerard Maas
Hi! I would like to socialize this issue we are currently facing: The Structured Streaming default CheckpointFileManager leaks .crc files by leaving them behind after users of this class (like HDFSBackedStateStoreProvider) apply their cleanup methods. This results in an unbounded creation of tiny

[Structured Streaming] File source, Parquet format: use of the mergeSchema option.

2018-04-11 Thread Gerard Maas
Hi, I'm looking into the Parquet format support for the File source in Structured Streaming. The docs mention the use of the option 'mergeSchema' to merge the schemas of the part files found.[1] What would be the practical use of that in a streaming context? In its batch counterpart,
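As a rough mental model of what mergeSchema does across part files: take the union of the fields, rejecting type conflicts. A toy Java sketch under that assumption (schemas modeled as name-to-type maps; this is not Parquet's actual merge logic):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SchemaMerge {
    // Toy model: a schema is a field-name -> type map. Merging takes the
    // union of all fields and rejects any field seen with two different types.
    static Map<String, String> merge(Map<String, String> a, Map<String, String> b) {
        Map<String, String> merged = new LinkedHashMap<>(a);
        b.forEach((field, type) -> merged.merge(field, type, (t1, t2) -> {
            if (!t1.equals(t2)) {
                throw new IllegalArgumentException(
                    "conflicting types for '" + field + "': " + t1 + " vs " + t2);
            }
            return t1;
        }));
        return merged;
    }

    public static void main(String[] args) {
        // Two part files with overlapping but non-identical schemas.
        Map<String, String> part1 = Map.of("id", "long", "name", "string");
        Map<String, String> part2 = Map.of("id", "long", "score", "double");
        System.out.println(merge(part1, part2));
    }
}
```

In a streaming context the same question applies per newly discovered file, which is why the cost/benefit of the option is worth asking about.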

[Structured Streaming] OOM on ConsoleSink with large inputs

2017-08-11 Thread Gerard Maas
Devs, While investigating another issue, I came across this OOM error when using the Console Sink with any source that can be larger than the available driver memory. In my case, I was using the File source and I had a 14G file in the monitored dir. I traced back the issue to a `df.collect` in
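The fix direction implied here is to bound what reaches the driver: materialize only the first N rows for display instead of collecting everything. A minimal Java sketch of the principle (a LongStream stands in for a dataset larger than driver memory; this is not Spark code):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.LongStream;

public class BoundedPreview {
    // Take at most maxRows from a potentially unbounded/huge source before
    // materializing anything, so driver memory use stays constant.
    static List<Long> preview(LongStream rows, int maxRows) {
        return rows.limit(maxRows).boxed().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // A "source" far larger than anything we'd want collected on the driver;
        // limit() keeps it lazy, so only 20 values are ever produced.
        List<Long> shown = preview(LongStream.range(0, 1_000_000_000L), 20);
        System.out.println(shown.size() + " rows shown");
    }
}
```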

Re: Handling questions in the mailing lists

2016-11-09 Thread Gerard Maas
Great discussion. Glad to see it happening and lucky to have seen it on the mailing list due to its high volume. I had this same conversation with Patrick Wendell few Spark Summits ago. At the time, SO was not even listed as a resource and the idea was to make it the primary "go-to" place for

Re: Can we remove private[spark] from Metrics Source and Sink traits?

2016-03-19 Thread Gerard Maas
+1 On Mar 19, 2016 08:33, "Pete Robbins" wrote: > This seems to me to be unnecessarily restrictive. These are very useful > extension points for adding 3rd party sources and sinks. > > I intend to make an Elasticsearch sink available on spark-packages but > this will require

Re: Time is ugly in Spark Streaming....

2015-06-26 Thread Gerard Maas
Are you sharing the SimpleDateFormat instance? This looks a lot more like the non-thread-safe behaviour of SimpleDateFormat (that has claimed many unsuspecting victims over the years), than any 'ugly' Spark Streaming. Try writing the timestamps in millis to Kafka and compare. -kr, Gerard. On
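The hazard Gerard points at is easy to avoid by giving each thread its own formatter. A minimal Java sketch of that pattern (the class and the timestamp are my own illustration, not from the thread):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Set;
import java.util.TimeZone;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SafeFormatting {
    // SimpleDateFormat keeps mutable internal state, so a single shared
    // instance is unsafe under concurrency; give each thread its own copy.
    private static final ThreadLocal<SimpleDateFormat> FMT =
        ThreadLocal.withInitial(() -> {
            SimpleDateFormat f = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
            f.setTimeZone(TimeZone.getTimeZone("UTC"));
            return f;
        });

    // Format the same instant from many threads; with per-thread formatters
    // every result should be identical.
    static Set<String> formatConcurrently(long millis, int tasks) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        Set<String> results = ConcurrentHashMap.newKeySet();
        for (int i = 0; i < tasks; i++) {
            pool.submit(() -> results.add(FMT.get().format(new Date(millis))));
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        return results;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(formatConcurrently(1_435_276_800_000L, 500));
    }
}
```

Gerard's other suggestion, writing raw millis to Kafka, sidesteps the formatter entirely and is the simpler option when the consumer can do its own formatting.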

Re: Stages with non-arithmetic numbering Timing metrics in event logs

2015-06-11 Thread Gerard Maas
Kay, Excellent write-up. This should be preserved for reference somewhere searchable. -Gerard. On Fri, Jun 12, 2015 at 1:19 AM, Kay Ousterhout k...@eecs.berkeley.edu wrote: Here’s how the shuffle works. This explains what happens for a single task; this will happen in parallel for each

Re: [Streaming] Configure executor logging on Mesos

2015-06-01 Thread Gerard Maas
the spark.executor.uri (or another one) can take more than one downloadable path. my.2¢ andy On Fri, May 29, 2015 at 5:09 PM Gerard Maas gerard.m...@gmail.com wrote: Hi Tim, Thanks for the info. We (Andy Petrella and myself) have been diving a bit deeper into this log config: The log

Re: Registering custom metrics

2015-01-08 Thread Gerard Maas
) bytes } .saveAsTextFile(text) Is there a way to achieve this with the MetricSystem? On Mon, Jan 5, 2015 at 10:24 AM, Gerard Maas gerard.m...@gmail.com wrote: Hi, Yes, I managed to register custom metrics by creating an implementation

Re: Registering custom metrics

2015-01-05 Thread Gerard Maas
Hi, Yes, I managed to register custom metrics by creating an implementation of org.apache.spark.metrics.source.Source and registering it to the metrics subsystem. Source is [Spark] private, so you need to create it under an org.apache.spark package. In my case, I'm dealing with Spark

Tuning Spark Streaming jobs

2014-12-22 Thread Gerard Maas
Hi, After facing issues with the performance of some of our Spark Streaming jobs, we invested quite some effort figuring out the factors that affect the performance characteristics of a Streaming job. We defined an empirical model that helps us reason about Streaming jobs and applied it to tune
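The core of any such empirical model is the stability condition: mean batch processing time must stay below the batch interval, otherwise delay grows without bound. A toy sketch of that check (numbers and method names are illustrative, not from the linked model):

```java
public class StreamingStability {
    // A micro-batch job only keeps up if, on average, it processes a batch
    // faster than new batches arrive. Utilization > 1.0 means batches queue up
    // and total delay grows without bound.
    static double utilization(double meanProcessingMs, double batchIntervalMs) {
        return meanProcessingMs / batchIntervalMs;
    }

    static boolean isStable(double meanProcessingMs, double batchIntervalMs) {
        return utilization(meanProcessingMs, batchIntervalMs) < 1.0;
    }

    public static void main(String[] args) {
        System.out.println(utilization(1500, 2000)); // keeping up, with headroom
        System.out.println(isStable(2500, 2000));    // falling behind
    }
}
```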

Re: Tuning Spark Streaming jobs

2014-12-22 Thread Gerard Maas
mode? I'm making changes to the spark mesos scheduler and I think we can propose a best way to achieve what you mentioned. Tim Sent from my iPhone On Dec 22, 2014, at 8:33 AM, Gerard Maas gerard.m...@gmail.com wrote: Hi, After facing issues with the performance of some of our Spark

Understanding reported times on the Spark UI [+ Streaming]

2014-12-08 Thread Gerard Maas
Hi, I'm confused about the Stage times reported on the Spark-UI (Spark 1.1.0) for a Spark Streaming job. I'm hoping somebody can shine some light on it: Let's do this with an example: On the /stages page, stage # 232 is reported to have lasted 18 seconds: 232 runJob at RDDFunctions.scala:23

Re: Spark Streaming Metrics

2014-11-21 Thread Gerard Maas
Looks like metrics are not a hot topic to discuss - yet so important to sleep well when jobs are running in production. I've created Spark-4537 https://issues.apache.org/jira/browse/SPARK-4537 to track this issue. -kr, Gerard. On Thu, Nov 20, 2014 at 9:25 PM, Gerard Maas gerard.m...@gmail.com

Spark Streaming Metrics

2014-11-20 Thread Gerard Maas
As the Spark Streaming tuning guide indicates, the key indicators of a healthy streaming job are: - Processing Time - Total Delay The Spark UI page for the Streaming job [1] shows these two indicators but the metrics source for Spark Streaming (StreamingSource.scala) [2] does not. Any reasons
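The two indicators relate by simple arithmetic: total delay is scheduling delay plus processing time. A toy Java sketch of the bookkeeping (timestamps and method names are illustrative assumptions, not Spark's StreamingSource API):

```java
public class BatchTimings {
    // For one micro-batch with three timestamps (ms): when it was submitted,
    // when processing started, and when processing finished.
    static long processingTime(long startMs, long endMs) { return endMs - startMs; }

    static long schedulingDelay(long submitMs, long startMs) { return startMs - submitMs; }

    // Total delay = scheduling delay + processing time, by construction.
    static long totalDelay(long submitMs, long endMs) { return endMs - submitMs; }

    public static void main(String[] args) {
        long submit = 1_000, start = 1_250, end = 2_050;
        System.out.println("processing time:  " + processingTime(start, end) + " ms");
        System.out.println("scheduling delay: " + schedulingDelay(submit, start) + " ms");
        System.out.println("total delay:      " + totalDelay(submit, end) + " ms");
    }
}
```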

Registering custom metrics

2014-10-30 Thread Gerard Maas
Hi, I've been exploring the metrics exposed by Spark and I'm wondering whether there's a way to register job-specific metrics that could be exposed through the existing metrics system. Would there be an example somewhere? BTW, documentation about how the metrics work could be improved. I

Using case classes as keys does not seem to work.

2014-07-22 Thread Gerard Maas
Using a case class as a key doesn't seem to work properly. [Spark 1.0.0] A minimal example: case class P(name:String) val ps = Array(P("alice"), P("bob"), P("charly"), P("bob")) sc.parallelize(ps).map(x => (x,1)).reduceByKey((x,y) => x+y).collect [Spark shell local mode] res : Array[(P, Int)] =
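The failure described here comes down to the hashCode/equals contract that any key-based grouping relies on; in the Spark shell, classes defined in the REPL can end up violating it after serialization. A plain-Java sketch of the underlying contract (not Spark code; class names are invented for illustration):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class KeyContract {
    // No equals/hashCode: two BadKey("bob") instances are distinct map keys,
    // so counts that should merge stay separate.
    static final class BadKey {
        final String name;
        BadKey(String name) { this.name = name; }
    }

    // Value-based equals/hashCode: duplicates collapse into one key,
    // which is exactly what reduceByKey depends on.
    static final class GoodKey {
        final String name;
        GoodKey(String name) { this.name = name; }
        @Override public boolean equals(Object o) {
            return o instanceof GoodKey && ((GoodKey) o).name.equals(name);
        }
        @Override public int hashCode() { return Objects.hash(name); }
    }

    static <K> Map<K, Integer> countByKey(K[] keys) {
        Map<K, Integer> counts = new HashMap<>();
        for (K k : keys) counts.merge(k, 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        BadKey[] bad = { new BadKey("bob"), new BadKey("bob") };
        GoodKey[] good = { new GoodKey("bob"), new GoodKey("bob") };
        System.out.println("bad keys:  " + countByKey(bad).size() + " entries");
        System.out.println("good keys: " + countByKey(good).size() + " entry");
    }
}
```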

Re: Using case classes as keys does not seem to work.

2014-07-22 Thread Gerard Maas
,ArrayBuffer(1, 1))) On Tue, Jul 22, 2014 at 4:20 PM, Gerard Maas gerard.m...@gmail.com wrote: Using a case class as a key doesn't seem to work properly. [Spark 1.0.0] A minimal example: case class P(name:String) val ps = Array(P("alice"), P("bob"), P("charly"), P("bob")) sc.parallelize(ps).map(x => (x,1

Re: Using case classes as keys does not seem to work.

2014-07-22 Thread Gerard Maas
, 2014 at 5:37 PM, Gerard Maas gerard.m...@gmail.com wrote: Yes, right. 'sc.parallelize(ps).map(x => (x.name,1)).groupByKey().collect' An oversight from my side. Thanks!, Gerard. On Tue, Jul 22, 2014 at 5:24 PM, Daniel Siegmann daniel.siegm...@velos.io wrote: I can confirm this bug

Re: Should SPARK_HOME be needed with Mesos?

2014-05-22 Thread Gerard Maas
send in a pull request that includes your proposed changes? Andrew On Wed, May 21, 2014 at 10:19 AM, Gerard Maas gerard.m...@gmail.com wrote: Spark dev's, I was looking into a question asked on the user list where a ClassNotFoundException was thrown when running a job on Mesos

Re: Should SPARK_HOME be needed with Mesos?

2014-05-22 Thread Gerard Maas
a new ticket for just this particular issue. On Thu, May 22, 2014 at 11:03 AM, Gerard Maas gerard.m...@gmail.com wrote: Sure. Should I create a Jira as well? I saw there's already a broader ticket regarding the ambiguous use of SPARK_HOME [1] (cc: Patrick as owner of that ticket) I don't

Re: ClassNotFoundException with Spark/Mesos (spark-shell works fine)

2014-05-21 Thread Gerard Maas
Hi Tobias, I was curious about this issue and tried to run your example on my local Mesos. I was able to reproduce your issue using your current config: [error] (run-main-0) org.apache.spark.SparkException: Job aborted: Task 1.0:4 failed 4 times (most recent failure: Exception failure:

Should SPARK_HOME be needed with Mesos?

2014-05-21 Thread Gerard Maas
Spark dev's, I was looking into a question asked on the user list where a ClassNotFoundException was thrown when running a job on Mesos. Curious issue with serialization on Mesos: more details here [1]: When trying to run that simple example on my Mesos installation, I faced another issue: I got

Re: ClassNotFoundException with Spark/Mesos (spark-shell works fine)

2014-05-21 Thread Gerard Maas
for it to work. The SparkREPL works differently. It uses some dark magic to send the working session to the workers. -kr, Gerard. On Wed, May 21, 2014 at 2:47 PM, Gerard Maas gerard.m...@gmail.com wrote: Hi Tobias, I was curious about this issue and tried to run your example on my local

Re: Announcing the official Spark Job Server repo

2014-03-19 Thread Gerard Maas
this is cool +1 On Wed, Mar 19, 2014 at 6:54 PM, Patrick Wendell pwend...@gmail.com wrote: Evan - yep definitely open a JIRA. It would be nice to have a contrib repo set-up for the 1.0 release. On Tue, Mar 18, 2014 at 11:28 PM, Evan Chan e...@ooyala.com wrote: Matei, Maybe it's time