[SHUFFLE]FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2019-03-11 Thread wangfei
Hi all, Non-deterministic FAILED_TO_UNCOMPRESS(5) or 'Stream is corrupted' errors may occur during shuffle read, as described in this JIRA (https://issues.apache.org/jira/browse/SPARK-4105). There have been no new comments on this JIRA for a long time. So, has anyone seen these errors in

jenkins locale issue

2019-03-11 Thread Yuming Wang
Why is the Jenkins locale: LANG= LC_CTYPE="POSIX" LC_NUMERIC="POSIX" LC_TIME="POSIX" LC_COLLATE="POSIX" LC_MONETARY="POSIX" LC_MESSAGES="POSIX" LC_PAPER="POSIX" LC_NAME="POSIX" LC_ADDRESS="POSIX" LC_TELEPHONE="POSIX" LC_MEASUREMENT="POSIX" LC_IDENTIFICATION="POSIX"? Hadoop will throw InvalidPathExcept

Re: Benchmark Java/Scala/Python for Apache spark

2019-03-11 Thread Reynold Xin
If you use UDFs in Python, you would want to use Pandas UDF for better performance. On Mon, Mar 11, 2019 at 7:50 PM Jonathan Winandy wrote: > Thanks, I didn't know! > > That being said, any udf use seems to affect badly code generation (and > the performance). > > > On Mon, 11 Mar 2019, 15:13 Dy
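The performance point above is that a Pandas UDF processes a whole pandas.Series per batch instead of one Python object per row, which avoids most of the per-row serialization cost. A minimal sketch, assuming pandas is available; the pyspark lines are commented out so the core function can be read without a Spark session (`add_one` and the column name `x` are illustrative, not from the thread):

```python
import pandas as pd

def add_one(batch: pd.Series) -> pd.Series:
    # Vectorized: one pandas operation per batch, not one Python call per row.
    return batch + 1

# With a Spark session available, the same function becomes a Pandas UDF:
# from pyspark.sql.functions import pandas_udf
# from pyspark.sql.types import LongType
# add_one_udf = pandas_udf(add_one, returnType=LongType())
# df.select(add_one_udf("x"))

print(list(add_one(pd.Series([1, 2, 3]))))  # → [2, 3, 4]
```

The function itself is plain pandas code, which is why the batch-at-a-time model is so much cheaper than a row-at-a-time Python UDF.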

Re: Benchmark Java/Scala/Python for Apache spark

2019-03-11 Thread Jonathan Winandy
Thanks, I didn't know! That being said, any UDF use seems to badly affect code generation (and performance). On Mon, 11 Mar 2019, 15:13 Dylan Guedes, wrote: > Btw, even if you are using Python you can register your UDFs in Scala and > use them in Python. > > On Mon, Mar 11, 2019 at 6:55 AM

RE: [External] Re: [Spark RPC] Help how to debug sudden performance issue

2019-03-11 Thread Hough, Stephen C
The problem is located within \scheduler\cluster\CoarseGrainedSchedulerBackend.scala, in the receive function's StatusUpdate handling. When my incident occurs, the scheduler becomes effectively single-threaded, processing 80k continuous messages. Take 3 of those consecutive messages: 04-03-19 22:02:43:037 [
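The "effectively single-threaded" behaviour described above is what you get when one consumer drains a message inbox serially: a burst of messages queues behind a single handler. A hypothetical Python sketch of that pattern (this is not Spark code; the names `inbox` and `receive_loop` are illustrative stand-ins for an RPC endpoint's queue and receive function):

```python
import queue
import threading

inbox: "queue.Queue" = queue.Queue()
processed = []

def receive_loop():
    # Single consumer: messages are handled strictly one at a time, so a
    # burst of pending messages serializes behind this one thread.
    while True:
        msg = inbox.get()
        if msg is None:  # sentinel to stop the loop
            break
        processed.append(msg)

t = threading.Thread(target=receive_loop)
t.start()
for i in range(5):  # stand-in for a burst of StatusUpdate messages
    inbox.put(i)
inbox.put(None)
t.join()
print(processed)  # → [0, 1, 2, 3, 4]
```

Under this model, total latency for a burst is the sum of all per-message handling times, which matches the symptom of 80k messages taking minutes to drain.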

Re: Java 11 support

2019-03-11 Thread Sean Owen
Search JIRA ... https://issues.apache.org/jira/browse/SPARK-24417 On Mon, Mar 11, 2019 at 1:03 PM Sudhir Menon wrote: > > Is there a timeline for Spark 3.0? > Or more specifically, is there a timeline for moving to Java 9 and beyond? > > Thanks in advance > Suds > > > > On Tue, Nov 6, 2018 at 9:1

Re: Java 11 support

2019-03-11 Thread Sudhir Menon
Is there a timeline for Spark 3.0? Or more specifically, is there a timeline for moving to Java 9 and beyond? Thanks in advance Suds On Tue, Nov 6, 2018 at 9:16 AM Felix Cheung wrote: > +1 for Spark 3, definitely > Thanks for the updates > > > -- > *From:* Sean Owe

Re: Benchmark Java/Scala/Python for Apache spark

2019-03-11 Thread Dylan Guedes
Btw, even if you are using Python, you can register your UDFs in Scala and use them in Python. On Mon, Mar 11, 2019 at 6:55 AM Jonathan Winandy wrote: > Hello Snehasish > > If you are not using UDFs, you will have very similar performance with > those languages on SQL. > > So it go down to : > *

Re: Benchmark Java/Scala/Python for Apache spark

2019-03-11 Thread Jonathan Winandy
Hello Snehasish, If you are not using UDFs, you will have very similar performance with those languages on SQL. So it comes down to: * if you know Python, go for Python. * if you are used to the JVM and are ready for a bit of a paradigm shift, go for Scala. Our team is using Scala; however, we help o

Benchmark Java/Scala/Python for Apache spark

2019-03-11 Thread SNEHASISH DUTTA
Hi, Is there a way to get performance benchmarks for developing an application using either Java, Scala, or Python? The use case mostly involves SQL pipelines, with data ingested from various sources including Kafka. What should be the most preferred language? It would be great if the preference for language ca
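For a question like this, a small timing harness is usually more informative than published benchmarks, since results depend heavily on the pipeline. A minimal sketch under stated assumptions: `benchmark` and the sample workload are illustrative, and a plain Python callable stands in for a Spark job (in practice you would time something like `spark.sql(...).collect()`):

```python
import time

def benchmark(fn, repeats=5):
    # Run fn several times and report the best wall-clock time; for short
    # workloads the minimum is less noisy than the mean.
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

elapsed = benchmark(lambda: sum(range(100_000)))
print(f"best of 5: {elapsed:.6f}s")
```

When comparing languages, be sure each variant runs the same logical plan (check with `df.explain()`), otherwise you are benchmarking the optimizer rather than the language binding.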

Re: [External] Re: [Spark RPC] Help how to debug sudden performance issue

2019-03-11 Thread Jörn Franke
Well, it will be difficult to say anything without knowing func. It could be that 40 cores and 200 GB for an executor is not a setup that suits func and the overall architecture. It could also be GC issues, etc. Sometimes it does not help to throw hardware at the issue. It de

RE: [External] Re: [Spark RPC] Help how to debug sudden performance issue

2019-03-11 Thread Hough, Stephen C
Thanks. There is no issue on the worker/executor side; they have ample memory (> 200 GB). I gave that information as background on the system; apologies for the confusion. The problem is isolated to the lifetime of processing a DriverEndpoint StatusUpdate message. For 40 minutes the system runs fin

Re: [Spark RPC] Help how to debug sudden performance issue

2019-03-11 Thread Jörn Franke
Well, it is a little bit difficult to say, because a lot of things are mixed up here. What function is calculated? Does it need a lot of memory? Could it be that you run out of memory, some spillover happens, and you have a lot of IO to disk which is blocking? Related to that could be 1 exec