Re: Spark 3.0 ArrayIndexOutOfBoundsException at RDDOperationScope.toJson

2020-06-29 Thread taegeonum
I've found the problem. I removed guava 14.0 from the extraClassPath in my Spark job, and there is no exception.
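
For readers hitting the same thing, a minimal sketch of the kind of classpath setting involved (the paths, and everything beyond the guava 14.0 jar itself, are assumptions, not from the thread):

import org.apache.spark.SparkConf

// Illustrative only: an old guava jar prepended via extraClassPath can shadow
// the guava version Spark 3.0 ships with and break RDDOperationScope.toJson.
val conf = new SparkConf()
  .setAppName("cc-example")
  // The problematic entries looked roughly like this and were removed:
  // .set("spark.driver.extraClassPath", "/opt/libs/guava-14.0.jar")
  // .set("spark.executor.extraClassPath", "/opt/libs/guava-14.0.jar")
// Note: spark.driver.extraClassPath takes effect only before the driver JVM
// starts (spark-defaults.conf or spark-submit --conf), not from inside the job.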

Spark 3.0 ArrayIndexOutOfBoundsException at RDDOperationScope.toJson

2020-06-29 Thread Taegeon Um
Hi, I’ve got the following exception when running a connected component example in Spark 3.0.0. This code runs without exception in Spark 2.4. It throws the exception when calling the RDDOperationScope.toJson method. I’m not sure why it throws the exception in Spark 3.0. There is no exception in
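
For context, a minimal spark-shell sketch of a connected-components job of the kind described (the graph data here is made up; the RDDOperationScope.toJson call happens inside Spark when it records operation scopes for the UI):

import org.apache.spark.graphx.{Edge, Graph}

// Tiny made-up graph: vertices {1,2,3} form one component, {4,5} another.
val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(4L, 5L, 1)))
val graph = Graph.fromEdges(edges, defaultValue = 0)
val cc = graph.connectedComponents()    // tags each vertex with its component's min id
cc.vertices.collect().foreach(println)  // e.g. (1,1), (2,1), (3,1), (4,4), (5,4)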

[Debug] [Spark Core 2.4.4] org.apache.spark.storage.BlockException: Negative block size -9223372036854775808

2020-06-29 Thread Adam Tobey
Hi, I'm encountering a strange exception in Spark 2.4.4 (on AWS EMR 5.29): org.apache.spark.storage.BlockException: Negative block size -9223372036854775808. I've seen this mostly from this line (for remote blocks)
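
A side note that may help with debugging (an observation, not from the thread): -9223372036854775808 is exactly Long.MinValue, which suggests a sentinel or an uninitialized/overflowed length rather than an ordinary arithmetic result:

// The reported size is exactly Long.MinValue:
assert(Long.MinValue.toString == "-9223372036854775808")
// And abs() cannot make it positive (two's complement), so a defensive
// math.abs somewhere upstream would not mask such a value either:
println(math.abs(Long.MinValue))  // prints -9223372036854775808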

Re: Spark Small file issue

2020-06-29 Thread Hichki
All 800 files (in a partition folder) have sizes in bytes. They sum up to 200 MB, which is each partition folder's input size. And I am using ORC format; never used Parquet format.

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-29 Thread Sanjeev Mishra
Done. https://issues.apache.org/jira/browse/SPARK-32130 On Mon, Jun 29, 2020 at 8:21 AM Maxim Gekk wrote: > Hello Sanjeev, > > It is hard to troubleshoot the issue without input files. Could you open > a JIRA ticket at https://issues.apache.org/jira/projects/SPARK and > attach the JSON files

Re: Spark Small file issue

2020-06-29 Thread Bobby Evans
So I should have done some back-of-the-napkin math before all of this. You are writing out 800 files, each < 128 MB. If they were 128 MB then it would be 100GB of data being written. I'm not sure how much hardware you have, but the fact that you can shuffle about 100GB to a single thread and
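
For reference, a hedged sketch of the usual remedy in such threads: cap the number of output files explicitly instead of writing 800 tiny ones (the paths and file counts are illustrative, not from the original job):

val df = spark.read.orc("/warehouse/source_table")  // hypothetical source

// coalesce avoids a full shuffle; 2 output files for a ~200 MB partition
// keeps each file near 100 MB instead of a few hundred KB.
df.coalesce(2)
  .write
  .mode("overwrite")
  .orc("/warehouse/target/part_date=2020-06-29")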

unsubscribe

2020-06-29 Thread obaidul karim

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-29 Thread Maxim Gekk
Hello Sanjeev, It is hard to troubleshoot the issue without input files. Could you open a JIRA ticket at https://issues.apache.org/jira/projects/SPARK and attach the JSON files there (or samples, or code which generates the JSON files)? Maxim Gekk Software Engineer Databricks, Inc. On Mon, Jun

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-29 Thread Sanjeev Mishra
It has read everything. As you can notice, the timing of count is still smaller in Spark 2.4.

Spark 2.4:
scala> spark.time(spark.read.json("/data/20200528"))
Time taken: 19691 ms
res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]
scala> spark.time(res61.count())
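
One hedged diagnostic, not suggested in the thread itself: pass an explicit schema so the timing compares pure parsing and takes Spark 3.0's schema-inference pass out of the picture. The field names below come from the res61 output above; the remaining five fields are omitted:

import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("created", LongType),
  StructField("id", StringType)
  // ... plus the 5 more fields shown in the DataFrame above
))
spark.time(spark.read.schema(schema).json("/data/20200528").count())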

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-29 Thread ArtemisDev
Could you share your code? Are you sure your Spark 2.4 cluster had indeed read anything? Looks like the Input size field is empty under 2.4. -- ND On 6/27/20 7:58 PM, Sanjeev Mishra wrote: I have a large amount of json files that Spark can read in 36 seconds but Spark 3.0 takes almost 33

Re: Spark Small file issue

2020-06-29 Thread Hichki
Hi, I am doing the repartition at the end, that is, just before the INSERT OVERWRITE into the table. I see the last step (the repartition) is taking more time.
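
A sketch of the step being described, with hypothetical table and column names; repartitioning by the partition column just before the write concentrates the cost in one final shuffle but leaves each partition with a few large files:

import org.apache.spark.sql.functions.col

val df = spark.table("db.source_table")  // hypothetical source
df.repartition(col("part_col"))          // roughly one output file per partition value
  .write
  .mode("overwrite")                     // INSERT OVERWRITE semantics
  .insertInto("db.target_table")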

File Not Found: /tmp/spark-events in Spark 3.0

2020-06-29 Thread ArtemisDev
While launching a Spark job from Zeppelin against a standalone Spark cluster (Spark 3.0 with multiple workers, without Hadoop), we have encountered a Spark interpreter exception caused by an I/O File Not Found exception due to the non-existence of the /tmp/spark-events directory. We had to
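
For anyone hitting the same launch failure, these are the relevant knobs (a sketch, assuming the interpreter enables event logging; the HDFS path is an example): either create /tmp/spark-events up front or point the event log dir at a directory that already exists:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("zeppelin-job")
  .config("spark.eventLog.enabled", "true")
  .config("spark.eventLog.dir", "hdfs:///spark-events")  // directory must exist beforehand
  .getOrCreate()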

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-29 Thread Sanjeev Mishra
There is not much code; I am using the spark-shell provided by Spark 2.4 and Spark 3. val dp = spark.read.json("/Users//data/dailyparams/20200528") On Mon, Jun 29, 2020 at 2:25 AM Gourav Sengupta wrote: > Hi, > > can you please share the SPARK code? > > > > Regards, > Gourav > > On Sun, Jun 28,

Announcing ApacheCon @Home 2020

2020-06-29 Thread Rich Bowen
Hi, Apache enthusiast! (You’re receiving this because you’re subscribed to one or more dev or user mailing lists for an Apache Software Foundation project.) The ApacheCon Planners and the Apache Software Foundation are pleased to announce that ApacheCon @Home will be held online, September

Re: java.lang.ClassNotFoundException for s3a comitter

2020-06-29 Thread Steve Loughran
You are going to need hadoop-3.1 on your classpath, with hadoop-aws and the same aws-sdk it was built with (1.11.something). Mixing Hadoop JARs is doomed. Using a different AWS SDK jar is a bit risky, though more recent upgrades have all been fairly low stress. On Fri, 19 Jun 2020 at 05:39, murat
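
A build.sbt sketch of the pairing being described; the versions below are illustrative of "keep them in lock-step", not prescriptive:

// hadoop-aws transitively pulls the aws-java-sdk-bundle it was built against;
// do not pin a different SDK, and never mix hadoop-* artifacts across releases.
libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-sql"  % "3.0.0" % Provided,
  "org.apache.hadoop" %  "hadoop-aws" % "3.1.2"
)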

Re: Spark 3 pod template for the driver

2020-06-29 Thread Michel Sumbul
Hello, Adding the dev mailing list; maybe there is someone here who can help to provide/show a valid/accepted pod template for Spark 3? Thanks in advance, Michel On Fri, Jun 26, 2020 at 14:03, Michel Sumbul wrote: > Hi Jorge, > If I set that in the spark submit command it works but I want it
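
Not an authoritative template, but for reference these are the Spark 3 properties that point at one (the template file itself is a standard Kubernetes pod spec; the paths are hypothetical):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.kubernetes.driver.podTemplateFile", "/opt/templates/driver-pod.yaml")
  .set("spark.kubernetes.executor.podTemplateFile", "/opt/templates/executor-pod.yaml")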

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-29 Thread Gourav Sengupta
Hi, can you please share the SPARK code? Regards, Gourav On Sun, Jun 28, 2020 at 12:58 AM Sanjeev Mishra wrote: > > I have a large amount of json files that Spark can read in 36 seconds but > Spark 3.0 takes almost 33 minutes to read the same. On closer analysis, it > looks like Spark 3.0 is