Re: Executorlost failure

2022-04-07 Thread Wes Peng
I just did a test: even on a single node (local deployment), Spark can handle data whose size is much larger than the total memory. My test VM (2g ram, 2 cores): $ free -m total used free shared buff/cache available Mem: 1992 1845

Re: Executorlost failure

2022-04-07 Thread rajat kumar
With autoscaling you can have any number of executors. Thanks On Fri, Apr 8, 2022, 08:27 Wes Peng wrote: > I once had a file which is 100+GB getting computed in 3 nodes, each node > has 24GB memory only. And the job could be done well. So from my > experience spark cluster seems to work correctly
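For reference, executor autoscaling in Spark is driven by dynamic allocation; a minimal configuration sketch (the property names are from the standard Spark configuration, the values are illustrative only, not recommendations):

```properties
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.minExecutors=1
spark.dynamicAllocation.maxExecutors=50
# Needed when no external shuffle service is available (Spark 3.0+)
spark.dynamicAllocation.shuffleTracking.enabled=true
```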

Re: Aggregate over a column: the proper way to do

2022-04-07 Thread sam smith
My bad, yes of course! Still, I don't like the .select("count(myCol)") part in my line; is there any replacement for that? On Fri, Apr 8, 2022 at 06:13, Sean Owen wrote: > Just do an average then? Most of my point is that filtering to one group > and then grouping is pointless. > > On

Re: Aggregate over a column: the proper way to do

2022-04-07 Thread sam smith
What if I do avg instead of count? On Fri, Apr 8, 2022 at 05:32, Sean Owen wrote: > Wait, why groupBy at all? After the filter only rows with myCol equal to > your target are left. There is only one group. Don't group, just count after > the filter? > > On Thu, Apr 7, 2022, 10:27 PM sam smith
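An average works the same way as the count discussed in this thread: filter first, then aggregate over whatever rows remain. A plain-Python sketch of that logic (in Spark this would be roughly `df.filter(...).agg(avg("val"))`; the column names and values here are illustrative):

```python
# Rows as dicts stand in for DataFrame rows; "myCol" / "val" are
# illustrative names, not from any real schema.
rows = [
    {"myCol": "myTargetValue", "val": 10.0},
    {"myCol": "other", "val": 99.0},
    {"myCol": "myTargetValue", "val": 20.0},
]

# Filter first; only matching rows remain, so no groupBy is needed
# before averaging.
matching = [r["val"] for r in rows if r["myCol"] == "myTargetValue"]
average = sum(matching) / len(matching)
print(average)  # 15.0
```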

Re: Aggregate over a column: the proper way to do

2022-04-07 Thread Sean Owen
Wait, why groupBy at all? After the filter only rows with myCol equal to your target are left. There is only one group. Don't group, just count after the filter? On Thu, Apr 7, 2022, 10:27 PM sam smith wrote: > I want to aggregate a column by counting the number of rows having the > value
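The point can be shown with a plain-Python stand-in for the DataFrame (the Spark equivalent would be something like `df.filter(col("myCol").equalTo(target)).count()`; the names here are illustrative):

```python
# Rows as dicts stand in for DataFrame rows; "myCol" and the target
# value come from the thread, the rest is illustrative.
rows = [
    {"myCol": "myTargetValue", "x": 1},
    {"myCol": "other", "x": 2},
    {"myCol": "myTargetValue", "x": 3},
]

# After the filter, every remaining row has the same myCol value...
filtered = [r for r in rows if r["myCol"] == "myTargetValue"]

# ...so grouping produces exactly one group, and its count equals a
# plain count of the filtered rows.
count_via_group = {}
for r in filtered:
    count_via_group[r["myCol"]] = count_via_group.get(r["myCol"], 0) + 1

assert count_via_group["myTargetValue"] == len(filtered)
print(len(filtered))  # 2
```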

Aggregate over a column: the proper way to do

2022-04-07 Thread sam smith
I want to aggregate a column by counting the number of rows having the value "myTargetValue" and return the result. I am doing it like the following in Java: > long result = >

Re: Executorlost failure

2022-04-07 Thread Wes Peng
I once had a file which is 100+GB getting computed on 3 nodes, each node having only 24GB of memory, and the job completed fine. So from my experience a Spark cluster seems to work correctly for big files larger than memory by spilling them to disk. Thanks rajat kumar wrote: Tested this with

Re: Executorlost failure

2022-04-07 Thread Wes Peng
how many executors do you have? rajat kumar wrote: Tested this with executors of size 5 cores, 17GB memory. Data vol is really high around 1TB

negative time duration in event log accumulables

2022-04-07 Thread wangcheng (AK)
Hi, I'm running Spark 2.4.4. When I execute a simple query "select * from table group by col", I found that the SparkListenerTaskEnd event in the event log reports negative time durations for "aggregate time total": {"ID":6,"Name":"aggregate time total (min, med,
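Entries like this can be surfaced with a small script; a sketch that checks one event-log line (Spark event logs are one JSON object per line) for negative time accumulables. The field names follow the fragment quoted above; the sample values are illustrative:

```python
import json

# One abridged SparkListenerTaskEnd event; the structure mirrors the
# event-log fragment quoted in the message above, values are made up.
line = ('{"Event":"SparkListenerTaskEnd","Task Info":{"Accumulables":'
        '[{"ID":6,"Name":"aggregate time total (min, med, max)",'
        '"Update":"-5","Value":"-5"}]}}')

event = json.loads(line)
negatives = [
    acc for acc in event.get("Task Info", {}).get("Accumulables", [])
    if "time" in acc.get("Name", "") and int(acc.get("Update", 0)) < 0
]
for acc in negatives:
    print(acc["ID"], acc["Name"], acc["Update"])
```

Running this over every line of a real event log would list all tasks reporting negative durations.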

Re: Executorlost failure

2022-04-07 Thread rajat kumar
Tested this with executors of size 5 cores, 17GB memory. Data vol is really high, around 1TB. Thanks Rajat On Thu, Apr 7, 2022, 23:43 rajat kumar wrote: > Hello Users, > > I got the following error; tried increasing executor memory and memory > overhead, that also did not help. > > ExecutorLost

Executorlost failure

2022-04-07 Thread rajat kumar
Hello Users, I got the following error; I tried increasing executor memory and memory overhead, but that also did not help. ExecutorLost Failure (executor1 exited caused by one of the following tasks) Reason: container from a bad node: java.lang.OutOfMemoryError: enough memory for aggregation Can
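For context, beyond raising executor memory, an out-of-memory during aggregation can sometimes be relieved by spreading the aggregation over more partitions. A spark-submit sketch with the knobs usually tried (all values are illustrative examples, not recommendations):

```shell
spark-submit \
  --executor-memory 17g \
  --executor-cores 5 \
  --conf spark.executor.memoryOverhead=3g \
  --conf spark.sql.shuffle.partitions=2000 \
  your-app.jar
```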

Re: Spark 3.0.1 and spark 3.2 compatibility

2022-04-07 Thread Sean Owen
(Don't cross-post please) Generally you definitely want to compile and test against what you're running on. There shouldn't be many binary or source incompatibilities -- these are avoided in a major release where possible. So it may need no code change. But I would certainly recompile just on

Spark 3.0.1 and spark 3.2 compatibility

2022-04-07 Thread Pralabh Kumar
Hi spark community, I have a quick question. I am planning to migrate from spark 3.0.1 to spark 3.2. Do I need to recompile my application with 3.2 dependencies, or will an application compiled with 3.0.1 work fine on 3.2? Regards Pralabh Kumar
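Concretely, recompiling means bumping the Spark artifacts in the build file; a Maven sketch assuming Scala 2.12 artifacts (the version numbers come from the thread, the rest is illustrative):

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.12</artifactId>
  <version>3.2.0</version>
  <!-- provided: the cluster supplies the Spark runtime -->
  <scope>provided</scope>
</dependency>
```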

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually faillure

2022-04-07 Thread Mich Talebzadeh
Since your Hbase is supported by the external vendor, I would ask them to justify their choice of storage for Hbase and any suggestions they have vis-a-vis S3 etc. Spark has an efficient API to Hbase, including remote Hbase. I have used it in the past for reading from Hbase. HTH

Re: query time comparison to several SQL engines

2022-04-07 Thread James Turton
What might be the biggest factor affecting running time here is that Drill's query execution is not fault tolerant while Spark's is. The philosophies are different; Drill's says "when you're doing interactive analytics and a node dies, killing your query as it goes, just run the query again."

Re: query time comparison to several SQL engines

2022-04-07 Thread Jacek Laskowski
Hi Wes, Thanks for the report! I like it (mostly because it's short and concise). Thank you. I know nothing about Drill and am curious about the similar execution times and this sentence in the report: "Spark is the second fastest, that should be reasonable, since both Spark and Drill have

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually faillure

2022-04-07 Thread Joris Billen
Thanks for pointing this out. So currently data is stored in hbase on adls. Question (sorry I might be ignorant): is it clear that parquet on s3 would be faster as storage to read from than hbase on adls? In general, I've found it hard, after my processing is done, if I have an application that

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually faillure

2022-04-07 Thread Bjørn Jørgensen
"4. S3: I am not using it, but people in the thread started suggesting potential solutions involving s3. It is an azure system, so hbase is stored on adls. In fact the nature of my application (geospatial stuff) requires me to use geomesa libs, which only allows directly writing from spark to

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually faillure

2022-04-07 Thread Mich Talebzadeh
Ok. Your architect has decided to emulate everything on-prem in the cloud. You are not really taking any advantage of cloud offerings or scalability. For example, how does your Hadoop cluster cater for the increased capacity? Likewise your Spark nodes are pigeonholed with your Hadoop nodes. Old wine

query time comparison to several SQL engines

2022-04-07 Thread Wes Peng
I made a simple test of query time for several SQL engines including mysql, hive, drill and spark. The report: https://cloudcache.net/data/query-time-mysql-hive-drill-spark.pdf It may have no special meaning, just for fun. :) Regards.

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually faillure

2022-04-07 Thread Bjørn Jørgensen
"But it will be faster to use S3 (or GCS) through some network and it will be faster than writing to the local SSD. I don't understand the point here." Minio is a S3 mock, so you run minio local. tor. 7. apr. 2022 kl. 09:27 skrev Mich Talebzadeh : > Ok so that is your assumption. The whole thing

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually faillure

2022-04-07 Thread Joris Billen
Thanks for the active discussion and for sharing your knowledge :-) 1. The cluster is a managed hadoop cluster on Azure in the cloud. It has Hbase, Spark, and HDFS shared. 2. Hbase is on the cluster, so not standalone. It comes from an enterprise-level template from a commercial vendor, so assuming

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually faillure

2022-04-07 Thread Mich Talebzadeh
Ok so that is your assumption. The whole thing is based on-premise on JBOD (including the hadoop cluster, which has Spark binaries on each node) as I understand. But it will be faster to use S3 (or GCS) through some network and it will be faster than writing to the local SSD. I don't

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually faillure

2022-04-07 Thread Bjørn Jørgensen
1. Where does S3 come into this? He is processing data one day at a time. So to dump each day to fast storage he can use parquet files and write them to S3. On Wed, Apr 6, 2022 at 22:27, Mich Talebzadeh wrote: > > Your statement below: > > > I believe I have found the issue: the job
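The pattern suggested here, processing one day and dumping it to fast storage before moving on, can be sketched with plain Python standing in for the Spark write (in Spark this would be something like `df.write.mode("overwrite").parquet(path_for_day)` per day, or a single `partitionBy("day")` write; all names and values below are illustrative):

```python
import os
import tempfile

# Records tagged with a day, standing in for one loop iteration's
# input; days and values are made up for illustration.
records = [("2022-04-06", 1), ("2022-04-06", 2), ("2022-04-07", 3)]

# Group by day, then write each day's batch to its own path, as one
# would dump each day's DataFrame to parquet on S3.
by_day = {}
for day, value in records:
    by_day.setdefault(day, []).append(value)

out_dir = tempfile.mkdtemp()
for day, values in by_day.items():
    with open(os.path.join(out_dir, f"{day}.txt"), "w") as f:
        f.write("\n".join(str(v) for v in values))

print(sorted(os.listdir(out_dir)))  # ['2022-04-06.txt', '2022-04-07.txt']
```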