Re: Executorlost failure

2022-04-07 Thread Wes Peng
I just did a test: even on a single node (local deployment), Spark can handle data whose size is much larger than the total memory.


My test VM (2g ram, 2 cores):

$ free -m
              total        used        free      shared  buff/cache   available
Mem:           1992        1845          92          19          54          36
Swap:          1023         285         738


The data size:

$ du -h rate.csv
3.2G    rate.csv


Loading this file into spark for calculation can be done without error:

scala> val df = spark.read.format("csv").option("inferSchema", true).load("skydrive/rate.csv")
val df: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 2 more fields]


scala> df.printSchema
warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: double (nullable = true)
 |-- _c3: integer (nullable = true)


scala> df.groupBy("_c1").agg(avg("_c2").alias("avg_rating")).orderBy(desc("avg_rating")).show
warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
+--+--+ 


|   _c1|avg_rating|
+--+--+
|000136|   5.0|
|0001711474|   5.0|
|0001360779|   5.0|
|0001006657|   5.0|
|0001361155|   5.0|
|0001018043|   5.0|
|000136118X|   5.0|
|202010|   5.0|
|0001371037|   5.0|
|401048|   5.0|
|0001371045|   5.0|
|0001203010|   5.0|
|0001381245|   5.0|
|0001048236|   5.0|
|0001436163|   5.0|
|000104897X|   5.0|
|0001437879|   5.0|
|0001056107|   5.0|
|0001468685|   5.0|
|0001061240|   5.0|
+--+--+
only showing top 20 rows


So as you can see, Spark handles a file larger than its memory just fine. :)

Thanks


rajat kumar wrote:

With autoscaling it can have any number of executors.


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Executorlost failure

2022-04-07 Thread rajat kumar
With autoscaling it can have any number of executors.

Thanks

On Fri, Apr 8, 2022, 08:27 Wes Peng  wrote:

> I once had a file which is 100+GB getting computed in 3 nodes, each node
> has 24GB memory only. And the job could be done well. So from my
> experience spark cluster seems to work correctly for big files larger
> than memory by swapping them to disk.
>
> Thanks
>
> rajat kumar wrote:
> > Tested this with executors of size 5 cores, 17GB memory. Data vol is
> > really high around 1TB
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Aggregate over a column: the proper way to do

2022-04-07 Thread sam smith
My bad, yes of course! Still, I don't like the .. select("count(myCol)") .. part in my line. Is there any replacement for that?

Le ven. 8 avr. 2022 à 06:13, Sean Owen  a écrit :

> Just do an average then? Most of my point is that filtering to one group
> and then grouping is pointless.
>
> On Thu, Apr 7, 2022, 11:10 PM sam smith 
> wrote:
>
>> What if i do avg instead of count?
>>
>> Le ven. 8 avr. 2022 à 05:32, Sean Owen  a écrit :
>>
>>> Wait, why groupBy at all? After the filter only rows with myCol equal to
>>> your target are left. There is only one group. Don't group just count after
>>> the filter?
>>>
>>> On Thu, Apr 7, 2022, 10:27 PM sam smith 
>>> wrote:
>>>
 I want to aggregate a column by counting the number of rows having the
 value "myTargetValue" and return the result
 I am doing it like the following:in JAVA

> long result =
> dataset.filter(dataset.col("myCol").equalTo("myTargetVal")).groupBy(col("myCol")).agg(count(dataset.col("myCol"))).select("count(myCol)").first().getLong(0);


 Is that the right way? if no, what if a more optimized way to do that
 (always in JAVA)?
 Thanks for the help.

>>>


Re: Aggregate over a column: the proper way to do

2022-04-07 Thread sam smith
What if I do avg instead of count?

Le ven. 8 avr. 2022 à 05:32, Sean Owen  a écrit :

> Wait, why groupBy at all? After the filter only rows with myCol equal to
> your target are left. There is only one group. Don't group just count after
> the filter?
>
> On Thu, Apr 7, 2022, 10:27 PM sam smith 
> wrote:
>
>> I want to aggregate a column by counting the number of rows having the
>> value "myTargetValue" and return the result
>> I am doing it like the following:in JAVA
>>
>>> long result =
>>> dataset.filter(dataset.col("myCol").equalTo("myTargetVal")).groupBy(col("myCol")).agg(count(dataset.col("myCol"))).select("count(myCol)").first().getLong(0);
>>
>>
>> Is that the right way? if no, what if a more optimized way to do that
>> (always in JAVA)?
>> Thanks for the help.
>>
>


Re: Aggregate over a column: the proper way to do

2022-04-07 Thread Sean Owen
Wait, why groupBy at all? After the filter, only rows with myCol equal to your target are left. There is only one group. Don't group; just count after the filter.

On Thu, Apr 7, 2022, 10:27 PM sam smith  wrote:

> I want to aggregate a column by counting the number of rows having the
> value "myTargetValue" and return the result
> I am doing it like the following:in JAVA
>
>> long result =
>> dataset.filter(dataset.col("myCol").equalTo("myTargetVal")).groupBy(col("myCol")).agg(count(dataset.col("myCol"))).select("count(myCol)").first().getLong(0);
>
>
> Is that the right way? if no, what if a more optimized way to do that
> (always in JAVA)?
> Thanks for the help.
>


Aggregate over a column: the proper way to do

2022-04-07 Thread sam smith
I want to aggregate a column by counting the number of rows having the value "myTargetValue" and return the result. I am doing it like the following, in Java:

> long result =
> dataset.filter(dataset.col("myCol").equalTo("myTargetVal")).groupBy(col("myCol")).agg(count(dataset.col("myCol"))).select("count(myCol)").first().getLong(0);


Is that the right way? If not, what is a more optimized way to do that (still in Java)?
Thanks for the help.


Re: Executorlost failure

2022-04-07 Thread Wes Peng
I once had a 100+GB file computed on 3 nodes, each node with only 24GB of memory, and the job completed fine. So from my experience a Spark cluster handles big files larger than memory correctly, by spilling them to disk.
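
A minimal sketch (not from the original mail) of making that disk behaviour explicit in Java: shuffle data already spills to local disk on its own, and a cached DataFrame can be made disk-backed with MEMORY_AND_DISK so partitions that do not fit in memory are written out instead of failing. The path, column and output name are placeholders:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.storage.StorageLevel;

public class SpillToDiskSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("spill-to-disk").getOrCreate();

        // Placeholder path; the file can be much larger than the executor memory.
        Dataset<Row> df = spark.read().option("inferSchema", "true").csv("rate.csv");

        // MEMORY_AND_DISK writes cached partitions that do not fit in memory to local disk
        // instead of recomputing them or failing the job.
        df.persist(StorageLevel.MEMORY_AND_DISK());

        // A wide aggregation; its shuffle files go to local disk regardless of caching.
        df.groupBy("_c1").count().write().mode("overwrite").parquet("counts_by_c1");

        df.unpersist();
        spark.stop();
    }
}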


Thanks

rajat kumar wrote:
Tested this with executors of size 5 cores, 17GB memory. Data vol is 
really high around 1TB


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Executorlost failure

2022-04-07 Thread Wes Peng

how many executors do you have?

rajat kumar wrote:
Tested this with executors of size 5 cores, 17GB memory. Data vol is 
really high around 1TB


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



negative time duration in event log accumulables

2022-04-07 Thread wangcheng (AK)
Hi,

I'm running Spark 2.4.4. When I execute a simple query "select * from table group by col", I find that the SparkListenerTaskEnd event in the event log reports all negative time durations for "aggregate time total":

{"ID":6,"Name":"aggregate time total (min, med, 
max)","Update":"2","Value":"-46","Internal":true,"Count Failed 
Values":true,"Metadata":"sql"}

The same thing happens in SparkListenerStageCompleted event:

{"ID":6,"Name":"aggregate time total (min, med, 
max)","Value":"-133","Internal":true,"Count Failed 
Values":true,"Metadata":"sql"}

Then I checked the history server web UI, but the SQL tab displays positive 
numbers for the HashAggregate operator:

aggregate time total (min, med, max): 35 ms (0 ms, 2 ms, 6 ms)

I'm wondering whether this is a bug in Spark 2.4. If not, how does Spark compute the "aggregate time total" from those negative numbers?

Thanks


Re: Executorlost failure

2022-04-07 Thread rajat kumar
Tested this with executors of 5 cores and 17GB memory each. The data volume is really high, around 1TB.

Thanks
Rajat

On Thu, Apr 7, 2022, 23:43 rajat kumar  wrote:

> Hello Users,
>
> I got following error, tried increasing executor memory and memory
> overhead that also did not help .
>
> ExecutorLost Failure(executor1 exited caused by one of the following
> tasks) Reason: container from a bad node:
>
> java.lang.OutOfMemoryError: enough memory for aggregation
>
>
> Can someone please suggest ?
>
> Thanks
> Rajat
>


Executorlost failure

2022-04-07 Thread rajat kumar
Hello Users,

I got the following error. I tried increasing executor memory and memory overhead, but that also did not help.

ExecutorLost Failure(executor1 exited caused by one of the following tasks)
Reason: container from a bad node:

java.lang.OutOfMemoryError: enough memory for aggregation


Can someone please suggest?
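
A hedged sketch (not from the original mail) of the kind of settings commonly tried for "not enough memory for aggregation": more executor and overhead memory, plus more shuffle partitions so each task holds less aggregation state. All values are illustrative; on a cluster these are normally passed to spark-submit rather than set in code:

import org.apache.spark.sql.SparkSession;

public class MemoryTuningSketch {
    public static void main(String[] args) {
        // Illustrative values only; the right numbers depend on the cluster and on data skew.
        SparkSession spark = SparkSession.builder()
                .appName("memory-tuning-sketch")
                .config("spark.executor.memory", "17g")
                .config("spark.executor.memoryOverhead", "4g")
                .config("spark.sql.shuffle.partitions", "2000") // smaller partitions, less per-task aggregation state
                .config("spark.memory.fraction", "0.7")
                .getOrCreate();

        // ... the failing aggregation job would run here ...

        spark.stop();
    }
}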

Thanks
Rajat


Re: Spark 3.0.1 and spark 3.2 compatibility

2022-04-07 Thread Sean Owen
(Please don't cross-post.)
Generally you definitely want to compile and test against what you're running on.
There shouldn't be many binary or source incompatibilities -- these are
avoided in a major release where possible. So it may need no code change.
But I would certainly recompile just on principle!

On Thu, Apr 7, 2022 at 12:28 PM Pralabh Kumar 
wrote:

> Hi spark community
>
> I have quick question .I am planning to migrate from spark 3.0.1 to spark
> 3.2.
>
> Do I need to recompile my application with 3.2 dependencies or application
> compiled with 3.0.1 will work fine on 3.2 ?
>
>
> Regards
> Pralabh kumar
>
>


Spark 3.0.1 and spark 3.2 compatibility

2022-04-07 Thread Pralabh Kumar
Hi spark community

I have a quick question. I am planning to migrate from Spark 3.0.1 to Spark 3.2.

Do I need to recompile my application with 3.2 dependencies, or will an application compiled with 3.0.1 work fine on 3.2?


Regards
Pralabh kumar


Re: loop of spark jobs leads to increase in memory on worker nodes and eventually faillure

2022-04-07 Thread Mich Talebzadeh
Since your HBase is supported by the external vendor, I would ask them to justify their choice of storage for HBase and for any suggestions they have vis-a-vis S3 etc.

Spark has an efficient API to HBase, including remote HBase. I have used it in the past for reading from HBase.


HTH




   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 7 Apr 2022 at 13:38, Joris Billen 
wrote:

> Thanks for pointing this out.
>
> So currently data is stored in hbase on adls. Question (sorry I might be
> ignorant): is it clear that parquet on s3 would be faster as storage to
> read from than hbase on adls?
> In general, I ve found it hard after my processing is done, if I have an
> application that needs to read all data from hbase (full large tables) to
> get this as fast as possible.
> This read speed is important to me, but it is limited (I think) by the
> time it will take to read the data from the cloud storage (adls).
> You can change some parameters (like regionserver heap , block.cache size,
>  memstore global size) depending on if you are processing/write a lot OR
> reading from hbase. What I would find really useful if one of these
> autoscaling systems could also optimize these parameters depending if youre
> reading or writing.
>
>
>
> Wrt architecture: indeed separate spark from hbase would be best , but I
> never got it to write from an outside spark cluster.  For autoscaling, I
> know there are hbase cloud offerings that have elastic scaling so indeed
> that could be an improvement too.
>
>
>
> ANyhow, fruitful discussion.
>
>
>
>
>
> On 7 Apr 2022, at 13:46, Bjørn Jørgensen  wrote:
>
> "4. S3: I am not using it, but people in the thread started suggesting
> potential solutions involving s3. It is an azure system, so hbase is stored
> on adls. In fact the nature of my application (geospatial stuff) requires
> me to use geomesa libs, which only allows directly writing from spark to
> hbase. So I can not write to some other format (the geomesa API is not
> designed for that-it only writes directly to hbase using the predetermined
> key/values)."
>
> In the docs for geomesa it looks like it can write to files. They say to
> AWS which S3 is a part of and " The quick start comes pre-configured to
> use Apache’s Parquet encoding."
>
> http://www.geomesa.org/documentation/current/tutorials/geomesa-quickstart-fsds.html
> 
>
>
>
> tor. 7. apr. 2022 kl. 13:30 skrev Mich Talebzadeh <
> mich.talebza...@gmail.com>:
>
>> Ok. Your architect has decided to emulate anything on prem to the
>> cloud.You are not really taking any advantages of cloud offerings or
>> scalability. For example, how does your Hadoop clustercater for the
>> increased capacity. Likewise your spark nodes are pigeonholed with your
>> Hadoop nodes.  Old wine in a new bottle :)
>>
>> HTH
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>> 
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Thu, 7 Apr 2022 at 09:20, Joris Billen 
>> wrote:
>>
>>> Thanks for 

Re: query time comparison to several SQL engines

2022-04-07 Thread James Turton
What might be the biggest factor affecting running time here is that Drill's query execution is not fault tolerant while Spark's is. The philosophies are different; Drill's says "when you're doing interactive analytics and a node dies, killing your query as it goes, just run the query again."


On 2022/04/07 16:11, Wes Peng wrote:


Hi Jacek,

Spark and Drill have no direct relation, but they have similar architectures.


If you read the book "Learning Apache Drill" (I guess it's free 
online), chap 3 will give you Drill's SQL engine architecture:



It's quite similar to Spark's.

And the distributed implementation architecture is almost the same as Spark's.



Though they are separate products, they have similar implementations IMO.


No, I didn't use a statement optimized for Drill. It's just a common 
SQL statement.


As for why Drill is faster, I think it's because of Drill's direct mmap technology. It consumes more memory than Spark, and so is faster.


Thanks.


Jacek Laskowski wrote:
Is this true that Drill is Spark or vice versa under the hood? If so, 
how is it possible that Drill is faster? What does Drill do to make 
the query faster? Could this be that you used a type of query Drill 
is optimized for? Just guessing and am really curious (not implying 
that one is better or worse than the other(s)).



-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: query time comparison to several SQL engines

2022-04-07 Thread Jacek Laskowski
Hi Wes,

Thanks for the report! I like it (mostly because it's short and concise).
Thank you.

I know nothing about Drill and am curious about the similar execution times
and this sentence in the report: "Spark is the second fastest, that should
be reasonable, since both Spark and Drill have almost the same
implementation architecture.".

Is it true that Drill is Spark, or vice versa, under the hood? If so, how is it possible that Drill is faster? What does Drill do to make the query faster? Could it be that you used a type of query Drill is optimized for?
Just guessing and am really curious (not implying that one is better or
worse than the other(s)).

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books 
Follow me on https://twitter.com/jaceklaskowski




On Thu, Apr 7, 2022 at 1:05 PM Wes Peng  wrote:

> I made a simple test to query time for several SQL engines including
> mysql, hive, drill and spark. The report,
>
> https://cloudcache.net/data/query-time-mysql-hive-drill-spark.pdf
>
> It maybe have no special meaning, just for fun. :)
>
> regards.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: loop of spark jobs leads to increase in memory on worker nodes and eventually faillure

2022-04-07 Thread Joris Billen
Thanks for pointing this out.

So currently the data is stored in HBase on ADLS. Question (sorry, I might be ignorant): is it clear that Parquet on S3 would be faster to read from than HBase on ADLS?
In general, once my processing is done, I've found it hard to make an application that needs to read all data from HBase (full large tables) do so as fast as possible.
This read speed is important to me, but it is limited (I think) by the time it takes to read the data from the cloud storage (ADLS).
You can change some parameters (like regionserver heap, block.cache size, memstore global size) depending on whether you are processing/writing a lot or reading from HBase. What I would find really useful is if one of these autoscaling systems could also optimize these parameters depending on whether you're reading or writing.



Wrt architecture: indeed, separating Spark from HBase would be best, but I never got it to write from an outside Spark cluster. For autoscaling, I know there are HBase cloud offerings that have elastic scaling, so indeed that could be an improvement too.



Anyhow, fruitful discussion.





On 7 Apr 2022, at 13:46, Bjørn Jørgensen 
mailto:bjornjorgen...@gmail.com>> wrote:

"4. S3: I am not using it, but people in the thread started suggesting 
potential solutions involving s3. It is an azure system, so hbase is stored on 
adls. In fact the nature of my application (geospatial stuff) requires me to 
use geomesa libs, which only allows directly writing from spark to hbase. So I 
can not write to some other format (the geomesa API is not designed for that-it 
only writes directly to hbase using the predetermined key/values)."

In the docs for geomesa it looks like it can write to files. They say to AWS 
which S3 is a part of and " The quick start comes pre-configured to use 
Apache’s Parquet encoding."
http://www.geomesa.org/documentation/current/tutorials/geomesa-quickstart-fsds.html



tor. 7. apr. 2022 kl. 13:30 skrev Mich Talebzadeh 
mailto:mich.talebza...@gmail.com>>:
Ok. Your architect has decided to emulate anything on prem to the cloud.You are 
not really taking any advantages of cloud offerings or scalability. For 
example, how does your Hadoop clustercater for the increased capacity. Likewise 
your spark nodes are pigeonholed with your Hadoop nodes.  Old wine in a new 
bottle :)

HTH

 
   view my Linkedin 
profile

 
https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Thu, 7 Apr 2022 at 09:20, Joris Billen 
mailto:joris.bil...@bigindustries.be>> wrote:
Thanks for active discussion and sharing your knowledge :-)


1.Cluster is a managed hadoop cluster on Azure in the cloud. It has hbase, and 
spark, and hdfs shared .
2.Hbase is on the cluster, so not standalone. It comes from an enterprise-level 
template from a commercial vendor, so assuming this is correctly installed.
3.I know that woudl be best to have a spark cluster to do the processing and 
then write to a separate hbase cluster.. but alas :-( somehow we found this to 
be buggy so we have it all on one cluster.
4. S3: I am not using it, but people in the thread started suggesting potential 
solutions involving s3. It is an azure system, so hbase is stored on adls. In 
fact the nature of my application (geospatial stuff) requires 

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually faillure

2022-04-07 Thread Bjørn Jørgensen
"4. S3: I am not using it, but people in the thread started suggesting
potential solutions involving s3. It is an azure system, so hbase is stored
on adls. In fact the nature of my application (geospatial stuff) requires
me to use geomesa libs, which only allows directly writing from spark to
hbase. So I can not write to some other format (the geomesa API is not
designed for that-it only writes directly to hbase using the predetermined
key/values)."

In the docs for geomesa it looks like it can write to files. They say to
AWS which S3 is a part of and " The quick start comes pre-configured to use
Apache’s Parquet encoding."
http://www.geomesa.org/documentation/current/tutorials/geomesa-quickstart-fsds.html



tor. 7. apr. 2022 kl. 13:30 skrev Mich Talebzadeh :

> Ok. Your architect has decided to emulate anything on prem to the
> cloud.You are not really taking any advantages of cloud offerings or
> scalability. For example, how does your Hadoop clustercater for the
> increased capacity. Likewise your spark nodes are pigeonholed with your
> Hadoop nodes.  Old wine in a new bottle :)
>
>
> HTH
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 7 Apr 2022 at 09:20, Joris Billen 
> wrote:
>
>> Thanks for active discussion and sharing your knowledge :-)
>>
>>
>> 1.Cluster is a managed hadoop cluster on Azure in the cloud. It has
>> hbase, and spark, and hdfs shared .
>> 2.Hbase is on the cluster, so not standalone. It comes from an
>> enterprise-level template from a commercial vendor, so assuming this is
>> correctly installed.
>> 3.I know that woudl be best to have a spark cluster to do the processing
>> and then write to a separate hbase cluster.. but alas :-( somehow we found
>> this to be buggy so we have it all on one cluster.
>> 4. S3: I am not using it, but people in the thread started suggesting
>> potential solutions involving s3. It is an azure system, so hbase is stored
>> on adls. In fact the nature of my application (geospatial stuff) requires
>> me to use geomesa libs, which only allows directly writing from spark to
>> hbase. So I can not write to some other format (the geomesa API is not
>> designed for that-it only writes directly to hbase using the predetermined
>> key/values).
>>
>> Forgot to mention: I do unpersist my df that was cached.
>>
>> Nevertheless I think I understand the problem now, this discussion is
>> still interesting!
>> So the root cause is : the hbase region server has memory assigned to it
>> (like 20GB). I see when I start writing from spark to hbase, not much of
>> this is used. I have loops of processing 1 day in spark. For each loop, the
>> regionserver heap is filled a bit more. Since I also overcommitted memory
>> in my cluster (have used in the setup more than really is available), tfter
>> several loops it starts to use more and more of the 20GB and eventually the
>> overall cluster starts to  hit the memory that is available on the workers.
>> The solution is to lower the hbase regionserver heap memory, so Im not
>> overcommitted anymore. In fact, high regionserver memory is more important
>> when I read my data, since then it helps a lot to cache data and to have
>> faster reads. For writing it is not important to have such a high value.
>>
>>
>> Thanks,
>> Joris
>>
>>
>> On 7 Apr 2022, at 09:26, Mich Talebzadeh 
>> wrote:
>>
>> Ok so that is your assumption. The whole thing is based on-premise on
>> JBOD (including hadoop cluster which has Spark binaries on each node as I
>> understand) as I understand. But it will be faster to use S3 (or GCS)
>> through some network and it will be faster than writing to the local SSD. I
>> don't understand the point here.
>>
>> Also it appears the thread owner is talking about having HBase on Hadoop
>> cluster on some node eating memory.  This can be easily sorted by moving
>> HBase to its own cluster, which will ease up Hadoop, Spark and HBase
>> competing for resources. It is possible that the issue is with HBase setup
>> as well.
>>
>> HTH
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>> 

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually faillure

2022-04-07 Thread Mich Talebzadeh
Ok. Your architect has decided to emulate everything on-prem in the cloud. You are not really taking advantage of cloud offerings or scalability. For example, how does your Hadoop cluster cater for the increased capacity? Likewise, your Spark nodes are pigeonholed with your Hadoop nodes. Old wine in a new bottle :)


HTH


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 7 Apr 2022 at 09:20, Joris Billen 
wrote:

> Thanks for active discussion and sharing your knowledge :-)
>
>
> 1.Cluster is a managed hadoop cluster on Azure in the cloud. It has hbase,
> and spark, and hdfs shared .
> 2.Hbase is on the cluster, so not standalone. It comes from an
> enterprise-level template from a commercial vendor, so assuming this is
> correctly installed.
> 3.I know that woudl be best to have a spark cluster to do the processing
> and then write to a separate hbase cluster.. but alas :-( somehow we found
> this to be buggy so we have it all on one cluster.
> 4. S3: I am not using it, but people in the thread started suggesting
> potential solutions involving s3. It is an azure system, so hbase is stored
> on adls. In fact the nature of my application (geospatial stuff) requires
> me to use geomesa libs, which only allows directly writing from spark to
> hbase. So I can not write to some other format (the geomesa API is not
> designed for that-it only writes directly to hbase using the predetermined
> key/values).
>
> Forgot to mention: I do unpersist my df that was cached.
>
> Nevertheless I think I understand the problem now, this discussion is
> still interesting!
> So the root cause is : the hbase region server has memory assigned to it
> (like 20GB). I see when I start writing from spark to hbase, not much of
> this is used. I have loops of processing 1 day in spark. For each loop, the
> regionserver heap is filled a bit more. Since I also overcommitted memory
> in my cluster (have used in the setup more than really is available), tfter
> several loops it starts to use more and more of the 20GB and eventually the
> overall cluster starts to  hit the memory that is available on the workers.
> The solution is to lower the hbase regionserver heap memory, so Im not
> overcommitted anymore. In fact, high regionserver memory is more important
> when I read my data, since then it helps a lot to cache data and to have
> faster reads. For writing it is not important to have such a high value.
>
>
> Thanks,
> Joris
>
>
> On 7 Apr 2022, at 09:26, Mich Talebzadeh 
> wrote:
>
> Ok so that is your assumption. The whole thing is based on-premise on JBOD
> (including hadoop cluster which has Spark binaries on each node as I
> understand) as I understand. But it will be faster to use S3 (or GCS)
> through some network and it will be faster than writing to the local SSD. I
> don't understand the point here.
>
> Also it appears the thread owner is talking about having HBase on Hadoop
> cluster on some node eating memory.  This can be easily sorted by moving
> HBase to its own cluster, which will ease up Hadoop, Spark and HBase
> competing for resources. It is possible that the issue is with HBase setup
> as well.
>
> HTH
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
> 
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 7 Apr 2022 at 08:11, Bjørn Jørgensen 
> wrote:
>
>>
>>1. Where does S3 come into this
>>
>> He is processing data for each day at a time. So to dump 

query time comparison to several SQL engines

2022-04-07 Thread Wes Peng
I made a simple test of query time for several SQL engines, including MySQL, Hive, Drill and Spark. The report:


https://cloudcache.net/data/query-time-mysql-hive-drill-spark.pdf

It may have no special meaning, just for fun. :)

regards.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: loop of spark jobs leads to increase in memory on worker nodes and eventually faillure

2022-04-07 Thread Bjørn Jørgensen
"But it will be faster to use S3 (or GCS) through some network and it will
be faster than writing to the local SSD. I don't understand the point
here."
Minio is a S3 mock, so you run minio local.

tor. 7. apr. 2022 kl. 09:27 skrev Mich Talebzadeh :

> Ok so that is your assumption. The whole thing is based on-premise on JBOD
> (including hadoop cluster which has Spark binaries on each node as I
> understand) as I understand. But it will be faster to use S3 (or GCS)
> through some network and it will be faster than writing to the local SSD. I
> don't understand the point here.
>
> Also it appears the thread owner is talking about having HBase on Hadoop
> cluster on some node eating memory.  This can be easily sorted by moving
> HBase to its own cluster, which will ease up Hadoop, Spark and HBase
> competing for resources. It is possible that the issue is with HBase setup
> as well.
>
> HTH
>
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 7 Apr 2022 at 08:11, Bjørn Jørgensen 
> wrote:
>
>>
>>1. Where does S3 come into this
>>
>> He is processing data for each day at a time. So to dump each day to a
>> fast storage he can use parquet files and write it to S3.
>>
>> ons. 6. apr. 2022 kl. 22:27 skrev Mich Talebzadeh <
>> mich.talebza...@gmail.com>:
>>
>>>
>>> Your statement below:
>>>
>>>
>>> I believe I have found the issue: the job writes data to hbase which is
>>> on the same cluster.
>>> When I keep on processing data and writing with spark to hbase ,
>>> eventually the garbage collection can not keep up anymore for hbase, and
>>> the hbase memory consumption increases. As the clusters hosts both hbase
>>> and spark, this leads to an overall increase and at some point you hit the
>>> limit of the available memory on each worker.
>>> I dont think the spark memory is increasing over time.
>>>
>>>
>>>1. Where is your cluster on Prem? Do you Have a Hadoop cluster
>>>with spark using the same nodes as HDFS?
>>>2. Is your Hbase clustered or standalone and has been created on
>>>HDFS nodes
>>>3. Are you writing to Hbase through phoenix or straight to HBase
>>>4. Where does S3 come into this
>>>
>>>
>>> HTH
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Wed, 6 Apr 2022 at 16:41, Joris Billen 
>>> wrote:
>>>
 HI,
 thanks for your reply.


 I believe I have found the issue: the job writes data to hbase which is
 on the same cluster.
 When I keep on processing data and writing with spark to hbase ,
 eventually the garbage collection can not keep up anymore for hbase, and
 the hbase memory consumption increases. As the clusters hosts both hbase
 and spark, this leads to an overall increase and at some point you hit the
 limit of the available memory on each worker.
 I dont think the spark memory is increasing over time.



 Here more details:

 **Spark: 2.4
 **operation: many spark sql statements followed by writing data to a
 nosql db from spark
 like this:
 df=read(fromhdfs)
 df2=spark.sql(using df 1)
 ..df10=spark.sql(using df9)
 spark.sql(CACHE TABLE df10)
 df11 =spark.sql(using df10)
 df11.write
 Df12 =spark.sql(using df10)
 df12.write
 df13 =spark.sql(using df10)
 df13.write
 **caching: yes one df that I will use to eventually write 3 x to a db
 (those 3 are different)
 **Loops: since I need to process several years, and processing 1 day is
 already a complex process (40 minutes on 9 node cluster running quite a bit
 of executors). So in the end it will do all at one go and there is a limit
 of how much data I can process in one go with the available resources.
 Some people here pointed out they believe this looping should not be
 necessary. But what is the alternative?
 —> Maybe I can write to disk somewhere in the middle, and read again
 from there so that in the end not all must happen in one go in memory.







 On 5 Apr 2022, at 14:58, Gourav 

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually faillure

2022-04-07 Thread Joris Billen
Thanks for active discussion and sharing your knowledge :-)


1. The cluster is a managed Hadoop cluster on Azure in the cloud. It has HBase, Spark, and HDFS shared.
2. HBase is on the cluster, so not standalone. It comes from an enterprise-level template from a commercial vendor, so I assume it is correctly installed.
3. I know it would be best to have a Spark cluster do the processing and then write to a separate HBase cluster.. but alas :-( somehow we found this to be buggy, so we have it all on one cluster.
4. S3: I am not using it, but people in the thread started suggesting potential 
solutions involving s3. It is an azure system, so hbase is stored on adls. In 
fact the nature of my application (geospatial stuff) requires me to use geomesa 
libs, which only allows directly writing from spark to hbase. So I can not 
write to some other format (the geomesa API is not designed for that-it only 
writes directly to hbase using the predetermined key/values).

Forgot to mention: I do unpersist my df that was cached.

Nevertheless I think I understand the problem now, this discussion is still 
interesting!
So the root cause is: the HBase region server has memory assigned to it (like 20GB). I see that when I start writing from Spark to HBase, not much of this is used at first. I have loops of processing 1 day at a time in Spark. With each loop, the regionserver heap fills a bit more. Since I also overcommitted memory in my cluster (the setup uses more than is really available), after several loops it starts to use more and more of the 20GB, and eventually the overall cluster starts to hit the memory that is available on the workers. The solution is to lower the HBase regionserver heap memory, so I'm not overcommitted anymore. In fact, high regionserver memory is more important when I read my data, since then it helps a lot to cache data and have faster reads. For writing it is not important to have such a high value.


Thanks,
Joris


On 7 Apr 2022, at 09:26, Mich Talebzadeh 
mailto:mich.talebza...@gmail.com>> wrote:

Ok so that is your assumption. The whole thing is based on-premise on JBOD 
(including hadoop cluster which has Spark binaries on each node as I 
understand) as I understand. But it will be faster to use S3 (or GCS) through 
some network and it will be faster than writing to the local SSD. I don't 
understand the point here.

Also it appears the thread owner is talking about having HBase on Hadoop 
cluster on some node eating memory.  This can be easily sorted by moving HBase 
to its own cluster, which will ease up Hadoop, Spark and HBase competing for 
resources. It is possible that the issue is with HBase setup as well.

HTH


 
   view my Linkedin 
profile

 
https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Thu, 7 Apr 2022 at 08:11, Bjørn Jørgensen 
mailto:bjornjorgen...@gmail.com>> wrote:

  1.  Where does S3 come into this

He is processing data for each day at a time. So to dump each day to a fast 
storage he can use parquet files and write it to S3.

ons. 6. apr. 2022 kl. 22:27 skrev Mich Talebzadeh 
mailto:mich.talebza...@gmail.com>>:

Your statement below:

I believe I have found the issue: the job writes data to hbase which is on the 
same cluster.
When I keep on processing data and writing with spark to hbase , eventually the 
garbage collection can not keep up anymore for hbase, and the hbase memory 
consumption increases. As the clusters hosts both hbase and spark, this leads 
to an overall increase and at some point you hit the limit of the available 
memory on each worker.
I dont think the spark memory is increasing over time.


  1.  Where is your cluster on Prem? Do you Have a Hadoop cluster with spark 
using the 

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually faillure

2022-04-07 Thread Mich Talebzadeh
Ok so that is your assumption. The whole thing is based on-premise on JBOD
(including hadoop cluster which has Spark binaries on each node as I
understand) as I understand. But it will be faster to use S3 (or GCS)
through some network and it will be faster than writing to the local SSD. I
don't understand the point here.

Also it appears the thread owner is talking about having HBase on Hadoop
cluster on some node eating memory.  This can be easily sorted by moving
HBase to its own cluster, which will ease up Hadoop, Spark and HBase
competing for resources. It is possible that the issue is with HBase setup
as well.

HTH



   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 7 Apr 2022 at 08:11, Bjørn Jørgensen 
wrote:

>
>1. Where does S3 come into this
>
> He is processing data for each day at a time. So to dump each day to a
> fast storage he can use parquet files and write it to S3.
>
> ons. 6. apr. 2022 kl. 22:27 skrev Mich Talebzadeh <
> mich.talebza...@gmail.com>:
>
>>
>> Your statement below:
>>
>>
>> I believe I have found the issue: the job writes data to hbase which is
>> on the same cluster.
>> When I keep on processing data and writing with spark to hbase ,
>> eventually the garbage collection can not keep up anymore for hbase, and
>> the hbase memory consumption increases. As the clusters hosts both hbase
>> and spark, this leads to an overall increase and at some point you hit the
>> limit of the available memory on each worker.
>> I dont think the spark memory is increasing over time.
>>
>>
>>1. Where is your cluster on Prem? Do you Have a Hadoop cluster
>>with spark using the same nodes as HDFS?
>>2. Is your Hbase clustered or standalone and has been created on HDFS
>>nodes
>>3. Are you writing to Hbase through phoenix or straight to HBase
>>4. Where does S3 come into this
>>
>>
>> HTH
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Wed, 6 Apr 2022 at 16:41, Joris Billen 
>> wrote:
>>
>>> HI,
>>> thanks for your reply.
>>>
>>>
>>> I believe I have found the issue: the job writes data to hbase which is
>>> on the same cluster.
>>> When I keep on processing data and writing with spark to hbase ,
>>> eventually the garbage collection can not keep up anymore for hbase, and
>>> the hbase memory consumption increases. As the clusters hosts both hbase
>>> and spark, this leads to an overall increase and at some point you hit the
>>> limit of the available memory on each worker.
>>> I dont think the spark memory is increasing over time.
>>>
>>>
>>>
>>> Here more details:
>>>
>>> **Spark: 2.4
>>> **operation: many spark sql statements followed by writing data to a
>>> nosql db from spark
>>> like this:
>>> df=read(fromhdfs)
>>> df2=spark.sql(using df 1)
>>> ..df10=spark.sql(using df9)
>>> spark.sql(CACHE TABLE df10)
>>> df11 =spark.sql(using df10)
>>> df11.write
>>> Df12 =spark.sql(using df10)
>>> df12.write
>>> df13 =spark.sql(using df10)
>>> df13.write
>>> **caching: yes one df that I will use to eventually write 3 x to a db
>>> (those 3 are different)
>>> **Loops: since I need to process several years, and processing 1 day is
>>> already a complex process (40 minutes on 9 node cluster running quite a bit
>>> of executors). So in the end it will do all at one go and there is a limit
>>> of how much data I can process in one go with the available resources.
>>> Some people here pointed out they believe this looping should not be
>>> necessary. But what is the alternative?
>>> —> Maybe I can write to disk somewhere in the middle, and read again
>>> from there so that in the end not all must happen in one go in memory.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 5 Apr 2022, at 14:58, Gourav Sengupta 
>>> wrote:
>>>
>>> Hi,
>>>
>>> can you please give details around:
>>> spark version, what is the operation that you are running, why in loops,
>>> and whether you are caching in any data or not, and whether you are
>>> referencing the variables to create them like in the following expression
>>> we are referencing x to create x, x = x + 1
>>>
>>> Thanks and Regards,
>>> Gourav Sengupta
>>>
>>> 

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually faillure

2022-04-07 Thread Bjørn Jørgensen
   1. Where does S3 come into this

He is processing data one day at a time. So to dump each day to fast storage he can use Parquet files and write them to S3.
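
A minimal sketch (not from the thread) of that per-day staging idea in Java: materialize the shared intermediate to Parquet once per day, read it back, run the downstream writes from the staged copy, then clear the cache before the next day. Paths, SQL and the Parquet outputs are placeholders; in the real job the final writes go to HBase via GeoMesa:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DailyLoopSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("daily-loop-sketch").getOrCreate();

        // Placeholder list of days; each iteration could just as well be its own spark-submit.
        String[] days = {"2022-04-01", "2022-04-02", "2022-04-03"};

        for (String day : days) {
            Dataset<Row> raw = spark.read().parquet("hdfs:///input/day=" + day);
            raw.createOrReplaceTempView("raw");

            // Stand-in for the chain of SQL transformations (df2 .. df10).
            Dataset<Row> df10 = spark.sql("SELECT * FROM raw /* complex transformations */");

            // Materialize the shared intermediate to cheap storage instead of only caching it.
            String stagePath = "s3a://bucket/stage/day=" + day; // or an abfss:// path on Azure
            df10.write().mode("overwrite").parquet(stagePath);

            // Downstream writes reuse the on-disk result, not the full lineage.
            Dataset<Row> staged = spark.read().parquet(stagePath);
            staged.createOrReplaceTempView("staged");
            spark.sql("SELECT * FROM staged /* write 1 */").write().mode("overwrite").parquet("out1/day=" + day);
            spark.sql("SELECT * FROM staged /* write 2 */").write().mode("overwrite").parquet("out2/day=" + day);
            spark.sql("SELECT * FROM staged /* write 3 */").write().mode("overwrite").parquet("out3/day=" + day);

            // Drop anything cached so nothing leaks into the next day's iteration.
            spark.catalog().clearCache();
        }
        spark.stop();
    }
}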

ons. 6. apr. 2022 kl. 22:27 skrev Mich Talebzadeh :

>
> Your statement below:
>
>
> I believe I have found the issue: the job writes data to hbase which is on
> the same cluster.
> When I keep on processing data and writing with spark to hbase ,
> eventually the garbage collection can not keep up anymore for hbase, and
> the hbase memory consumption increases. As the clusters hosts both hbase
> and spark, this leads to an overall increase and at some point you hit the
> limit of the available memory on each worker.
> I dont think the spark memory is increasing over time.
>
>
>1. Where is your cluster on Prem? Do you Have a Hadoop cluster
>with spark using the same nodes as HDFS?
>2. Is your Hbase clustered or standalone and has been created on HDFS
>nodes
>3. Are you writing to Hbase through phoenix or straight to HBase
>4. Where does S3 come into this
>
>
> HTH
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Wed, 6 Apr 2022 at 16:41, Joris Billen 
> wrote:
>
>> HI,
>> thanks for your reply.
>>
>>
>> I believe I have found the issue: the job writes data to hbase which is
>> on the same cluster.
>> When I keep on processing data and writing with spark to hbase ,
>> eventually the garbage collection can not keep up anymore for hbase, and
>> the hbase memory consumption increases. As the clusters hosts both hbase
>> and spark, this leads to an overall increase and at some point you hit the
>> limit of the available memory on each worker.
>> I dont think the spark memory is increasing over time.
>>
>>
>>
>> Here more details:
>>
>> **Spark: 2.4
>> **operation: many spark sql statements followed by writing data to a
>> nosql db from spark
>> like this:
>> df=read(fromhdfs)
>> df2=spark.sql(using df 1)
>> ..df10=spark.sql(using df9)
>> spark.sql(CACHE TABLE df10)
>> df11 =spark.sql(using df10)
>> df11.write
>> Df12 =spark.sql(using df10)
>> df12.write
>> df13 =spark.sql(using df10)
>> df13.write
>> **caching: yes one df that I will use to eventually write 3 x to a db
>> (those 3 are different)
>> **Loops: since I need to process several years, and processing 1 day is
>> already a complex process (40 minutes on 9 node cluster running quite a bit
>> of executors). So in the end it will do all at one go and there is a limit
>> of how much data I can process in one go with the available resources.
>> Some people here pointed out they believe this looping should not be
>> necessary. But what is the alternative?
>> —> Maybe I can write to disk somewhere in the middle, and read again from
>> there so that in the end not all must happen in one go in memory.
>>
>>
>>
>>
>>
>>
>>
>> On 5 Apr 2022, at 14:58, Gourav Sengupta 
>> wrote:
>>
>> Hi,
>>
>> can you please give details around:
>> spark version, what is the operation that you are running, why in loops,
>> and whether you are caching in any data or not, and whether you are
>> referencing the variables to create them like in the following expression
>> we are referencing x to create x, x = x + 1
>>
>> Thanks and Regards,
>> Gourav Sengupta
>>
>> On Mon, Apr 4, 2022 at 10:51 AM Joris Billen <
>> joris.bil...@bigindustries.be> wrote:
>>
>>> Clear-probably not a good idea.
>>>
>>> But a previous comment said “you are doing everything in the end in one
>>> go”.
>>> So this made me wonder: in case your only action is a write in the end
>>> after lots of complex transformations, then what is the alternative for
>>> writing in the end which means doing everything all at once in the end? My
>>> understanding is that if there is no need for an action earlier, you will
>>> do all at the end, which means there is a limitation to how many days you
>>> can process at once. And hence the solution is to loop over a couple days,
>>> and submit always the same spark job just for other input.
>>>
>>>
>>> Thanks!
>>>
>>> On 1 Apr 2022, at 15:26, Sean Owen  wrote:
>>>
>>> This feels like premature optimization, and not clear it's optimizing,
>>> but maybe.
>>> Caching things that are used once is worse than not caching. It looks
>>> like a straight-line through to the write, so I doubt caching helps
>>> anything here.
>>>
>>> On Fri, Apr 1, 2022 at 2:49 AM Joris Billen <
>>> joris.bil...@bigindustries.be> wrote:
>>>
 Hi,
 as said thanks for little discussion over mail.
 I understand that the action is triggered in the end at the write