Re: Spark Thriftserver is failing when submitting a command from beeline

2021-08-20 Thread Artemis User
Looks like your problem is related to not setting up a hive-site.xml file 
properly.  The standard Spark distribution doesn't include a hive-site.xml 
template in the conf directory, so you will have to create one yourself.  
Please refer to the Spark user doc and the Hive metastore configuration 
guide for details...
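
As an illustration only, here is a minimal sketch of pointing a Spark
application at an existing Hive metastore; the metastore host, port and
warehouse path below are placeholders (not values from this thread), and for
the Thrift server the same properties would normally go into conf/hive-site.xml:

import org.apache.spark.sql.SparkSession

// Hypothetical values -- substitute your own metastore URI and warehouse dir.
val spark = SparkSession.builder()
  .appName("metastore-connectivity-check")
  .config("hive.metastore.uris", "thrift://metastore-host.example.com:9083")
  .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
  .enableHiveSupport()
  .getOrCreate()

// If the metastore (and the Kerberos login) are set up correctly,
// this should list the databases visible through the metastore.
spark.sql("SHOW DATABASES").show()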


-- ND

On 8/20/21 9:50 AM, Pralabh Kumar wrote:

Hi Dev

Environment details

Hadoop 3.2
Hive 3.1
Spark 3.0.3

Cluster: Kerberized.

1) Hive server is running fine
2) Spark SQL, spark-shell and spark-submit are all working as expected.
3) Connecting Hive through beeline is working fine (after kinit)
beeline -u "jdbc:hive2://:/default;principal=part principal>


Now I launched the Spark thrift server and tried to connect to it through beeline.

The beeline client connects to STS perfectly.

4) beeline -u "jdbc:hive2://:/default;principal=part principal>

   a) Log says connected to
       Spark SQL
       Driver: Hive JDBC


Now when I run any command (e.g. "show tables") it fails.  The log in STS says:

21/08/19 19:30:12 DEBUG UserGroupInformation: PrivilegedAction 
as: (auth:PROXY) via  (auth:KERBEROS) 
from:org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Client.createClientTransport(HadoopThriftAuthBridge.java:208)
21/08/19 19:30:12 DEBUG UserGroupInformation: PrivilegedAction 
as: (auth:PROXY) via  (auth:KERBEROS) 
from:org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
21/08/19 19:30:12 DEBUG TSaslTransport: opening transport 
org.apache.thrift.transport.TSaslClientTransport@f43fd2f

21/08/19 19:30:12 ERROR TSaslTransport: SASL negotiation failure
javax.security.sasl.SaslException: GSS initiate failed [Caused by 
GSSException: No valid credentials provided (Mechanism level: Failed 
to find any Kerberos tgt)]
at 
com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
at 
org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:95)
at 
org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
at 
org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:38)
at 
org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
at 
org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)

at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at 
org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:480)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:247)
at 
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:70)

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at 
org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1707)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:83)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:133)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
at 
org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3600)

at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3652)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3632)
at org.apache.hadoop.hive.ql.metadata.Hive.getDatabase(Hive.java:1556)
at org.apache.hadoop.hive.ql.metadata.Hive.databaseExists(Hive.java:1545)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$databaseExists$1(HiveClientImpl.scala:384)




My guess is that authorization through the proxy user is not working.



Please help


Regards
Pralabh Kumar





Re: Is memory-only no-disk Spark possible?

2021-08-20 Thread Mich Talebzadeh
Thanks

Sounds like you experimented on-prem with HDFS and Spark using the same
host nodes with data affinity. I am not sure that is something I can sell in
a banking environment, so to speak. The bottom line is that it will boil down
to procuring more tin boxes on-prem to give Spark more memory, assuming that
is needed.

From my experience, configuring Spark and allocating enough YARN memory will
be a better option in the long term. However, this is now all water under
the bridge, so to speak, as we are now moving to Cloud, Spark on Kubernetes
and all that.
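
As a purely illustrative sketch of what "allocating enough YARN memory" can
look like (the sizes and app name are placeholders, not recommendations from
this thread; in practice these settings are usually passed via spark-submit):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("memory-sizing-sketch")
  .config("spark.executor.instances", "10")
  .config("spark.executor.memory", "8g")          // JVM heap per executor
  .config("spark.executor.memoryOverhead", "2g")  // off-heap/native headroom on YARN
  .config("spark.memory.fraction", "0.6")         // heap share for execution + storage
  .getOrCreate()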

Mich



   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 20 Aug 2021 at 16:37, Bobby Evans  wrote:

> Yes, this is very much "use at your own risk".  That said at Yahoo we did
> something very similar to this on all of the YARN nodes and saw a decent
> performance uplift.  This was even with HDFS running on the same nodes. I
> think we just changed the time to flush to 30 mins, but it was a long time
> ago so the details are a bit fuzzy. Spark will guarantee that the data has
> been handed off to the OS. The OS should not lose the data unless there is
> a bug in the OS or the entire node crashes.  If the node crashes all the
> containers will go down with it.  YARN has no way to recover a crashed
> container, it just reruns it from the beginning, so it does not matter if
> the data is lost when it crashes.
>
> I don't remember all of the details for what we did around software
> upgrade or RAID setups for the root/os partitions. You need to be careful
> if you are modifying anything that is not ephemeral on the node, but most
> things should be ephemeral in production on these types of systems. You
> might mess up the file system a bit on a crash and need to run fsck to
> recover, but we had that automated and most groups now run on VMs anyways
> so throw away the old VM and start over.  Especially if crashes are rare,
> and you might have a lot of memory sitting idle.
>
> On Fri, Aug 20, 2021 at 10:05 AM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi Bobby,
>>
>> On this statement of yours if I may:
>>
>> ... If you really want to you can configure the pagecache to not spill to
>> disk until absolutely necessary. That should get you really close to pure
>> in-memory processing, so long as you have enough free memory on the host to
>> support it.
>>
>> I would not particularly recommend that, bearing in mind that as we are
>> dealing with edge cases, in case of error recovering from edge cases can be
>> more costly than using disk space.
>>
>>
>> HTH
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 20 Aug 2021 at 15:28, Bobby Evans  wrote:
>>
>>> On the data path, Spark will write to a local disk when it runs out of
>>> memory and needs to spill or when doing a shuffle with the default shuffle
>>> implementation.  The spilling is a good thing because it lets you process
>>> data that is too large to fit in memory.  It is not great because the
>>> processing slows down a lot when that happens, but slow is better than
>>> crashing in many cases. The default shuffle implementation will
>>> always write out to disk.  This again is good in that it allows you to
>>> process more data on a single box than can fit in memory. It is bad when
>>> the shuffle data could fit in memory, but ends up being written to disk
>>> anyways.  On Linux the data is being written into the page cache and will
>>> be flushed to disk in the background when memory is needed or after a set
>>> amount of time. If your query is fast and is shuffling little data, then it
>>> is likely that your query is running all in memory.  All of the shuffle
>>> reads and writes are probably going directly to the page cache and the disk
>>> is not involved at all. If you really want to you can configure the
>>> pagecache to not spill to disk until absolutely necessary. That should get
>>> you really close to pure in-memory processing, so long as you have enough
>>> free memory on the host to support it.
>>>
>>> Bobby
>>>
>>>
>>>
>>> On Fri, Aug 20, 2021 at 7:57 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Well I don't know what having an "in-memory Spark only" is going to
 

How can I use sparkContext.addFile

2021-08-20 Thread igyu
in spark-shell
I can run 

val url = "hdfs://nameservice1/user/jztwk/config.json"
spark.sparkContext.addFile(url)
val json_str = readLocalFile(SparkFiles.get(url.split("/").last))

but when I build a jar package and submit it:

spark-submit --master yarn --deploy-mode cluster --principal 
jztwk/had...@join.com --keytab /hadoop/app/jztwk.keytab --class 
com.join.Synctool --jars hdfs://nameservice1/sparklib/* 
jztsynctools-1.0-SNAPSHOT.jar

I get an error:

 ERROR yarn.Client: Application diagnostics message: User class threw 
exception: java.io.FileNotFoundException: 
/hadoop/yarn/nm1/usercache/jztwk/appcache/application_1627287887991_0571/spark-020a769c-6d9c-42ff-9bb2-1407cf6ed0bc/userFiles-1f57a3ed-22fa-4464-84e4-e549685b0d2d/hadoop/yarn/nm1/usercache/jztwk/appcache/application_1627287887991_0571/spark-020a769c-6d9c-42ff-9bb2-1407cf6ed0bc/userFiles-1f57a3ed-22fa-4464-84e4-e549685b0d2d/config.json
 (No such file or directory)
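
For comparison, here is a minimal self-contained sketch of the addFile /
SparkFiles.get pattern in a compiled application. The object name and the
replacement for readLocalFile are hypothetical; the key point is that only
the bare file name (not the full URL or path) is passed to SparkFiles.get:

import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession
import scala.io.Source

object AddFileSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("addfile-sketch").getOrCreate()

    val url = "hdfs://nameservice1/user/jztwk/config.json"
    spark.sparkContext.addFile(url)

    // SparkFiles.get expects just the file name and returns the local path
    // where Spark downloaded the file on the current node.
    val localPath = SparkFiles.get(url.split("/").last)
    val jsonStr = Source.fromFile(localPath).mkString
    println(jsonStr)

    spark.stop()
  }
}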




but 



igyu


Re: Is memory-only no-disk Spark possible?

2021-08-20 Thread Bobby Evans
Yes, this is very much "use at your own risk".  That said at Yahoo we did
something very similar to this on all of the YARN nodes and saw a decent
performance uplift.  This was even with HDFS running on the same nodes. I
think we just changed the time to flush to 30 mins, but it was a long time
ago so the details are a bit fuzzy. Spark will guarantee that the data has
been handed off to the OS. The OS should not lose the data unless there is
a bug in the OS or the entire node crashes.  If the node crashes all the
containers will go down with it.  YARN has no way to recover a crashed
container, it just reruns it from the beginning, so it does not matter if
the data is lost when it crashes.

I don't remember all of the details for what we did around software
upgrade or RAID setups for the root/os partitions. You need to be careful
if you are modifying anything that is not ephemeral on the node, but most
things should be ephemeral in production on these types of systems. You
might mess up the file system a bit on a crash and need to run fsck to
recover, but we had that automated and most groups now run on VMs anyways
so throw away the old VM and start over.  Especially if crashes are rare,
and you might have a lot of memory sitting idle.

On Fri, Aug 20, 2021 at 10:05 AM Mich Talebzadeh 
wrote:

> Hi Bobby,
>
> On this statement of yours if I may:
>
> ... If you really want to you can configure the pagecache to not spill to
> disk until absolutely necessary. That should get you really close to pure
> in-memory processing, so long as you have enough free memory on the host to
> support it.
>
> I would not particularly recommend that, bearing in mind that as we are
> dealing with edge cases, in case of error recovering from edge cases can be
> more costly than using disk space.
>
>
> HTH
>
>
>view my Linkedin profile
> 
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 20 Aug 2021 at 15:28, Bobby Evans  wrote:
>
>> On the data path, Spark will write to a local disk when it runs out of
>> memory and needs to spill or when doing a shuffle with the default shuffle
>> implementation.  The spilling is a good thing because it lets you process
>> data that is too large to fit in memory.  It is not great because the
>> processing slows down a lot when that happens, but slow is better than
>> crashing in many cases. The default shuffle implementation will
>> always write out to disk.  This again is good in that it allows you to
>> process more data on a single box than can fit in memory. It is bad when
>> the shuffle data could fit in memory, but ends up being written to disk
>> anyways.  On Linux the data is being written into the page cache and will
>> be flushed to disk in the background when memory is needed or after a set
>> amount of time. If your query is fast and is shuffling little data, then it
>> is likely that your query is running all in memory.  All of the shuffle
>> reads and writes are probably going directly to the page cache and the disk
>> is not involved at all. If you really want to you can configure the
>> pagecache to not spill to disk until absolutely necessary. That should get
>> you really close to pure in-memory processing, so long as you have enough
>> free memory on the host to support it.
>>
>> Bobby
>>
>>
>>
>> On Fri, Aug 20, 2021 at 7:57 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Well I don't know what having an "in-memory Spark only" is going to
>>> achieve. Spark GUI shows the amount of disk usage pretty well. The memory
>>> is used exclusively by default first.
>>>
>>> Spark is no different from a predominantly in-memory application.
>>> Effectively it is doing the classical disk based hadoop  map-reduce
>>> operation "in memory" to speed up the processing but it is still an
>>> application on top of the OS.  So like most applications, there is a state
>>> of Spark, the code running and the OS(s), where disk usage will be needed.
>>>
>>> This is akin to swap space on OS itself and I quote "Swap space is used when
>>> your operating system decides that it needs physical memory for active
>>> processes and the amount of available (unused) physical memory is
>>> insufficient. When this happens, inactive pages from the physical
>>> memory are then moved into the swap space, freeing up that physical memory
>>> for other uses"
>>>
>>>  free
>>>                total        used        free      shared  buff/cache   available
>>> Mem:        65659732    30116700     1429436     2341772    34113596    32665372
>>> Swap:      104857596      550912   104306684
>>>
>>> HTH
>>>
>>>
>>>view my Linkedin 

Re: Is memory-only no-disk Spark possible?

2021-08-20 Thread Mich Talebzadeh
Hi Bobby,

On this statement of yours if I may:

... If you really want to you can configure the pagecache to not spill to
disk until absolutely necessary. That should get you really close to pure
in-memory processing, so long as you have enough free memory on the host to
support it.

I would not particularly recommend that, bearing in mind that as we are
dealing with edge cases, in case of error recovering from edge cases can be
more costly than using disk space.


HTH


   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 20 Aug 2021 at 15:28, Bobby Evans  wrote:

> On the data path, Spark will write to a local disk when it runs out of
> memory and needs to spill or when doing a shuffle with the default shuffle
> implementation.  The spilling is a good thing because it lets you process
> data that is too large to fit in memory.  It is not great because the
> processing slows down a lot when that happens, but slow is better than
> crashing in many cases. The default shuffle implementation will
> always write out to disk.  This again is good in that it allows you to
> process more data on a single box than can fit in memory. It is bad when
> the shuffle data could fit in memory, but ends up being written to disk
> anyways.  On Linux the data is being written into the page cache and will
> be flushed to disk in the background when memory is needed or after a set
> amount of time. If your query is fast and is shuffling little data, then it
> is likely that your query is running all in memory.  All of the shuffle
> reads and writes are probably going directly to the page cache and the disk
> is not involved at all. If you really want to you can configure the
> pagecache to not spill to disk until absolutely necessary. That should get
> you really close to pure in-memory processing, so long as you have enough
> free memory on the host to support it.
>
> Bobby
>
>
>
> On Fri, Aug 20, 2021 at 7:57 AM Mich Talebzadeh 
> wrote:
>
>> Well I don't know what having an "in-memory Spark only" is going to
>> achieve. Spark GUI shows the amount of disk usage pretty well. The memory
>> is used exclusively by default first.
>>
>> Spark is no different from a predominantly in-memory application.
>> Effectively it is doing the classical disk based hadoop  map-reduce
>> operation "in memory" to speed up the processing but it is still an
>> application on top of the OS.  So like most applications, there is a state
>> of Spark, the code running and the OS(s), where disk usage will be needed.
>>
>> This is akin to swap space on OS itself and I quote "Swap space is used when
>> your operating system decides that it needs physical memory for active
>> processes and the amount of available (unused) physical memory is
>> insufficient. When this happens, inactive pages from the physical memory
>> are then moved into the swap space, freeing up that physical memory for
>> other uses"
>>
>>  free
>>                total        used        free      shared  buff/cache   available
>> Mem:        65659732    30116700     1429436     2341772    34113596    32665372
>> Swap:      104857596      550912   104306684
>>
>> HTH
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 20 Aug 2021 at 12:50, Jacek Laskowski  wrote:
>>
>>> Hi,
>>>
>>> I've been exploring BlockManager and the stores for a while now and am
>>> tempted to say that a memory-only Spark setup would be possible (except
>>> shuffle blocks). Is this correct?
>>>
>>> What about shuffle blocks? Do they have to be stored on disk (in
>>> DiskStore)?
>>>
>>> I think broadcast variables are in-memory first so except on-disk
>>> storage level explicitly used (by Spark devs), there's no reason not to
>>> have Spark in-memory only.
>>>
>>> (I was told that one of the differences between Trino/Presto vs Spark
>>> SQL is that Trino keeps all processing in-memory only and will blow up
>>> while Spark uses disk to avoid OOMEs).
>>>
>>> Pozdrawiam,
>>> Jacek Laskowski
>>> 
>>> https://about.me/JacekLaskowski
>>> "The Internals Of" Online Books 
>>> Follow me on https://twitter.com/jaceklaskowski
>>>
>>> 
>>>
>>


Re: Is memory-only no-disk Spark possible?

2021-08-20 Thread Bobby Evans
On the data path, Spark will write to a local disk when it runs out of
memory and needs to spill or when doing a shuffle with the default shuffle
implementation.  The spilling is a good thing because it lets you process
data that is too large to fit in memory.  It is not great because the
processing slows down a lot when that happens, but slow is better than
crashing in many cases. The default shuffle implementation will
always write out to disk.  This again is good in that it allows you to
process more data on a single box than can fit in memory. It is bad when
the shuffle data could fit in memory, but ends up being written to disk
anyways.  On Linux the data is being written into the page cache and will
be flushed to disk in the background when memory is needed or after a set
amount of time. If your query is fast and is shuffling little data, then it
is likely that your query is running all in memory.  All of the shuffle
reads and writes are probably going directly to the page cache and the disk
is not involved at all. If you really want to you can configure the
pagecache to not spill to disk until absolutely necessary. That should get
you really close to pure in-memory processing, so long as you have enough
free memory on the host to support it.
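
Purely as an illustrative sketch (not something suggested in this thread):
spark.local.dir controls where spill and shuffle files are written, so if a
RAM-backed tmpfs such as /dev/shm is available and large enough, pointing the
local dirs at it keeps that data in memory at the cost of RAM. The path below
is an assumption:

import org.apache.spark.sql.SparkSession

// Assumption: /dev/shm (tmpfs) exists and has room for spill/shuffle data.
// spark.local.dir is normally set at launch time (spark-defaults.conf or
// spark-submit); it is shown inline here only for illustration.
val spark = SparkSession.builder()
  .appName("ramdisk-local-dir-sketch")
  .config("spark.local.dir", "/dev/shm/spark-local")
  .getOrCreate()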

Bobby



On Fri, Aug 20, 2021 at 7:57 AM Mich Talebzadeh 
wrote:

> Well I don't know what having an "in-memory Spark only" is going to
> achieve. Spark GUI shows the amount of disk usage pretty well. The memory
> is used exclusively by default first.
>
> Spark is no different from a predominantly in-memory application.
> Effectively it is doing the classical disk based hadoop  map-reduce
> operation "in memory" to speed up the processing but it is still an
> application on top of the OS.  So like most applications, there is a state
> of Spark, the code running and the OS(s), where disk usage will be needed.
>
> This is akin to swap space on OS itself and I quote "Swap space is used when
> your operating system decides that it needs physical memory for active
> processes and the amount of available (unused) physical memory is
> insufficient. When this happens, inactive pages from the physical memory
> are then moved into the swap space, freeing up that physical memory for
> other uses"
>
>  free
>                total        used        free      shared  buff/cache   available
> Mem:        65659732    30116700     1429436     2341772    34113596    32665372
> Swap:      104857596      550912   104306684
>
> HTH
>
>
>view my Linkedin profile
> 
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 20 Aug 2021 at 12:50, Jacek Laskowski  wrote:
>
>> Hi,
>>
>> I've been exploring BlockManager and the stores for a while now and am
>> tempted to say that a memory-only Spark setup would be possible (except
>> shuffle blocks). Is this correct?
>>
>> What about shuffle blocks? Do they have to be stored on disk (in
>> DiskStore)?
>>
>> I think broadcast variables are in-memory first so except on-disk storage
>> level explicitly used (by Spark devs), there's no reason not to have Spark
>> in-memory only.
>>
>> (I was told that one of the differences between Trino/Presto vs Spark SQL
>> is that Trino keeps all processing in-memory only and will blow up while
>> Spark uses disk to avoid OOMEs).
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://about.me/JacekLaskowski
>> "The Internals Of" Online Books 
>> Follow me on https://twitter.com/jaceklaskowski
>>
>> 
>>
>


Spark Thriftserver is failing when submitting a command from beeline

2021-08-20 Thread Pralabh Kumar
Hi Dev

Environment details

Hadoop 3.2
Hive 3.1
Spark 3.0.3

Cluster: Kerberized.

1) Hive server is running fine
2) Spark SQL, spark-shell and spark-submit are all working as expected.
3) Connecting Hive through beeline is working fine (after kinit)
beeline -u "jdbc:hive2://:/default;principal=

Now I launched the Spark thrift server and tried to connect to it through beeline.

The beeline client connects to STS perfectly.

4) beeline -u "jdbc:hive2://:/default;principal=
   a) Log says connected to
   Spark SQL
   Driver: Hive JDBC


Now when I run any command (e.g. "show tables") it fails.  The log in STS says:

21/08/19 19:30:12 DEBUG UserGroupInformation: PrivilegedAction as:
 (auth:PROXY) via  (auth:KERBEROS)
from:org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Client.createClientTransport(HadoopThriftAuthBridge.java:208)
21/08/19 19:30:12 DEBUG UserGroupInformation: PrivilegedAction as:
 (auth:PROXY) via  (auth:KERBEROS)
from:org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
21/08/19 19:30:12 DEBUG TSaslTransport: opening transport
org.apache.thrift.transport.TSaslClientTransport@f43fd2f
21/08/19 19:30:12 ERROR TSaslTransport: SASL negotiation failure
javax.security.sasl.SaslException: GSS initiate failed [Caused by
GSSException: No valid credentials provided (Mechanism level: Failed to
find any Kerberos tgt)]
at
com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
at
org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:95)
at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
at
org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:38)
at
org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
at
org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at
org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
at
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:480)
at
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:247)
at
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:70)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at
org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1707)
at
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:83)
at
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:133)
at
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
at
org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3600)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3652)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3632)
at org.apache.hadoop.hive.ql.metadata.Hive.getDatabase(Hive.java:1556)
at org.apache.hadoop.hive.ql.metadata.Hive.databaseExists(Hive.java:1545)
at
org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$databaseExists$1(HiveClientImpl.scala:384)




My guess is that authorization through the proxy user is not working.



Please help


Regards
Pralabh Kumar


Re: Is memory-only no-disk Spark possible?

2021-08-20 Thread Mich Talebzadeh
Well I don't know what having an "in-memory Spark only" is going to
achieve. Spark GUI shows the amount of disk usage pretty well. The memory
is used exclusively by default first.

Spark is no different from a predominantly in-memory application.
Effectively it is doing the classical disk based hadoop  map-reduce
operation "in memory" to speed up the processing but it is still an
application on top of the OS.  So like most applications, there is a state
of Spark, the code running and the OS(s), where disk usage will be needed.

This is akin to swap space on OS itself and I quote "Swap space is used when
your operating system decides that it needs physical memory for active
processes and the amount of available (unused) physical memory is
insufficient. When this happens, inactive pages from the physical memory
are then moved into the swap space, freeing up that physical memory for
other uses"

 free
               total        used        free      shared  buff/cache   available
Mem:        65659732    30116700     1429436     2341772    34113596    32665372
Swap:      104857596      550912   104306684

HTH


   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 20 Aug 2021 at 12:50, Jacek Laskowski  wrote:

> Hi,
>
> I've been exploring BlockManager and the stores for a while now and am
> tempted to say that a memory-only Spark setup would be possible (except
> shuffle blocks). Is this correct?
>
> What about shuffle blocks? Do they have to be stored on disk (in
> DiskStore)?
>
> I think broadcast variables are in-memory first so except on-disk storage
> level explicitly used (by Spark devs), there's no reason not to have Spark
> in-memory only.
>
> (I was told that one of the differences between Trino/Presto vs Spark SQL
> is that Trino keeps all processing in-memory only and will blow up while
> Spark uses disk to avoid OOMEs).
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> "The Internals Of" Online Books 
> Follow me on https://twitter.com/jaceklaskowski
>
> 
>


Re: Is memory-only no-disk Spark possible? [Marketing Mail]

2021-08-20 Thread Jack Kolokasis

Hello Jacek,

On 20/8/21 2:49 p.m., Jacek Laskowski wrote:

Hi,

I've been exploring BlockManager and the stores for a while now and am 
tempted to say that a memory-only Spark setup would be possible 
(except shuffle blocks). Is this correct?

Correct.


What about shuffle blocks? Do they have to be stored on disk (in 
DiskStore)?

Well, by default Spark stores shuffle blocks on disk.


I think broadcast variables are in-memory first so except on-disk 
storage level explicitly used (by Spark devs), there's no reason not 
to have Spark in-memory only.


(I was told that one of the differences between Trino/Presto vs Spark 
SQL is that Trino keeps all processing in-memory only and will blow up 
while Spark uses disk to avoid OOMEs).


Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski 
"The Internals Of" Online Books 
Follow me on https://twitter.com/jaceklaskowski 





Best,
Iacovos


Is memory-only no-disk Spark possible?

2021-08-20 Thread Jacek Laskowski
Hi,

I've been exploring BlockManager and the stores for a while now and am
tempted to say that a memory-only Spark setup would be possible (except
shuffle blocks). Is this correct?

What about shuffle blocks? Do they have to be stored on disk (in DiskStore)?

I think broadcast variables are in-memory first so except on-disk storage
level explicitly used (by Spark devs), there's no reason not to have Spark
in-memory only.
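
As a small sketch of what "on-disk storage level explicitly used" means in
user code (the dataset here is just a placeholder):

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("storage-level-sketch").getOrCreate()
val df = spark.range(0, 1000000)   // placeholder dataset

// Kept strictly in memory; partitions that don't fit are recomputed, not spilled.
df.persist(StorageLevel.MEMORY_ONLY)

// The default for Dataset.cache() spills partitions to local disk when needed:
// df.persist(StorageLevel.MEMORY_AND_DISK)

df.count()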

(I was told that one of the differences between Trino/Presto vs Spark SQL
is that Trino keeps all processing in-memory only and will blow up while
Spark uses disk to avoid OOMEs).

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books 
Follow me on https://twitter.com/jaceklaskowski