Why does SparkSQL change the table owner when performing alter table operations?

2018-03-12 Thread 张万新
Hi,

When using spark.sql() to perform alter table operations, I found that Spark
changes the table owner property to the execution user. I then dug into
the source code and found that in HiveClientImpl, the alterTable function
sets the owner of the table to the current execution user. Besides, some
other operations like alter partition, getPartitionOption and so on do the
same. Can some experts explain why we do this? What if we just behaved
like Hive, which does not change the owner when doing these kinds of
operations?
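
For anyone who wants to observe this, a minimal sketch in spark-shell; the
table name and property are hypothetical, and it assumes an existing Hive
table originally owned by a different user:

    // Run as a user who is not the current owner of the (hypothetical) table
    spark.sql("ALTER TABLE mydb.mytable SET TBLPROPERTIES ('touched_by_spark' = 'true')")

    // Per the behavior described above, the Owner row in the detailed table
    // information should now report the executing user
    spark.sql("DESCRIBE FORMATTED mydb.mytable").show(100, truncate = false)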


Re: How to run spark shell using YARN

2018-03-12 Thread vermanurag
This does not look like a Spark error. It looks like YARN has not been able to
allocate resources for the Spark driver. If you check the Resource Manager UI you
are likely to see the Spark application waiting for resources. Try
reducing the driver memory and/or other bottlenecks based on what you
see in the Resource Manager UI.
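
For example, a hedged starting point is to lower the requested resources
explicitly; the values below are only illustrative and depend on what your
queue actually has free:

    # Values are illustrative; in client mode the driver JVM runs locally,
    # while YARN allocates the AM and executor containers
    ./spark-shell --master yarn --deploy-mode client \
        --driver-memory 1g \
        --executor-memory 1g \
        --num-executors 2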






Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-03-12 Thread Tathagata Das
You have understood the problem right. However, note that your
interpretation of the output *(K, leftValue, null), (K, leftValue,
rightValue1), (K, leftValue, rightValue2)* relies on knowing the
semantics of the join. That is, if you are processing the output rows
*manually*, you are aware that the operator is a join and can
interpret the semantics as *"the null is replaced by the first match, and
all further matches are just additional rows"*. This is, however, not a
general solution for any sink or any operator. We need to find a way to
expose these semantics through the APIs such that a sink can use them
without knowing exactly which operator in the query is writing to the sink.
Therefore we still need some work before we can support joins in update
mode cleanly.

Hope that makes it clear. :)
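
For reference, the append-mode stream-stream left outer join that did ship
in 2.3 looks roughly like the following; the stream names, column names, and
watermark/time-range values are illustrative, not taken from this thread:

    import org.apache.spark.sql.functions.expr

    // impressionStream and clickStream are assumed to be pre-existing streaming DataFrames
    val impressions = impressionStream.withWatermark("impressionTime", "2 hours")
    val clicks      = clickStream.withWatermark("clickTime", "3 hours")

    // Unmatched left rows are emitted with nulls only once the watermark and the
    // time-range condition guarantee that no matching click can still arrive
    val joined = impressions.join(
      clicks,
      expr("""
        clickAdId = impressionAdId AND
        clickTime >= impressionTime AND
        clickTime <= impressionTime + interval 1 hour
      """),
      "leftOuter")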

On Sat, Mar 10, 2018 at 12:14 AM, kant kodali  wrote:

> I will give an attempt to answer this.
>
> since rightValue1 and rightValue2 have the same key "K" (two matches), why
> would it ever be the case that *rightValue2* replaces *rightValue1*, which
> replaces *null*? Moreover, why does the user need to care?
>
> The result in this case (after getting 2 matches) should be
>
> *(K, leftValue, rightValue1)*
> *(K, leftValue, rightValue2)*
>
> This basically means only one of them replaced the earlier null. Which one
> of the two? Whichever arrived first. In other words, nulls will be replaced
> by the first matching row; subsequently, if there is a new matching row it
> will just be another row with the same key in the result table, or if
> there is a new unmatched row then the result table should have nulls for the
> unmatched fields.
>
> From a user perspective, I believe just spitting out nulls for every
> trigger until there is a match, and when there is a match spitting out the
> joined rows, should suffice, shouldn't it?
>
> Sorry if my thoughts are too naive!
>
> On Thu, Mar 8, 2018 at 6:14 PM, Tathagata Das  > wrote:
>
>> This doc is unrelated to the stream-stream join we added in Structured
>> Streaming. :)
>>
>> That said, we added append mode first because it is easier to reason about
>> the semantics of append mode, especially in the context of outer joins. You
>> output a row only when you know it won't ever be changed. The semantics of
>> update mode in outer joins is trickier to reason about and expose through
>> the APIs. Consider a left outer join. As soon as we get a left-side record
>> with a key K that does not have a match, do we output *(K, leftValue,
>> null)*? And if we do so, and then later get 2 matches from the right side,
>> we have to output *(K, leftValue, rightValue1)* and *(K, leftValue,
>> rightValue2)*. But how do we convey that *rightValue1* and *rightValue2*
>> together replace the earlier *null*, rather than *rightValue2* replacing
>> *rightValue1* replacing *null*?
>>
>> We will figure these out in future releases. For now, we have released
>> append mode, which allows quite a large range of use cases, including
>> multiple cascading joins.
>>
>> TD
>>
>>
>>
>> On Thu, Mar 8, 2018 at 9:18 AM, Gourav Sengupta <
>> gourav.sengu...@gmail.com> wrote:
>>
>>> super interesting.
>>>
>>> On Wed, Mar 7, 2018 at 11:44 AM, kant kodali  wrote:
>>>
 It looks to me that the StateStore described in this doc
 
  actually
 has full outer join, and every other join is a filter of that. Also, the doc
 talks about update mode, but it looks like Spark 2.3 ended up with append mode?
 Anyway, the moment it is in master I am ready to test, so JIRA tickets on
 this would help keep track. Please let me know.

 Thanks!

 On Tue, Mar 6, 2018 at 9:16 PM, kant kodali  wrote:

> Sorry I meant Spark 2.4 in my previous email
>
> On Tue, Mar 6, 2018 at 9:15 PM, kant kodali 
> wrote:
>
>> Hi TD,
>>
>> I agree; I think we are better off either with a full fix or no fix. I
>> am OK with the complete fix being available in master or some branch. I
>> guess the solution for me is to just build from source.
>>
>> On a similar note, I am not finding any JIRA tickets related to full
>> outer joins and update mode for, say, Spark 2.3. I wonder how hard it is
>> to implement both of these? It turns out that update mode and full outer
>> join are very useful and required in my case, therefore I'm just asking.
>>
>> Thanks!
>>
>> On Tue, Mar 6, 2018 at 6:25 PM, Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>>
>>> I thought about it.
>>> I am not 100% sure whether this fix should go into 2.3.1.
>>>
>>> There are two parts to this bug fix to enable self-joins.
>>>
>>> 1. Enabling deduping of leaf logical nodes by extending
>>> 

Re: How to run spark shell using YARN

2018-03-12 Thread Marcelo Vanzin
Looks like you either have a misconfigured HDFS service, or you're
using the wrong configuration on the client.

BTW, as I said in the previous response, the message you saw initially
is *not* an error. If you're just trying things out, you don't need to
do anything and Spark should still work.
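
The HDFS error quoted below ("could only be replicated to 0 nodes instead of
minReplication") generally means the NameNode could not hand the block off to
any live DataNode. A couple of hedged sanity checks from the client machine,
assuming the Hadoop CLI is installed and pointed at the same configuration:

    # List live/dead DataNodes and their capacity as the NameNode sees them
    hdfs dfsadmin -report

    # Try a plain HDFS write outside of Spark to confirm the client config works
    hdfs dfs -put /etc/hosts /tmp/hdfs-write-test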

On Mon, Mar 12, 2018 at 6:13 PM, kant kodali  wrote:
> Hi,
>
> I read that doc several times now. I am stuck with the below error message
> when I run ./spark-shell --master yarn --deploy-mode client.
>
> I have my HADOOP_CONF_DIR set to /usr/local/hadoop-2.7.3/etc/hadoop and
> SPARK_HOME set to /usr/local/spark on all 3 machines (1 node for Resource
> Manager and NameNode, 2 Nodes for Node Manager and DataNodes).
>
> Any idea?
>
>
>
> 18/03/13 00:19:13 INFO LineBufferedStream: stdout:
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> /user/centos/.sparkStaging/application_1520898664848_0003/__spark_libs__2434167523839846774.zip
> could only be replicated to 0 nodes instead of minReplication (=1).  There
> are 2 datanode(s) running and no node(s) are excluded in this operation.
>
>
> 18/03/13 00:19:13 INFO LineBufferedStream: stdout:  at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1571)
> 18/03/13 00:19:13 INFO LineBufferedStream: stdout:  at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3107)
> 18/03/13 00:19:13 INFO LineBufferedStream: stdout:  at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3031)
> 18/03/13 00:19:13 INFO LineBufferedStream: stdout:  at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:725)
> 18/03/13 00:19:13 INFO LineBufferedStream: stdout:  at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:492)
> 18/03/13 00:19:13 INFO LineBufferedStream: stdout:  at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> 18/03/13 00:19:13 INFO LineBufferedStream: stdout:  at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> 18/03/13 00:19:13 INFO LineBufferedStream: stdout:  at
> org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
> 18/03/13 00:19:13 INFO LineBufferedStream: stdout:  at
> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
> 18/03/13 00:19:13 INFO LineBufferedStream: stdout:  at
> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
> 18/03/13 00:19:13 INFO LineBufferedStream: stdout:  at
> java.security.AccessController.doPrivileged(Native Method)
> 18/03/13 00:19:13 INFO LineBufferedStream: stdout:  at
> javax.security.auth.Subject.doAs(Subject.java:422)
> 18/03/13 00:19:13 INFO LineBufferedStream: stdout:  at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
> 18/03/13 00:19:13 INFO LineBufferedStream: stdout:  at
> org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
> 18/03/13
>
>
> Thanks!
>
>
> On Mon, Mar 12, 2018 at 4:46 PM, Marcelo Vanzin  wrote:
>>
>> That's not an error, just a warning. The docs [1] have more info about
>> the config options mentioned in that message.
>>
>> [1] http://spark.apache.org/docs/latest/running-on-yarn.html
>>
>> On Mon, Mar 12, 2018 at 4:42 PM, kant kodali  wrote:
>> > Hi All,
>> >
>> > I am trying to use YARN for the very first time. I believe I configured
>> > all
>> > the resource manager and name node fine. And then I run the below
>> > command
>> >
>> > ./spark-shell --master yarn --deploy-mode client
>> >
>> > I get the below output and it hangs there forever (I had been waiting
>> > over
>> > 10 minutes)
>> >
>> > 18/03/12 23:36:32 WARN Client: Neither spark.yarn.jars nor
>> > spark.yarn.archive is set, falling back to uploading libraries under
>> > SPARK_HOME.
>> >
>> > Any idea?
>> >
>> > Thanks!
>>
>>
>>
>> --
>> Marcelo
>
>



-- 
Marcelo




Re: How to run spark shell using YARN

2018-03-12 Thread kant kodali
Hi,

I read that doc several times now. I am stuck with the below error message
when I run ./spark-shell --master yarn --deploy-mode client.

I have my HADOOP_CONF_DIR set to /usr/local/hadoop-2.7.3/etc/hadoop and
SPARK_HOME set to /usr/local/spark on all 3 machines (1 node for Resource
Manager and NameNode, 2 Nodes for Node Manager and DataNodes).

Any idea?



*18/03/13 00:19:13 INFO LineBufferedStream: stdout:
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
/user/centos/.sparkStaging/application_1520898664848_0003/__spark_libs__2434167523839846774.zip
could only be replicated to 0 nodes instead of minReplication (=1).  There
are 2 datanode(s) running and no node(s) are excluded in this operation.*


18/03/13 00:19:13 INFO LineBufferedStream: stdout:  at
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1571)
18/03/13 00:19:13 INFO LineBufferedStream: stdout:  at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3107)
18/03/13 00:19:13 INFO LineBufferedStream: stdout:  at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3031)
18/03/13 00:19:13 INFO LineBufferedStream: stdout:  at
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:725)
18/03/13 00:19:13 INFO LineBufferedStream: stdout:  at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:492)
18/03/13 00:19:13 INFO LineBufferedStream: stdout:  at
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
18/03/13 00:19:13 INFO LineBufferedStream: stdout:  at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
18/03/13 00:19:13 INFO LineBufferedStream: stdout:  at
org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
18/03/13 00:19:13 INFO LineBufferedStream: stdout:  at
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
18/03/13 00:19:13 INFO LineBufferedStream: stdout:  at
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
18/03/13 00:19:13 INFO LineBufferedStream: stdout:  at
java.security.AccessController.doPrivileged(Native Method)
18/03/13 00:19:13 INFO LineBufferedStream: stdout:  at
javax.security.auth.Subject.doAs(Subject.java:422)
18/03/13 00:19:13 INFO LineBufferedStream: stdout:  at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
18/03/13 00:19:13 INFO LineBufferedStream: stdout:  at
org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
18/03/13


Thanks!


On Mon, Mar 12, 2018 at 4:46 PM, Marcelo Vanzin  wrote:

> That's not an error, just a warning. The docs [1] have more info about
> the config options mentioned in that message.
>
> [1] http://spark.apache.org/docs/latest/running-on-yarn.html
>
> On Mon, Mar 12, 2018 at 4:42 PM, kant kodali  wrote:
> > Hi All,
> >
> > I am trying to use YARN for the very first time. I believe I configured
> all
> > the resource manager and name node fine. And then I run the below command
> >
> > ./spark-shell --master yarn --deploy-mode client
> >
> > I get the below output and it hangs there forever (I had been waiting
> over
> > 10 minutes)
> >
> > 18/03/12 23:36:32 WARN Client: Neither spark.yarn.jars nor
> > spark.yarn.archive is set, falling back to uploading libraries under
> > SPARK_HOME.
> >
> > Any idea?
> >
> > Thanks!
>
>
>
> --
> Marcelo
>


Re: How to run spark shell using YARN

2018-03-12 Thread Marcelo Vanzin
That's not an error, just a warning. The docs [1] have more info about
the config options mentioned in that message.

[1] http://spark.apache.org/docs/latest/running-on-yarn.html

On Mon, Mar 12, 2018 at 4:42 PM, kant kodali  wrote:
> Hi All,
>
> I am trying to use YARN for the very first time. I believe I configured all
> the resource manager and name node fine. And then I run the below command
>
> ./spark-shell --master yarn --deploy-mode client
>
> I get the below output and it hangs there forever (I had been waiting over
> 10 minutes)
>
> 18/03/12 23:36:32 WARN Client: Neither spark.yarn.jars nor
> spark.yarn.archive is set, falling back to uploading libraries under
> SPARK_HOME.
>
> Any idea?
>
> Thanks!



-- 
Marcelo




How to run spark shell using YARN

2018-03-12 Thread kant kodali
Hi All,

I am trying to use YARN for the very first time. I believe I have configured
the Resource Manager and NameNode correctly. And then I run the below command

./spark-shell --master yarn --deploy-mode client

*I get the below output and it hangs there forever* (I had been waiting
over 10 minutes)

18/03/12 23:36:32 WARN Client: Neither spark.yarn.jars nor
spark.yarn.archive is set, falling back to uploading libraries under
SPARK_HOME.

Any idea?

Thanks!


Re: OutOfDirectMemoryError for Spark 2.2

2018-03-12 Thread Dave Cameron
I believe jmap is only showing you the Java heap usage, but the program is
running out of direct memory space. They are two different pools of memory.

I haven't had to diagnose a direct memory problem before, but this blog
post has some suggestions on how to do it:
https://jkutner.github.io/2017/04/28/oh-the-places-your-java-memory-goes.html
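
As a side note, if the investigation does point at the direct-memory limit
itself, it is normally raised through Spark's JVM-options settings, roughly
like this (the 2g value is just a placeholder to be tuned):

    spark-submit \
        --conf "spark.driver.extraJavaOptions=-XX:MaxDirectMemorySize=2g" \
        --conf "spark.executor.extraJavaOptions=-XX:MaxDirectMemorySize=2g" \
        ...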


On Thu, Mar 8, 2018 at 1:57 AM, Chawla,Sumit  wrote:

> Hi
>
> Anybody got any pointers on this one?
>
> Regards
> Sumit Chawla
>
>
> On Tue, Mar 6, 2018 at 8:58 AM, Chawla,Sumit 
> wrote:
>
>> No, this is the only stack trace I get. I have tried DEBUG but didn't
>> notice much of a change in the logs.
>>
>> Yes, I have tried bumping MaxDirectMemorySize to get rid of this error.
>> It does work if I throw 4G+ memory at it. However, I am trying to
>> understand this behavior so that I can set this number to an appropriate
>> value.
>>
>> Regards
>> Sumit Chawla
>>
>>
>> On Tue, Mar 6, 2018 at 8:07 AM, Vadim Semenov 
>> wrote:
>>
>>> Do you have a trace? i.e. what's the source of `io.netty.*` calls?
>>>
>>> And have you tried bumping `-XX:MaxDirectMemorySize`?
>>>
>>> On Tue, Mar 6, 2018 at 12:45 AM, Chawla,Sumit 
>>> wrote:
>>>
 Hi All

 I have a job which processes a large dataset. All items in the dataset
 are unrelated. To save on cluster resources, I process these items in
 chunks. Since chunks are independent of each other, I start and shut down
 the Spark context for each chunk. This allows me to keep the DAG smaller and
 not retry the entire DAG in case of failures. This mechanism used to work
 fine with Spark 1.6. Now, as we have moved to 2.2, the job started
 failing with OutOfDirectMemoryError.

 2018-03-03 22:00:59,687 WARN  [rpc-server-48-1]
 server.TransportChannelHandler 
 (TransportChannelHandler.java:exceptionCaught(78))
 - Exception in connection from /10.66.73.27:60374

 io.netty.util.internal.OutOfDirectMemoryError: failed to allocate
 8388608 byte(s) of direct memory (used: 1023410176, max: 1029177344)

 at io.netty.util.internal.PlatformDependent.incrementMemoryCoun
 ter(PlatformDependent.java:506)

 at io.netty.util.internal.PlatformDependent.allocateDirectNoCle
 aner(PlatformDependent.java:460)

 at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolAre
 na.java:701)

 at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:690)

 at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:237)

 at io.netty.buffer.PoolArena.allocate(PoolArena.java:213)

 at io.netty.buffer.PoolArena.allocate(PoolArena.java:141)

 at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(Poole
 dByteBufAllocator.java:271)

 at io.netty.buffer.AbstractByteBufAllocator.directBuffer(Abstra
 ctByteBufAllocator.java:177)

 at io.netty.buffer.AbstractByteBufAllocator.directBuffer(Abstra
 ctByteBufAllocator.java:168)

 at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractBy
 teBufAllocator.java:129)

 at io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.all
 ocate(AdaptiveRecvByteBufAllocator.java:104)

 at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.re
 ad(AbstractNioByteChannel.java:117)

 at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEven
 tLoop.java:564)

 I got some clues about what is causing this from
 https://github.com/netty/netty/issues/6343. However, I am not able to add up
 the numbers on what is causing 1 GB of direct memory to fill up.

 Output from jmap


 7: 22230 1422720 io.netty.buffer.PoolSubpage

 12: 1370 804640 io.netty.buffer.PoolSubpage[]

 41: 3600 144000 io.netty.buffer.PoolChunkList

 98: 1440 46080 io.netty.buffer.PoolThreadCache$SubPageMemoryRegionCache

 113: 300 40800 io.netty.buffer.PoolArena$HeapArena

 114: 300 40800 io.netty.buffer.PoolArena$DirectArena

 192: 198 15840 io.netty.buffer.PoolChunk

 274: 120 8320 io.netty.buffer.PoolThreadCache$MemoryRegionCache[]

 406: 120 3840 io.netty.buffer.PoolThreadCache$NormalMemoryRegionCache

 422: 72 3552 io.netty.buffer.PoolArena[]

 458: 30 2640 io.netty.buffer.PooledUnsafeDirectByteBuf

 500: 36 2016 io.netty.buffer.PooledByteBufAllocator

 529: 32 1792 io.netty.buffer.UnpooledUnsafeHeapByteBuf

 589: 20 1440 io.netty.buffer.PoolThreadCache

 630: 37 1184 io.netty.buffer.EmptyByteBuf

 703: 36 864 io.netty.buffer.PooledByteBufAllocator$PoolThreadLocalCache

 852: 22 528 io.netty.buffer.AdvancedLeakAwareByteBuf

 889: 10 480 io.netty.buffer.SlicedAbstractByteBuf

 917: 8 448 io.netty.buffer.UnpooledHeapByteBuf

 1018: 20 320 

Spark UI Streaming batch time interval does not match batch interval

2018-03-12 Thread Jordan Pilat
Hello,

I am running a streaming app on Spark 2.1.2.  The batch interval is set to
5000ms, and when I go to the "Streaming" tab in the Spark UI, it correctly
reports a 5 second batch interval, but the list of batches below only shows one
batch every two minutes (i.e. the batch time for each batch is two minutes after
the prior batch).

Keep in mind, I am not referring to delay or processing time -- only to when each
batch starts.  What could account for this discrepancy?
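
For context, the 5000ms interval mentioned above is the value normally passed
when the StreamingContext is created; a sketch, not the poster's actual code,
assuming the app uses the DStream API (which is what the Streaming tab reports on):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("streaming-app")   // name is illustrative
    val ssc  = new StreamingContext(conf, Seconds(5))        // 5 second (5000ms) batch interval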




Time Series Functionality with Spark

2018-03-12 Thread Li Jin
Hi All,

This is Li Jin. We (my colleagues at Two Sigma and I) have been
using Spark for time series analysis for the past two years, and it has been
a success in scaling up our time series analysis.

Recently, we started a conversation with Reynold about potential
opportunities to collaborate on and improve time series functionality in
Spark in general. For that, we'd first like to get some feedback from the
community w.r.t. requirements for time series analysis.

I have created a doc here to cover what I think are the most common
patterns for time series analysis:
https://docs.google.com/document/d/1j69F4LLjetPykdfzedDN-EmP2f-I-5MPvVY4fugp2xs/edit?usp=sharing

If you have needs/use cases for time series analysis with Spark, we'd
appreciate your feedback.

Thank you


Re: Spark 2.3 submit on Kubernetes error

2018-03-12 Thread purna pradeep
Thanks Yinan,

I'm able to get the kube-dns endpoints when I run this command:

kubectl get ep kube-dns --namespace=kube-system

Do I need to deploy under kube-system instead of the default namespace?

And please let me know if you have any insights on Error 1.
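
Regarding Error 2 (the UnknownHostException for kubernetes.default.svc), one
common way to check whether cluster DNS resolves from inside a pod in the
namespace you submit to is something like the following; the busybox image is
just a convenient throwaway test container:

    kubectl run dns-test --rm -it --restart=Never --image=busybox \
        -- nslookup kubernetes.default.svc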

On Sun, Mar 11, 2018 at 8:26 PM Yinan Li  wrote:

> Spark on Kubernetes requires the presence of the kube-dns add-on properly
> configured. The executors connect to the driver through a headless
> Kubernetes service using the DNS name of the service. Can you check if you
> have the add-on installed in your cluster? This issue
> https://github.com/apache-spark-on-k8s/spark/issues/558 might help.
>
>
> On Sun, Mar 11, 2018 at 5:01 PM, purna pradeep 
> wrote:
>
>> Getting below errors when I’m trying to run spark-submit on k8 cluster
>>
>>
>> *Error 1*: This looks like a warning; it doesn't interrupt the app running
>> inside the executor pod, but the warning keeps appearing
>>
>>
>> *2018-03-09 11:15:21 WARN  WatchConnectionManager:192 - Exec Failure*
>> *java.io.EOFException*
>> *   at
>> okio.RealBufferedSource.require(RealBufferedSource.java:60)*
>> *   at
>> okio.RealBufferedSource.readByte(RealBufferedSource.java:73)*
>> *   at okhttp3.internal.ws
>> .WebSocketReader.readHeader(WebSocketReader.java:113)*
>> *   at okhttp3.internal.ws
>> .WebSocketReader.processNextFrame(WebSocketReader.java:97)*
>> *   at okhttp3.internal.ws
>> .RealWebSocket.loopReader(RealWebSocket.java:262)*
>> *   at okhttp3.internal.ws
>> .RealWebSocket$2.onResponse(RealWebSocket.java:201)*
>> *   at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141)*
>> *   at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)*
>> *   at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)*
>> *   at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)*
>> *   at java.lang.Thread.run(Thread.java:748)*
>>
>>
>>
>> *Error 2:* This is an intermittent error which prevents the executor pod
>> from running
>>
>>
>> *org.apache.spark.SparkException: External scheduler cannot be
>> instantiated*
>> * at
>> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2747)*
>> * at org.apache.spark.SparkContext.(SparkContext.scala:492)*
>> * at
>> org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2486)*
>> * at
>> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:930)*
>> * at
>> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:921)*
>> * at scala.Option.getOrElse(Option.scala:121)*
>> * at
>> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:921)*
>> * at
>> com.capitalone.quantum.spark.core.QuantumSession$.initialize(QuantumSession.scala:62)*
>> * at
>> com.capitalone.quantum.spark.core.QuantumSession$.getSparkSession(QuantumSession.scala:80)*
>> * at
>> com.capitalone.quantum.workflow.WorkflowApp$.getSession(WorkflowApp.scala:116)*
>> * at
>> com.capitalone.quantum.workflow.WorkflowApp$.main(WorkflowApp.scala:90)*
>> * at
>> com.capitalone.quantum.workflow.WorkflowApp.main(WorkflowApp.scala)*
>> *Caused by: io.fabric8.kubernetes.client.KubernetesClientException:
>> Operation: [get]  for kind: [Pod]  with name:
>> [myapp-ef79db3d9f4831bf85bda14145fdf113-driver-driver]  in namespace:
>> [default]  failed.*
>> * at
>> io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:62)*
>> * at
>> io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:71)*
>> * at
>> io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:228)*
>> * at
>> io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:184)*
>> * at
>> org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend.(KubernetesClusterSchedulerBackend.scala:70)*
>> * at
>> org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:120)*
>> * at
>> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2741)*
>> * ... 11 more*
>> *Caused by: java.net.UnknownHostException: kubernetes.default.svc:
>> Try again*
>> * at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)*
>> * at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)*
>> * at
>> java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)*
>> * at java.net.InetAddress.getAllByName0(InetAddress.java:1276)*
>> * at java.net.InetAddress.getAllByName(InetAddress.java:1192)*
>> *   

Re: Creating DataFrame with the implicit localSeqToDatasetHolder has bad performance

2018-03-12 Thread msinton
I think I understand: in the second case the DataFrame is created as a
local object, so it lives in the memory of the driver and is serialized as
part of the Task that gets sent to each executor.

Though I think the implicit conversion here is something that others could
also misunderstand - maybe it would be better if it were not part of
spark.implicits? Or at least a warning could be added to the developer
guides.






Creating DataFrame with the implicit localSeqToDatasetHolder has bad performance

2018-03-12 Thread msinton
Hi,

Using Scala, Spark version 2.3.0 (also 2.2.0): I've come across two main
ways to create a DataFrame from a sequence. The more common:

    (0 until 10).toDF("value")    *good*

and the less common (but still prevalent):

    (0 until 10).toDF("value")    *bad*

The latter results in much worse performance (for example in
df.agg(mean("value")).collect()). I don't know if it is a bug or a
misunderstanding that these two are equivalent? The latter appears to use the
implicit method localSeqToDatasetHolder while the former uses
rddToDatasetHolder. The difference in the physical plans is that the *good*
looks like:

    *(1) SerializeFromObject [input[0, int, false] AS value#2]
    +- Scan ExternalRDDScan[obj#1]

The *bad* looks like:

    LocalTableScan [value#1]

Even if this is not a bug, it would be great to learn more about what is
going on here and why I see such a huge performance difference. I've tried
to find some resources that would help me understand more about this but
I've struggled to get anywhere. (Looking at the source code I can follow
/what/ is going on to generate these plans, but I haven't found the
/why/.)

Many thanks,
Matt
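
For reference, a sketch of how one might reproduce the comparison in
spark-shell and inspect the plans. The use of an RDD (via parallelize) for
the *good* variant is my assumption, since the two snippets above render
identically in the archive, but it matches the ExternalRDDScan plan and the
rddToDatasetHolder implicit described in the post:

    import org.apache.spark.sql.functions.mean
    import spark.implicits._   // provides both toDF implicits in spark-shell

    // RDD-backed: goes through rddToDatasetHolder, plan shows ExternalRDDScan
    val viaRdd = spark.sparkContext.parallelize(0 until 10).toDF("value")
    viaRdd.explain()

    // Local sequence: goes through localSeqToDatasetHolder, plan shows LocalTableScan
    val viaSeq = (0 until 10).toDF("value")
    viaSeq.explain()

    // The aggregation used for the timing comparison in the post
    viaRdd.agg(mean("value")).collect()
    viaSeq.agg(mean("value")).collect()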


