Re: Number of records per micro-batch in DStream vs Structured Streaming

2018-07-08 Thread subramgr
Anyone?






Re: How to avoid duplicate column names after join with multiple conditions

2018-07-08 Thread Vamshi Talla
Nirav,

Spark does not create a duplicate column when you pass the join key(s) as a
list of column names, as below, but that requires the column name to be the
same in both DataFrames.

Example: df1.join(df2, ['a'])
Thanks.
Vamshi Talla

On Jul 6, 2018, at 4:47 PM, Gokula Krishnan D <email2...@gmail.com> wrote:

Nirav,

The withColumnRenamed() API might help, but it does not distinguish between the
two source columns; it renames all occurrences of the given column name.
Alternatively, use the select() API and rename the columns as you want.
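
For instance, a minimal Scala sketch of both routes (df1/df2 and the column
names follow Nirav's expression below; untested, illustrative only):

// df1 has columns a, b; df2 has columns a, c
val joined = df1.join(df2, df1("a") === df2("a") && df1("b") === df2("c"))

// Option 1: drop the duplicate join column by referencing its source frame
val deduped = joined.drop(df2("a"))

// Option 2: rename first so the equi-join keys share names, then join on a
// Seq of names, which keeps a single copy of each key column
val clean = df1.join(df2.withColumnRenamed("c", "b"), Seq("a", "b"))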



Thanks & Regards,
Gokula Krishnan (Gokul)

On Mon, Jul 2, 2018 at 5:52 PM, Nirav Patel <npa...@xactlycorp.com> wrote:
Expr is `df1(a) === df2(a) and df1(b) === df2(c)`

How can I avoid the duplicate column 'a' in the result? I don't see any API
that combines both. Rename manually?







Re: repartition

2018-07-08 Thread Vamshi Talla
Hi Ravi,

RDDs are immutable, so you cannot change them; instead you create new ones by
transforming existing ones. repartition is a transformation, so it is lazily
evaluated and computed only when you call an action on the result.
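
A minimal sketch of what that means in practice (df is a hypothetical
single-partition DataFrame):

val newdf = df.repartition(3) // a transformation: builds a new plan, no shuffle yet
df.rdd.getNumPartitions       // still 1; df itself is unchanged
newdf.rdd.getNumPartitions    // 3, derived from the plan without running a job
newdf.count()                 // an action: the shuffle actually executes here

Neither DataFrame holds materialized data unless you cache it, so there are
not two copies of the data sitting in memory.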

Thanks.
Vamshi Talla

On Jul 8, 2018, at 12:26 PM, <ryanda...@gmail.com> wrote:

Hi,

Can anyone clarify how repartition works, please?


  *   I have a DataFrame df which has only one partition:


// Returns 1
df.rdd.getNumPartitions

• I repartitioned it by passing “3” and assigned it a new DataFrame 
newdf

val newdf = df.repartition(3)


• newdf shows 3 as number of partitions
// Returns 3
newdf.rdd.getNumPartitions

• df still shows 1

// Returns 1
df.rdd.getNumPartitions

My questions are:

  1.  How does repartition work? Does it copy the original DataFrame and create X
partitions as specified by repartition? If that is the case, aren't there two
copies of the same data in memory, as shown in the diagram below?

Or is my understanding incorrect?

From the executions above, it looks like there are two copies, since after
repartition, df still has 1 partition!

  2.  Is repartition executed immediately, or does it wait for some trigger [a
kind of action]?





Thanks,
Ravi



Re: spark-shell gets stuck in ACCEPTED state forever when run in YARN client mode.

2018-07-08 Thread yohann jardin
When you run on YARN, you don’t even need to start a Spark cluster (Spark
master and slaves). YARN receives a job and then allocates resources for the
application master and then for its workers.

Check the resources available in the node section of the resource manager UI 
(and is your node actually detected as alive?), as well as the scheduler 
section to check the default queue resources.
If you seem to lack resources for your driver, you can try to reduce the driver 
memory by specifying “--driver-memory 512” for example, but I’d expect the 
default of 1g to be low enough based on what you showed us.
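
For instance (illustrative only; 512m is an arbitrary value):

./bin/spark-shell --master yarn --deploy-mode client --driver-memory 512m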

Yohann Jardin

Le 7/8/2018 à 6:11 PM, kant kodali a écrit :
@yohann Sorry, I am assuming you meant the application master; if so, I believe
Spark is the one that provides the application master. Is there any way to see
how many resources are being requested and how many YARN is allowed to provide?
I would assume this is a common case, so I am not sure why these numbers are
not part of the resource manager logs.

On Sun, Jul 8, 2018 at 8:09 AM, kant kodali <kanth...@gmail.com> wrote:
yarn.scheduler.capacity.maximum-am-resource-percent is set to 0.1 by default,
and I tried changing it to 1.0 and still no luck; the same problem persists.
The master here is yarn, and I am just trying to spawn spark-shell --master
yarn --deploy-mode client and run a simple word count, so I am not sure why it
would request more resources.

On Sun, Jul 8, 2018 at 8:02 AM, yohann jardin <yohannjar...@hotmail.com> wrote:

Following the logs from the resource manager:

2018-07-08 07:23:23,382 WARN 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
maximum-am-resource-percent is insufficient to start a single application in 
queue, it is likely set too low. skipping enforcement to allow at least one 
application to start

2018-07-08 07:23:23,382 WARN 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
maximum-am-resource-percent is insufficient to start a single application in 
queue for user, it is likely set too low. skipping enforcement to allow at 
least one application to start

I’d say it has nothing to do with Spark. Your application master is just asking
for more resources than the default YARN queue is allowed to provide.
You might take a look at 
https://hadoop.apache.org/docs/r2.7.3/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
 and search for maximum-am-resource-percent.
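
For reference, raising the limit would look something like this in
capacity-scheduler.xml (property name per the linked docs; the 0.5 value is
only an illustration):

<property>
  <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
  <value>0.5</value>
</property>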

Regards,

Yohann Jardin

Le 7/8/2018 à 4:40 PM, kant kodali a écrit :
Hi,

It's on a local MacBook Pro with 16GB RAM, a 512GB disk, and 8 vCPUs! I am not
running any code, since I can't even spawn spark-shell with yarn as master, as
described in my previous email. I just want to run a simple word count using
yarn as master.

Thanks!

Below is the resource manager log once again if that helps


2018-07-08 07:23:23,343 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
Application added - appId: application_1531059242261_0001 user: xxx leaf-queue 
of parent: root #applications: 1

2018-07-08 07:23:23,344 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
 Accepted application application_1531059242261_0001 from user: xxx, in queue: 
default

2018-07-08 07:23:23,350 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: 
application_1531059242261_0001 State change from SUBMITTED to ACCEPTED on 
event=APP_ACCEPTED

2018-07-08 07:23:23,370 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: 
Registering app attempt : appattempt_1531059242261_0001_01

2018-07-08 07:23:23,370 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
appattempt_1531059242261_0001_01 State change from NEW to SUBMITTED

2018-07-08 07:23:23,382 WARN 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
maximum-am-resource-percent is insufficient to start a single application in 
queue, it is likely set too low. skipping enforcement to allow at least one 
application to start

2018-07-08 07:23:23,382 WARN 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
maximum-am-resource-percent is insufficient to start a single application in 
queue for user, it is likely set too low. skipping enforcement to allow at 
least one application to start

2018-07-08 07:23:23,382 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
Application application_1531059242261_0001 from user: xxx activated in queue: 
default

2018-07-08 07:23:23,382 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
Application added - appId: application_1531059242261_0001 user: 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue$User@476750cd,
 leaf-queue: default #user-pending-applications: 0 #user-active-applications: 1 
#queue-pending-applications: 0 #queue-active-applications: 1


repartition

2018-07-08 Thread ryandam.9
Hi,

Can anyone clarify how repartition works, please?

*   I have a DataFrame df which has only one partition:

// Returns 1
df.rdd.getNumPartitions

*   I repartitioned it by passing "3" and assigned it to a new DataFrame,
newdf:

val newdf = df.repartition(3)

*   newdf shows 3 as its number of partitions:

// Returns 3
newdf.rdd.getNumPartitions

*   df still shows 1:

// Returns 1
df.rdd.getNumPartitions

My questions are:

1.  How does repartition work? Does it copy the original DataFrame and
create X partitions as specified by repartition? If that is the case,
aren't there two copies of the same data in memory, as shown in the
diagram below? Or is my understanding incorrect?

From the executions above, it looks like there are two copies, since
after repartition, df still has 1 partition!

2.  Is repartition executed immediately, or does it wait for some trigger
[a kind of action]?

Thanks,

Ravi



Re: spark-shell gets stuck in ACCEPTED state forever when run in YARN client mode.

2018-07-08 Thread kant kodali
@yohann Sorry, I am assuming you meant the application master; if so, I believe
Spark is the one that provides the application master. Is there any way to see
how many resources are being requested and how many YARN is allowed to provide?
I would assume this is a common case, so I am not sure why these numbers are
not part of the resource manager logs.

On Sun, Jul 8, 2018 at 8:09 AM, kant kodali  wrote:

> yarn.scheduler.capacity.maximum-am-resource-percent by default is set to
> 0.1 and I tried changing it to 1.0 and still no luck. same problem
> persists. The master here is yarn and I just trying to spawn spark-shell
> --master yarn --deploy-mode client and run a simple world count so I am not
> sure why it would request for more resources?
>
> On Sun, Jul 8, 2018 at 8:02 AM, yohann jardin 
> wrote:
>
>> Following the logs from the resource manager:
>>
>> 2018-07-08 07:23:23,382 WARN org.apache.hadoop.yarn.server.
>> resourcemanager.scheduler.capacity.LeafQueue:
>> maximum-am-resource-percent is insufficient to start a single
>> application in queue, it is likely set too low. skipping enforcement to
>> allow at least one application to start
>>
>> 2018-07-08 07:23:23,382 WARN org.apache.hadoop.yarn.server.
>> resourcemanager.scheduler.capacity.LeafQueue:
>> maximum-am-resource-percent is insufficient to start a single
>> application in queue for user, it is likely set too low. skipping
>> enforcement to allow at least one application to start
>>
>> I’d say it has nothing to do with spark. Your master is just asking more
>> resources than the default Yarn queue is allowed to provide.
>> You might take a look at https://hadoop.apache.org/docs
>> /r2.7.3/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html and search
>> for maximum-am-resource-percent.
>>
>> Regards,
>>
>> *Yohann Jardin*
>> Le 7/8/2018 à 4:40 PM, kant kodali a écrit :
>>
>> Hi,
>>
>> It's on local mac book pro machine that has 16GB RAM 512GB disk and 8
>> vCpu! I am not running any code since I can't even spawn spark-shell with
>> yarn as master as described in my previous email. I just want to run simple
>> word count using yarn as master.
>>
>> Thanks!
>>
>> Below is the resource manager log once again if that helps
>>
>>
>> 2018-07-08 07:23:23,343 INFO org.apache.hadoop.yarn.server.
>> resourcemanager.scheduler.capacity.ParentQueue: Application added -
>> appId: application_1531059242261_0001 user: xxx leaf-queue of parent: root 
>> #applications:
>> 1
>>
>> 2018-07-08 07:23:23,344 INFO org.apache.hadoop.yarn.server.
>> resourcemanager.scheduler.capacity.CapacityScheduler: Accepted
>> application application_1531059242261_0001 from user: xxx, in queue:
>> default
>>
>> 2018-07-08 07:23:23,350 INFO org.apache.hadoop.yarn.server.
>> resourcemanager.rmapp.RMAppImpl: application_1531059242261_0001 State
>> change from SUBMITTED to ACCEPTED on event=APP_ACCEPTED
>>
>> 2018-07-08 07:23:23,370 INFO org.apache.hadoop.yarn.server.
>> resourcemanager.ApplicationMasterService: Registering app attempt :
>> appattempt_1531059242261_0001_01
>>
>> 2018-07-08 07:23:23,370 INFO org.apache.hadoop.yarn.server.
>> resourcemanager.rmapp.attempt.RMAppAttemptImpl:
>> appattempt_1531059242261_0001_01 State change from NEW to SUBMITTED
>>
>> 2018-07-08 07:23:23,382 WARN org.apache.hadoop.yarn.server.
>> resourcemanager.scheduler.capacity.LeafQueue:
>> maximum-am-resource-percent is insufficient to start a single
>> application in queue, it is likely set too low. skipping enforcement to
>> allow at least one application to start
>>
>> 2018-07-08 07:23:23,382 WARN org.apache.hadoop.yarn.server.
>> resourcemanager.scheduler.capacity.LeafQueue:
>> maximum-am-resource-percent is insufficient to start a single
>> application in queue for user, it is likely set too low. skipping
>> enforcement to allow at least one application to start
>>
>> 2018-07-08 07:23:23,382 INFO org.apache.hadoop.yarn.server.
>> resourcemanager.scheduler.capacity.LeafQueue: Application
>> application_1531059242261_0001 from user: xxx activated in queue: default
>>
>> 2018-07-08 07:23:23,382 INFO org.apache.hadoop.yarn.server.
>> resourcemanager.scheduler.capacity.LeafQueue: Application added - appId:
>> application_1531059242261_0001 user: org.apache.hadoop.yarn.server.
>> resourcemanager.scheduler.capacity.LeafQueue$User@476750cd, leaf-queue:
>> default #user-pending-applications: 0 #user-active-applications: 1
>> #queue-pending-applications: 0 #queue-active-applications: 1
>>
>> 2018-07-08 07:23:23,382 INFO org.apache.hadoop.yarn.server.
>> resourcemanager.scheduler.capacity.CapacityScheduler: Added Application
>> Attempt appattempt_1531059242261_0001_01 to scheduler from user xxx
>> in queue default
>>
>> 2018-07-08 07:23:23,386 INFO org.apache.hadoop.yarn.server.
>> resourcemanager.rmapp.attempt.RMAppAttemptImpl:
>> appattempt_1531059242261_0001_01 State change from SUBMITTED to
>> SCHEDULED
>>
>>
>>
>>
>


Re: spark-shell gets stuck in ACCEPTED state forever when run in YARN client mode.

2018-07-08 Thread kant kodali
yarn.scheduler.capacity.maximum-am-resource-percent is set to 0.1 by default,
and I tried changing it to 1.0 and still no luck; the same problem persists.
The master here is yarn, and I am just trying to spawn spark-shell --master
yarn --deploy-mode client and run a simple word count, so I am not sure why it
would request more resources.

On Sun, Jul 8, 2018 at 8:02 AM, yohann jardin 
wrote:

> Following the logs from the resource manager:
>
> 2018-07-08 07:23:23,382 WARN org.apache.hadoop.yarn.server.
> resourcemanager.scheduler.capacity.LeafQueue: maximum-am-resource-percent
> is insufficient to start a single application in queue, it is likely set
> too low. skipping enforcement to allow at least one application to start
>
> 2018-07-08 07:23:23,382 WARN org.apache.hadoop.yarn.server.
> resourcemanager.scheduler.capacity.LeafQueue: maximum-am-resource-percent
> is insufficient to start a single application in queue for user, it is
> likely set too low. skipping enforcement to allow at least one
> application to start
>
> I’d say it has nothing to do with spark. Your master is just asking more
> resources than the default Yarn queue is allowed to provide.
> You might take a look at https://hadoop.apache.org/
> docs/r2.7.3/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html and
> search for maximum-am-resource-percent.
>
> Regards,
>
> *Yohann Jardin*
> Le 7/8/2018 à 4:40 PM, kant kodali a écrit :
>
> Hi,
>
> It's on local mac book pro machine that has 16GB RAM 512GB disk and 8
> vCpu! I am not running any code since I can't even spawn spark-shell with
> yarn as master as described in my previous email. I just want to run simple
> word count using yarn as master.
>
> Thanks!
>
> Below is the resource manager log once again if that helps
>
>
> 2018-07-08 07:23:23,343 INFO org.apache.hadoop.yarn.server.
> resourcemanager.scheduler.capacity.ParentQueue: Application added -
> appId: application_1531059242261_0001 user: xxx leaf-queue of parent: root 
> #applications:
> 1
>
> 2018-07-08 07:23:23,344 INFO org.apache.hadoop.yarn.server.
> resourcemanager.scheduler.capacity.CapacityScheduler: Accepted
> application application_1531059242261_0001 from user: xxx, in queue:
> default
>
> 2018-07-08 07:23:23,350 INFO org.apache.hadoop.yarn.server.
> resourcemanager.rmapp.RMAppImpl: application_1531059242261_0001 State
> change from SUBMITTED to ACCEPTED on event=APP_ACCEPTED
>
> 2018-07-08 07:23:23,370 INFO org.apache.hadoop.yarn.server.
> resourcemanager.ApplicationMasterService: Registering app attempt :
> appattempt_1531059242261_0001_01
>
> 2018-07-08 07:23:23,370 INFO org.apache.hadoop.yarn.server.
> resourcemanager.rmapp.attempt.RMAppAttemptImpl:
> appattempt_1531059242261_0001_01 State change from NEW to SUBMITTED
>
> 2018-07-08 07:23:23,382 WARN org.apache.hadoop.yarn.server.
> resourcemanager.scheduler.capacity.LeafQueue: maximum-am-resource-percent
> is insufficient to start a single application in queue, it is likely set
> too low. skipping enforcement to allow at least one application to start
>
> 2018-07-08 07:23:23,382 WARN org.apache.hadoop.yarn.server.
> resourcemanager.scheduler.capacity.LeafQueue: maximum-am-resource-percent
> is insufficient to start a single application in queue for user, it is
> likely set too low. skipping enforcement to allow at least one
> application to start
>
> 2018-07-08 07:23:23,382 INFO org.apache.hadoop.yarn.server.
> resourcemanager.scheduler.capacity.LeafQueue: Application
> application_1531059242261_0001 from user: xxx activated in queue: default
>
> 2018-07-08 07:23:23,382 INFO org.apache.hadoop.yarn.server.
> resourcemanager.scheduler.capacity.LeafQueue: Application added - appId:
> application_1531059242261_0001 user: org.apache.hadoop.yarn.server.
> resourcemanager.scheduler.capacity.LeafQueue$User@476750cd, leaf-queue:
> default #user-pending-applications: 0 #user-active-applications: 1
> #queue-pending-applications: 0 #queue-active-applications: 1
>
> 2018-07-08 07:23:23,382 INFO org.apache.hadoop.yarn.server.
> resourcemanager.scheduler.capacity.CapacityScheduler: Added Application
> Attempt appattempt_1531059242261_0001_01 to scheduler from user xxx in
> queue default
>
> 2018-07-08 07:23:23,386 INFO org.apache.hadoop.yarn.server.
> resourcemanager.rmapp.attempt.RMAppAttemptImpl:
> appattempt_1531059242261_0001_01 State change from SUBMITTED to
> SCHEDULED
>
>
>
>


Re: spark-shell gets stuck in ACCEPTED state forever when run in YARN client mode.

2018-07-08 Thread yohann jardin
Following the logs from the resource manager:

2018-07-08 07:23:23,382 WARN 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
maximum-am-resource-percent is insufficient to start a single application in 
queue, it is likely set too low. skipping enforcement to allow at least one 
application to start

2018-07-08 07:23:23,382 WARN 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
maximum-am-resource-percent is insufficient to start a single application in 
queue for user, it is likely set too low. skipping enforcement to allow at 
least one application to start

I’d say it has nothing to do with Spark. Your application master is just asking
for more resources than the default YARN queue is allowed to provide.
You might take a look at 
https://hadoop.apache.org/docs/r2.7.3/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
 and search for maximum-am-resource-percent.

Regards,

Yohann Jardin

Le 7/8/2018 à 4:40 PM, kant kodali a écrit :
Hi,

It's on a local MacBook Pro with 16GB RAM, a 512GB disk, and 8 vCPUs! I am not
running any code, since I can't even spawn spark-shell with yarn as master, as
described in my previous email. I just want to run a simple word count using
yarn as master.

Thanks!

Below is the resource manager log once again if that helps


2018-07-08 07:23:23,343 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
Application added - appId: application_1531059242261_0001 user: xxx leaf-queue 
of parent: root #applications: 1

2018-07-08 07:23:23,344 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
 Accepted application application_1531059242261_0001 from user: xxx, in queue: 
default

2018-07-08 07:23:23,350 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: 
application_1531059242261_0001 State change from SUBMITTED to ACCEPTED on 
event=APP_ACCEPTED

2018-07-08 07:23:23,370 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: 
Registering app attempt : appattempt_1531059242261_0001_01

2018-07-08 07:23:23,370 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
appattempt_1531059242261_0001_01 State change from NEW to SUBMITTED

2018-07-08 07:23:23,382 WARN 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
maximum-am-resource-percent is insufficient to start a single application in 
queue, it is likely set too low. skipping enforcement to allow at least one 
application to start

2018-07-08 07:23:23,382 WARN 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
maximum-am-resource-percent is insufficient to start a single application in 
queue for user, it is likely set too low. skipping enforcement to allow at 
least one application to start

2018-07-08 07:23:23,382 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
Application application_1531059242261_0001 from user: xxx activated in queue: 
default

2018-07-08 07:23:23,382 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
Application added - appId: application_1531059242261_0001 user: 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue$User@476750cd,
 leaf-queue: default #user-pending-applications: 0 #user-active-applications: 1 
#queue-pending-applications: 0 #queue-active-applications: 1

2018-07-08 07:23:23,382 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
 Added Application Attempt appattempt_1531059242261_0001_01 to scheduler 
from user xxx in queue default

2018-07-08 07:23:23,386 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
appattempt_1531059242261_0001_01 State change from SUBMITTED to SCHEDULED





Re: Create an Empty dataframe

2018-07-08 Thread रविशंकर नायर
From Stack Overflow:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

sc = SparkContext(conf=SparkConf())
spark = SparkSession(sc)  # Need to use SparkSession(sc) to call createDataFrame

schema = StructType([
    StructField("column1", StringType(), True),
    StructField("column2", StringType(), True)
])
empty = spark.createDataFrame(sc.emptyRDD(), schema)
# addOndata is assumed to be an existing DataFrame with the same schema
empty = empty.unionAll(addOndata)

Best,
Ravion

On Sun, Jul 8, 2018 at 10:44 AM Shmuel Blitz 
wrote:

> Hi Dimitris,
>
> Could you explain your use case in a bit more details?
>
> What you are asking for, if I understand you correctly, is not the advised
> way to go about it.
>
> If you're running analytics and expect their output to be a Dataframe with
> the specified columns, then you should compose your queries in such a way
> that they result in a DataFrame.
>
> If you're preparing data to be analyzed (i.e. getting the input ready for
> manipulation), then I expect you to be doing one of the following:
> a. Read in the data using one of Spark's provided input APIs (e.g. reading
> a parquet file directly into a DataFrame)
> b. Read/prepare your data as a standard collection in your language
> (Python, in your case, but the same in Scala/Java/etc.), and then use
> Spark's API to parallelize the data and/or convert it into a DataFrame.
>
> One way or another, you want to be using the Spark API for work that should
> be distributed to workers (heavy load, large amounts of data), and use your
> native language API, which usually is much more powerful, to run
> bootstrapping and light-weight preparations.
>
> Regards,
> Shmuel
>
> On Sat, Jun 30, 2018 at 6:51 PM Apostolos N. Papadopoulos <
> papad...@csd.auth.gr> wrote:
>
>> Hi Dimitri,
>>
>> you can do the following:
>>
>> 1. create an initial dataframe from an empty csv
>>
>> 2. use "union" to insert new rows
>>
>> Do not forget that Spark cannot replace a DBMS. Spark is mainly used
>> for analytics.
>>
>> If you need select/insert/delete/update capabilities, perhaps you should
>> look at a DBMS.
>>
>>
>> Another alternative, in case you need "append only" semantics, is to use
>> streaming or structured streaming.
>>
>>
>> regards,
>>
>> Apostolos
>>
>>
>>
>>
>> On 30/06/2018 05:46 μμ, dimitris plakas wrote:
>> > I am new to Pyspark and want to initialize a new empty dataframe with
>> > sqlContext() with two columns ("Column1", "Column2"), and i want to
>> > append rows dynamically in a for loop.
>> > Is there any way to achieve this?
>> >
>> > Thank you in advance.
>>
>> --
>> Apostolos N. Papadopoulos, Associate Professor
>> Department of Informatics
>> Aristotle University of Thessaloniki
>> Thessaloniki, GREECE
>> tel: ++0030312310991918
>> email: papad...@csd.auth.gr
>> twitter: @papadopoulos_ap
>> web: http://delab.csd.auth.gr/~apostol
>>
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>
> --
> Shmuel Blitz
> Big Data Developer
> Email: shmuel.bl...@similarweb.com
> www.similarweb.com
> 
>
> 
> 
> 
>


Re: spark-shell gets stuck in ACCEPTED state forever when run in YARN client mode.

2018-07-08 Thread रविशंकर नायर
Are you able to run a simple MapReduce job on YARN without any issues?

If you have any issues: I had this problem on a Mac. Use csrutil on the Mac to
disable System Integrity Protection. Then add a softlink:

sudo ln -s /usr/bin/java /bin/java


macOS versions from El Capitan onwards do not allow softlinks in
/bin/java.


I got everything working with the above.


Best,

Ravion



On Sun, Jul 8, 2018 at 10:20 AM Marco Mistroni  wrote:

> Are you running on EMR? Have you checked the EMR logs?
> I was in a similar situation where a job was stuck in ACCEPTED and then it
> died. It turned out to be an issue with my code when running with huge data.
> Perhaps try to gradually reduce the load until it works, and then start from
> there?
> Not a huge help, but I followed the same approach when my job was stuck on
> ACCEPTED. Hth
>
> On Sun, Jul 8, 2018, 2:59 PM kant kodali  wrote:
>
>> Hi All,
>>
>> I am trying to run a simple word count using YARN as a cluster manager.
>> I am currently using Spark 2.3.1 and Apache hadoop 2.7.3.  When I spawn
>> spark-shell like below it gets stuck in ACCEPTED stated forever.
>>
>> ./bin/spark-shell --master yarn --deploy-mode client
>>
>>
>> I set my log4j.properties in SPARK_HOME/conf to TRACE
>>
>>  queue: "default" name: "Spark shell" host: "N/A" rpc_port: -1
>> yarn_application_state: ACCEPTED trackingUrl: "
>> http://Kants-MacBook-Pro-2.local:8088/proxy/application_1531056583425_0001/;
>> diagnostics: "" startTime: 1531056632496 finishTime: 0
>> final_application_status: APP_UNDEFINED app_resource_Usage {
>> num_used_containers: 0 num_reserved_containers: 0 used_resources { memory:
>> 0 virtual_cores: 0 } reserved_resources { memory: 0 virtual_cores: 0 }
>> needed_resources { memory: 0 virtual_cores: 0 } memory_seconds: 0
>> vcore_seconds: 0 } originalTrackingUrl: "N/A" currentApplicationAttemptId {
>> application_id { id: 1 cluster_timestamp: 1531056583425 } attemptId: 1 }
>> progress: 0.0 applicationType: "SPARK" }}
>>
>> 18/07/08 06:32:22 INFO Client: Application report for
>> application_1531056583425_0001 (state: ACCEPTED)
>>
>> 18/07/08 06:32:22 DEBUG Client:
>>
>> client token: N/A
>>
>> diagnostics: N/A
>>
>> ApplicationMaster host: N/A
>>
>> ApplicationMaster RPC port: -1
>>
>> queue: default
>>
>> start time: 1531056632496
>>
>> final status: UNDEFINED
>>
>> tracking URL:
>> http://xxx-MacBook-Pro-2.local:8088/proxy/application_1531056583425_0001/
>>
>> user: xxx
>>
>>
>>
>> 18/07/08 06:32:20 DEBUG Client:
>>
>> client token: N/A
>>
>> diagnostics: N/A
>>
>> ApplicationMaster host: N/A
>>
>> ApplicationMaster RPC port: -1
>>
>> queue: default
>>
>> start time: 1531056632496
>>
>> final status: UNDEFINED
>>
>> tracking URL:
>> http://Kants-MacBook-Pro-2.local:8088/proxy/application_1531056583425_0001/
>>
>> user: kantkodali
>>
>>
>> 18/07/08 06:32:21 TRACE ProtobufRpcEngine: 1: Call -> /0.0.0.0:8032:
>> getApplicationReport {application_id { id: 1 cluster_timestamp:
>> 1531056583425 }}
>>
>> 18/07/08 06:32:21 DEBUG Client: IPC Client (1608805714) connection to /
>> 0.0.0.0:8032 from kantkodali sending #136
>>
>> 18/07/08 06:32:21 DEBUG Client: IPC Client (1608805714) connection to /
>> 0.0.0.0:8032 from kantkodali got value #136
>>
>> 18/07/08 06:32:21 DEBUG ProtobufRpcEngine: Call: getApplicationReport
>> took 1ms
>>
>> 18/07/08 06:32:21 TRACE ProtobufRpcEngine: 1: Response <- /0.0.0.0:8032:
>> getApplicationReport {application_report { applicationId { id: 1
>> cluster_timestamp: 1531056583425 } user: "xxx" queue: "default" name:
>> "Spark shell" host: "N/A" rpc_port: -1 yarn_application_state: ACCEPTED
>> trackingUrl: "
>> http://xxx-MacBook-Pro-2.local:8088/proxy/application_1531056583425_0001/;
>> diagnostics: "" startTime: 1531056632496 finishTime: 0
>> final_application_status: APP_UNDEFINED app_resource_Usage {
>> num_used_containers: 0 num_reserved_containers: 0 used_resources { memory:
>> 0 virtual_cores: 0 } reserved_resources { memory: 0 virtual_cores: 0 }
>> needed_resources { memory: 0 virtual_cores: 0 } memory_seconds: 0
>> vcore_seconds: 0 } originalTrackingUrl: "N/A" currentApplicationAttemptId {
>> application_id { id: 1 cluster_timestamp: 1531056583425 } attemptId: 1 }
>> progress: 0.0 applicationType: "SPARK" }}
>>
>> 18/07/08 06:32:21 INFO Client: Application report for
>> application_1531056583425_0001 (state: ACCEPTED)
>>
>>
>> I have read this link
>> 
>>  and
>> here are the conf files that are different from default settings
>>
>>
>> *yarn-site.xml*
>>
>>
>> 
>>
>>
>> 
>>
>> yarn.nodemanager.aux-services
>>
>> mapreduce_shuffle
>>
>> 
>>
>>
>> 
>>
>> yarn.nodemanager.resource.memory-mb
>>
>> 16384
>>
>> 
>>
>>
>> 
>>
>>yarn.scheduler.minimum-allocation-mb
>>
>>256
>>
>> 
>>
>>
>> 
>>
>>yarn.scheduler.maximum-allocation-mb
>>
>>8192
>>
>> 
>>
>>
>>
>>
>>yarn.nodemanager.resource.cpu-vcores
>>
>>8
>>
>>  

Re: Create an Empty dataframe

2018-07-08 Thread Shmuel Blitz
Hi Dimitris,

Could you explain your use case in a bit more details?

What you are asking for, if I understand you correctly, is not the advised
way to go about it.

If you're running analytics and expect their output to be a Dataframe with
the specified columns, then you should compose your queries in such a way
that they result in a DataFrame.

If you're preparing data to be analyzed (i.e. getting the input ready for
manipulation), then I expect you to be doing one of the following:
a. Read in the data using one of Spark's provided input APIs (e.g. reading
a parquet file directly into a DataFrame)
b. Read/prepare your data as a standard collection in your language
(Python, in your case, but the same in Scala/Java/etc.), and then use
Spark's API to parallelize the data and/or convert it into a DataFrame.

One way or another, you want to be using the Spark API for work that should be
distributed to workers (heavy load, large amounts of data), and use your
native language API, which usually is much more powerful, to run
bootstrapping and light-weight preparations.
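
As a quick Scala sketch of option (b) above (the data and column names are
made up):

// assumes an existing SparkSession named spark
import spark.implicits._

val rows = Seq(("a", 1), ("b", 2))   // small driver-side collection
val df   = rows.toDF("key", "value") // parallelized into a DataFrame
// equivalently: spark.createDataFrame(rows).toDF("key", "value")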

Regards,
Shmuel

On Sat, Jun 30, 2018 at 6:51 PM Apostolos N. Papadopoulos <
papad...@csd.auth.gr> wrote:

> Hi Dimitri,
>
> you can do the following:
>
> 1. create an initial dataframe from an empty csv
>
> 2. use "union" to insert new rows
>
> Do not forget that Spark cannot replace a DBMS. Spark is mainly used
> for analytics.
>
> If you need select/insert/delete/update capabilities, perhaps you should
> look at a DBMS.
>
>
> Another alternative, in case you need "append only" semantics, is to use
> streaming or structured streaming.
>
>
> regards,
>
> Apostolos
>
>
>
>
> On 30/06/2018 05:46 μμ, dimitris plakas wrote:
> > I am new to Pyspark and want to initialize a new empty dataframe with
> > sqlContext() with two columns ("Column1", "Column2"), and i want to
> > append rows dynamically in a for loop.
> > Is there any way to achieve this?
> >
> > Thank you in advance.
>
> --
> Apostolos N. Papadopoulos, Associate Professor
> Department of Informatics
> Aristotle University of Thessaloniki
> Thessaloniki, GREECE
> tel: ++0030312310991918
> email: papad...@csd.auth.gr
> twitter: @papadopoulos_ap
> web: http://delab.csd.auth.gr/~apostol
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>

-- 
Shmuel Blitz
Big Data Developer
Email: shmuel.bl...@similarweb.com
www.similarweb.com






Re: spark-shell gets stuck in ACCEPTED state forever when run in YARN client mode.

2018-07-08 Thread kant kodali
Hi,

It's on a local MacBook Pro with 16GB RAM, a 512GB disk, and 8 vCPUs! I am not
running any code, since I can't even spawn spark-shell with yarn as master, as
described in my previous email. I just want to run a simple word count using
yarn as master.

Thanks!

Below is the resource manager log once again if that helps


2018-07-08 07:23:23,343 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
Application added - appId: application_1531059242261_0001 user: xxx
leaf-queue of parent: root #applications: 1

2018-07-08 07:23:23,344 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
Accepted application application_1531059242261_0001 from user: xxx, in
queue: default

2018-07-08 07:23:23,350 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
application_1531059242261_0001 State change from SUBMITTED to ACCEPTED on
event=APP_ACCEPTED

2018-07-08 07:23:23,370 INFO
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService:
Registering app attempt : appattempt_1531059242261_0001_01

2018-07-08 07:23:23,370 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
appattempt_1531059242261_0001_01 State change from NEW to SUBMITTED

2018-07-08 07:23:23,382 WARN
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
maximum-am-resource-percent is insufficient to start a single application in
queue, it is likely set too low. skipping enforcement to allow at least one
application to start

2018-07-08 07:23:23,382 WARN
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
maximum-am-resource-percent is insufficient to start a single application in
queue for user, it is likely set too low. skipping enforcement to allow at
least one application to start

2018-07-08 07:23:23,382 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
Application application_1531059242261_0001 from user: xxx activated in
queue: default

2018-07-08 07:23:23,382 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
Application added - appId: application_1531059242261_0001 user:
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue$User@
476750cd, leaf-queue: default #user-pending-applications: 0
#user-active-applications: 1 #queue-pending-applications: 0
#queue-active-applications: 1

2018-07-08 07:23:23,382 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
Added Application Attempt appattempt_1531059242261_0001_01 to scheduler
from user xxx in queue default

2018-07-08 07:23:23,386 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
appattempt_1531059242261_0001_01 State change from SUBMITTED to
SCHEDULED


Re: spark-shell gets stuck in ACCEPTED state forever when run in YARN client mode.

2018-07-08 Thread Marco Mistroni
Are you running on EMR? Have you checked the EMR logs?
I was in a similar situation where a job was stuck in ACCEPTED and then it
died. It turned out to be an issue with my code when running with huge data.
Perhaps try to gradually reduce the load until it works, and then start from
there?
Not a huge help, but I followed the same approach when my job was stuck on
ACCEPTED. Hth

On Sun, Jul 8, 2018, 2:59 PM kant kodali  wrote:

> Hi All,
>
> I am trying to run a simple word count using YARN as a cluster manager.  I
> am currently using Spark 2.3.1 and Apache hadoop 2.7.3.  When I spawn
> spark-shell like below it gets stuck in ACCEPTED stated forever.
>
> ./bin/spark-shell --master yarn --deploy-mode client
>
>
> I set my log4j.properties in SPARK_HOME/conf to TRACE
>
>  queue: "default" name: "Spark shell" host: "N/A" rpc_port: -1
> yarn_application_state: ACCEPTED trackingUrl: "
> http://Kants-MacBook-Pro-2.local:8088/proxy/application_1531056583425_0001/;
> diagnostics: "" startTime: 1531056632496 finishTime: 0
> final_application_status: APP_UNDEFINED app_resource_Usage {
> num_used_containers: 0 num_reserved_containers: 0 used_resources { memory:
> 0 virtual_cores: 0 } reserved_resources { memory: 0 virtual_cores: 0 }
> needed_resources { memory: 0 virtual_cores: 0 } memory_seconds: 0
> vcore_seconds: 0 } originalTrackingUrl: "N/A" currentApplicationAttemptId {
> application_id { id: 1 cluster_timestamp: 1531056583425 } attemptId: 1 }
> progress: 0.0 applicationType: "SPARK" }}
>
> 18/07/08 06:32:22 INFO Client: Application report for
> application_1531056583425_0001 (state: ACCEPTED)
>
> 18/07/08 06:32:22 DEBUG Client:
>
> client token: N/A
>
> diagnostics: N/A
>
> ApplicationMaster host: N/A
>
> ApplicationMaster RPC port: -1
>
> queue: default
>
> start time: 1531056632496
>
> final status: UNDEFINED
>
> tracking URL:
> http://xxx-MacBook-Pro-2.local:8088/proxy/application_1531056583425_0001/
>
> user: xxx
>
>
>
> 18/07/08 06:32:20 DEBUG Client:
>
> client token: N/A
>
> diagnostics: N/A
>
> ApplicationMaster host: N/A
>
> ApplicationMaster RPC port: -1
>
> queue: default
>
> start time: 1531056632496
>
> final status: UNDEFINED
>
> tracking URL:
> http://Kants-MacBook-Pro-2.local:8088/proxy/application_1531056583425_0001/
>
> user: kantkodali
>
>
> 18/07/08 06:32:21 TRACE ProtobufRpcEngine: 1: Call -> /0.0.0.0:8032:
> getApplicationReport {application_id { id: 1 cluster_timestamp:
> 1531056583425 }}
>
> 18/07/08 06:32:21 DEBUG Client: IPC Client (1608805714) connection to /
> 0.0.0.0:8032 from kantkodali sending #136
>
> 18/07/08 06:32:21 DEBUG Client: IPC Client (1608805714) connection to /
> 0.0.0.0:8032 from kantkodali got value #136
>
> 18/07/08 06:32:21 DEBUG ProtobufRpcEngine: Call: getApplicationReport took
> 1ms
>
> 18/07/08 06:32:21 TRACE ProtobufRpcEngine: 1: Response <- /0.0.0.0:8032:
> getApplicationReport {application_report { applicationId { id: 1
> cluster_timestamp: 1531056583425 } user: "xxx" queue: "default" name:
> "Spark shell" host: "N/A" rpc_port: -1 yarn_application_state: ACCEPTED
> trackingUrl: "
> http://xxx-MacBook-Pro-2.local:8088/proxy/application_1531056583425_0001/;
> diagnostics: "" startTime: 1531056632496 finishTime: 0
> final_application_status: APP_UNDEFINED app_resource_Usage {
> num_used_containers: 0 num_reserved_containers: 0 used_resources { memory:
> 0 virtual_cores: 0 } reserved_resources { memory: 0 virtual_cores: 0 }
> needed_resources { memory: 0 virtual_cores: 0 } memory_seconds: 0
> vcore_seconds: 0 } originalTrackingUrl: "N/A" currentApplicationAttemptId {
> application_id { id: 1 cluster_timestamp: 1531056583425 } attemptId: 1 }
> progress: 0.0 applicationType: "SPARK" }}
>
> 18/07/08 06:32:21 INFO Client: Application report for
> application_1531056583425_0001 (state: ACCEPTED)
>
>
> I have read this link
> 
>  and
> here are the conf files that are different from default settings
>
>
> *yarn-site.xml*
>
>
> 
>
>
> 
>
> yarn.nodemanager.aux-services
>
> mapreduce_shuffle
>
> 
>
>
> 
>
> yarn.nodemanager.resource.memory-mb
>
> 16384
>
> 
>
>
> 
>
>yarn.scheduler.minimum-allocation-mb
>
>256
>
> 
>
>
> 
>
>yarn.scheduler.maximum-allocation-mb
>
>8192
>
> 
>
>
>
>
>yarn.nodemanager.resource.cpu-vcores
>
>8
>
>
>
>
> 
>
> *core-site.xml*
>
>
> 
>
> 
>
> fs.defaultFS
>
> hdfs://localhost:9000
>
> 
>
> 
>
> *hdfs-site.xml*
>
>
> 
>
> 
>
> dfs.replication
>
> 1
>
> 
>
> 
>
>
> you can imagine every other config remains untouched(so everything else
> has default settings) Finally, I have also tried to see if there any clues
> in resource manager logs but they dont seem to be helpful in terms of
> fixing the issue however I am newbie to yarn so please let me know if I
> missed out on something.
>
>
>
> 2018-07-08 06:54:57,345 INFO
> 

spark-shell gets stuck in ACCEPTED state forever when run in YARN client mode.

2018-07-08 Thread kant kodali
Hi All,

I am trying to run a simple word count using YARN as a cluster manager. I am
currently using Spark 2.3.1 and Apache Hadoop 2.7.3. When I spawn spark-shell
like below, it gets stuck in the ACCEPTED state forever.

./bin/spark-shell --master yarn --deploy-mode client


I set my log4j.properties in SPARK_HOME/conf to TRACE

 queue: "default" name: "Spark shell" host: "N/A" rpc_port: -1
yarn_application_state: ACCEPTED trackingUrl: "
http://Kants-MacBook-Pro-2.local:8088/proxy/application_1531056583425_0001/;
diagnostics: "" startTime: 1531056632496 finishTime: 0
final_application_status: APP_UNDEFINED app_resource_Usage {
num_used_containers: 0 num_reserved_containers: 0 used_resources { memory:
0 virtual_cores: 0 } reserved_resources { memory: 0 virtual_cores: 0 }
needed_resources { memory: 0 virtual_cores: 0 } memory_seconds: 0
vcore_seconds: 0 } originalTrackingUrl: "N/A" currentApplicationAttemptId {
application_id { id: 1 cluster_timestamp: 1531056583425 } attemptId: 1 }
progress: 0.0 applicationType: "SPARK" }}

18/07/08 06:32:22 INFO Client: Application report for
application_1531056583425_0001 (state: ACCEPTED)

18/07/08 06:32:22 DEBUG Client:

client token: N/A

diagnostics: N/A

ApplicationMaster host: N/A

ApplicationMaster RPC port: -1

queue: default

start time: 1531056632496

final status: UNDEFINED

tracking URL:
http://xxx-MacBook-Pro-2.local:8088/proxy/application_1531056583425_0001/

user: xxx



18/07/08 06:32:20 DEBUG Client:

client token: N/A

diagnostics: N/A

ApplicationMaster host: N/A

ApplicationMaster RPC port: -1

queue: default

start time: 1531056632496

final status: UNDEFINED

tracking URL:
http://Kants-MacBook-Pro-2.local:8088/proxy/application_1531056583425_0001/

user: kantkodali


18/07/08 06:32:21 TRACE ProtobufRpcEngine: 1: Call -> /0.0.0.0:8032:
getApplicationReport {application_id { id: 1 cluster_timestamp:
1531056583425 }}

18/07/08 06:32:21 DEBUG Client: IPC Client (1608805714) connection to /
0.0.0.0:8032 from kantkodali sending #136

18/07/08 06:32:21 DEBUG Client: IPC Client (1608805714) connection to /
0.0.0.0:8032 from kantkodali got value #136

18/07/08 06:32:21 DEBUG ProtobufRpcEngine: Call: getApplicationReport took
1ms

18/07/08 06:32:21 TRACE ProtobufRpcEngine: 1: Response <- /0.0.0.0:8032:
getApplicationReport {application_report { applicationId { id: 1
cluster_timestamp: 1531056583425 } user: "xxx" queue: "default" name:
"Spark shell" host: "N/A" rpc_port: -1 yarn_application_state: ACCEPTED
trackingUrl: "
http://xxx-MacBook-Pro-2.local:8088/proxy/application_1531056583425_0001/;
diagnostics: "" startTime: 1531056632496 finishTime: 0
final_application_status: APP_UNDEFINED app_resource_Usage {
num_used_containers: 0 num_reserved_containers: 0 used_resources { memory:
0 virtual_cores: 0 } reserved_resources { memory: 0 virtual_cores: 0 }
needed_resources { memory: 0 virtual_cores: 0 } memory_seconds: 0
vcore_seconds: 0 } originalTrackingUrl: "N/A" currentApplicationAttemptId {
application_id { id: 1 cluster_timestamp: 1531056583425 } attemptId: 1 }
progress: 0.0 applicationType: "SPARK" }}

18/07/08 06:32:21 INFO Client: Application report for
application_1531056583425_0001 (state: ACCEPTED)


I have read this link

and
here are the conf files that are different from default settings


*yarn-site.xml*







<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>16384</value>
</property>

<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>256</value>
</property>

<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value>
</property>

<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>8</value>
</property>




*core-site.xml*






<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>





*hdfs-site.xml*






<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>






You can imagine every other config remains untouched (so everything else has
default settings). Finally, I have also tried to see if there are any clues in
the resource manager logs, but they don't seem to be helpful in terms of fixing
the issue; however, I am a newbie to YARN, so please let me know if I missed
something.



2018-07-08 06:54:57,345 INFO
org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Allocated
new applicationId: 1

2018-07-08 06:55:09,413 WARN
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: The specific
max attempts: 0 for application: 1 is invalid, because it is out of the
range [1, 2]. Use the global max attempts instead.

2018-07-08 06:55:09,414 INFO
org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Application
with id 1 submitted by user xxx

2018-07-08 06:55:09,415 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Storing
application with id application_1531058076308_0001

2018-07-08 06:55:09,416 INFO
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=kantkodali