[ANNOUNCE] Apache Toree 0.2.0-incubating Released

2018-08-15 Thread Luciano Resende
Apache Toree is a kernel for the Jupyter Notebook platform providing
interactive and remote access to Apache Spark.

The Apache Toree community is pleased to announce the release of Apache
Toree 0.2.0-incubating, which provides various bug fixes and the following
enhancements:

   * Support Apache Spark 2.x codebase including Spark 2.2.2
   * Enable Toree to run in Yarn cluster mode
   * Create spark context lazily to avoid long startup times for the kernel
   * Proper cleanup of temporary files/directories upon kernel shutdown
   * %AddJAR now supports loading jars from HDFS (see the example after this list)
   * %AddDEP now defaults to the default configuration
   * Cell interrupt now cancels running Spark jobs and works with background
processes
   * Support a configurable alternative interrupt signal via the
--alternate-sigint command-line option
   * Interpreters now have the ability to send results other than text/plain
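
As a quick illustration, the HDFS support in %AddJAR can be exercised from a
notebook cell as follows (a sketch; the namenode host and jar path below are
hypothetical):

   %AddJar hdfs://namenode:8020/user/shared/jars/my-library.jar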

For more information about Apache Toree and to download the latest release,
go to:

   https://toree.incubator.apache.org/

For more information on how to use Apache Toree please visit our
documentation page:

   https://toree.incubator.apache.org/docs/current/user/quick-start/

-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


JdbcRDD - schema always resolved as nullable=true

2018-08-15 Thread Subhash Sriram
Hi Spark Users,

We do a lot of processing in Spark using data that is in MS SQL server.
Today, I created a DataFrame against a table in SQL Server using the
following:

val dfSql = spark.read.jdbc(connectionString, table, props)

I noticed that every column in the DataFrame showed as nullable=true, even
though many of them are required.

I went hunting in the code, and I found that in JDBCRDD, when it resolves
the schema of a table, it passes in alwaysNullable=true to JdbcUtils,
which forces all columns to resolve as nullable.

https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L62
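
A possible workaround I'm considering (just a sketch, not verified; it only
changes the schema metadata, not how the data is read, and the column names
below are hypothetical) is to re-apply a schema with the desired nullability:

import org.apache.spark.sql.types.StructType

// Columns known to be NOT NULL in SQL Server (hypothetical names)
val requiredCols = Set("id", "created_at")

// Rebuild the schema, flipping nullable off for the required columns
val newSchema = StructType(dfSql.schema.map { f =>
  if (requiredCols.contains(f.name)) f.copy(nullable = false) else f
})

// Re-create the DataFrame with the corrected schema
val dfFixed = spark.createDataFrame(dfSql.rdd, newSchema)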

I don't see a way to change that functionality. Is this by design, or could
it be a bug?

Thanks!
Subhash


Re: Unable to see completed application in Spark 2 history web UI

2018-08-15 Thread Manu Zhang
If you are able to log onto the node where the UI has been launched, then try
`ps aux | grep HistoryServer`; the first column of the output should be the
user.

On Wed, Aug 15, 2018 at 10:26 PM Fawze Abujaber  wrote:

> Thanks Manu. Do you know how I can see which user the UI is running as?
> I'm using Cloudera Manager and I created a user for it called spark, but
> this didn't solve my issue, so I'm trying to find out the user for the
> Spark history UI.
>
> On Wed, Aug 15, 2018 at 5:11 PM Manu Zhang 
> wrote:
>
>> Hi Fawze,
>>
>> A) The file permission is currently hard coded to 770 (
>> https://github.com/apache/spark/blob/branch-2.3/core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala#L287
>> ).
>> B) I think adding all users (including the UI user) to a group like spark will work.
>>
>>
>> On Wed, Aug 15, 2018 at 6:38 PM Fawze Abujaber  wrote:
>>
>>> Hi Manu,
>>>
>>> Thanks for your response.
>>>
>>> Yes, I see, but I'm still interested to know how I can see these
>>> applications from the Spark history UI.
>>>
>>> How can I know which user I'm logged in as when I'm navigating the
>>> Spark history UI?
>>>
>>> The Spark process is running as cloudera-scm, and the event logs in the
>>> spark2history folder on HDFS are written with the username of whoever runs
>>> the application, with group spark (770 permissions).
>>>
>>> I'm interested to see if I can force these logs to be written with 774
>>> or 775 permissions, or to find another solution that enables R&D or anyone
>>> else to investigate their application logs using the UI.
>>>
>>> For example: can I use a Spark conf such as spark.eventLog.permissions=755?
>>>
>>> The 2 options I see here:
>>>
>>> A) Find a way to enforce these logs to be written with other permissions.
>>>
>>> B) Find the user that the UI runs as, and create LDAP groups and users
>>> that can handle this.
>>>
>>> For example, create a group called spark, create the user that the UI
>>> runs as, and add this user to the spark group.
>>> I'm not sure this option will work, as I don't know whether these steps
>>> authenticate against LDAP.
>>>
>>
>
> --
> Take Care
> Fawze Abujaber
>


java.lang.UnsupportedOperationException: No Encoder found for Set[String]

2018-08-15 Thread V0lleyBallJunki3
Hello,
  I am using Spark 2.2.2 with Scala 2.11.8. I wrote a short program:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[4]").getOrCreate()

case class TestCC(i: Int, ss: Set[String])

import spark.implicits._

val testCCDS = Seq(TestCC(1, Set("SS", "Salil")), TestCC(2, Set("xx",
"XYZ"))).toDS()


I get:
java.lang.UnsupportedOperationException: No Encoder found for Set[String]
- field (class: "scala.collection.immutable.Set", name: "ss")
- root class: "TestCC"
  at
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:632)
  at
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:455)
  at
scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
  at
org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:809)
  at
org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39)
  at
org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:455)
  at
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1$$anonfun$10.apply(ScalaReflection.scala:626)
  at
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1$$anonfun$10.apply(ScalaReflection.scala:614)
  at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)

To the best of my knowledge, implicit support for Set was added in Spark
2.2. Am I missing something?
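
In case it is useful, a workaround I can fall back to (a sketch, untested;
TestCC2 below is a made-up variant of my case class) is a Kryo-based encoder,
or modeling the field as a Seq:

import org.apache.spark.sql.{Encoder, Encoders}

// Option 1: an explicit Kryo encoder for the whole case class; the Dataset
// is then stored as a single binary column rather than typed columns.
implicit val testCCEncoder: Encoder[TestCC] = Encoders.kryo[TestCC]
val dsKryo = Seq(TestCC(1, Set("SS", "Salil"))).toDS()

// Option 2: model the field as Seq[String], which has built-in encoder
// support, and convert to a Set where set semantics are needed.
case class TestCC2(i: Int, ss: Seq[String])
val dsSeq = Seq(TestCC2(1, Seq("SS", "Salil"))).toDS()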






from_json schema order

2018-08-15 Thread Brandon Geise
Hi,

Can someone confirm whether ordering matters between the fields in the
schema and the keys in the underlying JSON string?
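
For concreteness, a minimal sketch of what I mean (assuming a SparkSession
named spark; the field order in the schema differs from the key order in the
JSON):

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
import spark.implicits._

val schema = new StructType()
  .add("b", StringType)   // listed first in the schema...
  .add("a", IntegerType)

val df = Seq("""{"a": 1, "b": "x"}""").toDF("json")  // ...but second in the JSON

df.select(from_json($"json", schema) as "data").show()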

 

Thanks,
Brandon


Dynamic Allocation not removing executors

2018-08-15 Thread Maximiliano Patricio Méndez
Hi,

I found an issue when trying to use dynamic allocation in 2.3.1, where the
driver does not remove idle executors under some circumstances.

For the first instance of this happening, it seems that a change introduced
in 2.2.1/2.3.0 (SPARK-21656) added a check in the
ExecutorAllocationManager that causes the first remove request to be
ignored if there are no pending tasks and the initialExecutors property is
set != 0 (the initializing flag prevents the numExecutorsTarget number from
being changed).
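
To illustrate, here is a self-contained paraphrase of that check (my own
sketch, not the actual Spark source):

// While `initializing` is true (no task has been scheduled yet), the target
// number of executors is frozen, so idle executors cannot be released.
def updatedTarget(initializing: Boolean,
                  maxNeeded: Int,
                  currentTarget: Int,
                  minExecutors: Int): Int =
  if (initializing) currentTarget  // target frozen until the first task arrives
  else math.max(maxNeeded, minExecutors)

// With initialExecutors = 4 and no job ever submitted:
updatedTarget(initializing = true, maxNeeded = 0, currentTarget = 4, minExecutors = 0)
// => 4, matching "Not removing idle executor 3 because there are only 4
// executor(s) left (number of executor target 4)" in the logs below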

My dynamic allocation conf:
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.initialExecutors 4
spark.dynamicAllocation.minExecutors 0
spark.dynamicAllocation.maxExecutors 100

This normalizes after the first submitted job, but may leave up to 4
executors (in our case) idle without being removed if no job is ever
submitted.

Logs:
18/08/15 13:08:44 DEBUG ExecutorAllocationManager: Starting idle timer for
3 because there are no more tasks scheduled to run on the executor (to
expire in 60 seconds)
18/08/15 13:08:44 INFO ExecutorAllocationManager: New executor 3 has
registered (new total is 1)
18/08/15 13:08:45 DEBUG ExecutorAllocationManager: Starting idle timer for
0 because there are no more tasks scheduled to run on the executor (to
expire in 60 seconds)
18/08/15 13:08:45 INFO ExecutorAllocationManager: New executor 0 has
registered (new total is 2)
18/08/15 13:08:45 DEBUG ExecutorAllocationManager: Starting idle timer for
1 because there are no more tasks scheduled to run on the executor (to
expire in 60 seconds)
18/08/15 13:08:45 INFO ExecutorAllocationManager: New executor 1 has
registered (new total is 3)
18/08/15 13:08:46 DEBUG ExecutorAllocationManager: Starting idle timer for
2 because there are no more tasks scheduled to run on the executor (to
expire in 60 seconds)
18/08/15 13:08:46 INFO ExecutorAllocationManager: New executor 2 has
registered (new total is 4)
18/08/15 13:09:44 INFO ExecutorAllocationManager: Request to remove
executorIds: 3
18/08/15 13:09:44 DEBUG ExecutorAllocationManager: Not removing idle
executor 3 because there are only 4 executor(s) left (number of executor
target 4)
18/08/15 13:09:45 DEBUG ExecutorAllocationManager: Lowering target number
of executors to 0 (previously 4) because not all requested executors are
actually needed
18/08/15 13:09:45 INFO ExecutorAllocationManager: Request to remove
executorIds: 0
18/08/15 13:09:45 INFO ExecutorAllocationManager: Removing executor 0
because it has been idle for 60 seconds (new desired total will be 3)
18/08/15 13:09:45 INFO ExecutorAllocationManager: Request to remove
executorIds: 1
18/08/15 13:09:45 INFO ExecutorAllocationManager: Removing executor 1
because it has been idle for 60 seconds (new desired total will be 2)
18/08/15 13:09:46 INFO ExecutorAllocationManager: Existing executor 0 has
been removed (new total is 3)
18/08/15 13:09:46 DEBUG ExecutorAllocationManager: Executor 0 is no longer
pending to be removed (1 left)
18/08/15 13:09:46 INFO ExecutorAllocationManager: Request to remove
executorIds: 2
18/08/15 13:09:46 INFO ExecutorAllocationManager: Removing executor 2
because it has been idle for 60 seconds (new desired total will be 1)
18/08/15 13:09:46 INFO ExecutorAllocationManager: Existing executor 1 has
been removed (new total is 2)
18/08/15 13:09:46 DEBUG ExecutorAllocationManager: Executor 1 is no longer
pending to be removed (1 left)
18/08/15 13:09:46 INFO ExecutorAllocationManager: Existing executor 2 has
been removed (new total is 1)
18/08/15 13:09:46 DEBUG ExecutorAllocationManager: Executor 2 is no longer
pending to be removed (0 left)


Re: from_json function

2018-08-15 Thread Maxim Gekk
Hello Denis,

The from_json function supports only the fail fast mode, see:
https://github.com/apache/spark/blob/e2ab7deae76d3b6f41b9ad4d0ece14ea28db40ce/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L568

Your settings "mode" -> "PERMISSIVE" will be overwritten
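
As a possible workaround (a sketch, untested here; it assumes a SparkSession
named spark), the DataFrameReader path does honor PERMISSIVE mode and
columnNameOfCorruptRecord when parsing a Dataset[String]:

import org.apache.spark.sql.types._
import spark.implicits._

val schema = new StructType()
  .add("number", IntegerType)
  .add("_corrupt_record", StringType)

val parsed = spark.read
  .schema(schema)
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .json(Seq("{'number': 1}", "{'number': }").toDS())  // json(Dataset[String])

parsed.show()  // the malformed row should land in _corrupt_record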

On Wed, Aug 15, 2018 at 4:52 PM dbolshak  wrote:

> Hello community,
>
> I can not manage to run from_json method with "columnNameOfCorruptRecord"
> option.
> ```
> import org.apache.spark.sql.functions._
>
> val data = Seq(
>   "{'number': 1}",
>   "{'number': }"
> )
>
> val schema = new StructType()
>   .add($"number".int)
>   .add($"_corrupt_record".string)
>
> val sourceDf = data.toDF("column")
>
> val jsonedDf = sourceDf
>   .select(from_json(
> $"column",
> schema,
> Map("mode" -> "PERMISSIVE", "columnNameOfCorruptRecord" ->
> "_corrupt_record")
>   ) as "data").selectExpr("data.number", "data._corrupt_record")
>
>   jsonedDf.show()
> ```
> Can anybody help me get `_corrupt_record` to be non-empty?
>
> Thanks in advance.
>
>
>
>

-- 

Maxim Gekk

Technical Solutions Lead

Databricks Inc.

maxim.g...@databricks.com

databricks.com


[K8S] Spark initContainer custom bootstrap support for Spark master

2018-08-15 Thread Li Gao
Hi,

We've noticed that on the latest master (not the Spark 2.3.1 branch), the
support for the Kubernetes initContainer is no longer there. What would be
the path forward if we need to run custom bootstrap actions (i.e.,
additional scripts) before the driver/executor containers enter the running
state?

Thanks,
Li


Shuffle uses Direct Memory Buffer even after setting "spark.shuffle.io.preferDirectBufs = false"

2018-08-15 Thread Vaibhav Kulkarni
Hi,


I am using standalone Spark 2.3 and have a question regarding shuffle.
Going by the documentation, the default shuffle behaviour is to use direct
memory buffers. But even after I set the following parameter, I notice that
shuffle still uses direct memory buffers.


spark.shuffle.io.preferDirectBufs = false


Is this a bug? How can I disable the use of direct memory for shuffle?


Thanks,

Vaibhav Kulkarni


from_json function

2018-08-15 Thread dbolshak
Hello community,

I cannot manage to get the from_json method to work with the
"columnNameOfCorruptRecord" option.
```
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._   // for StructType (import added for completeness)
import spark.implicits._              // assumes an existing SparkSession named spark

val data = Seq(
  "{'number': 1}",
  "{'number': }"
)

val schema = new StructType()
  .add($"number".int)
  .add($"_corrupt_record".string)

val sourceDf = data.toDF("column")

val jsonedDf = sourceDf
  .select(from_json(
$"column",
schema,
Map("mode" -> "PERMISSIVE", "columnNameOfCorruptRecord" ->
"_corrupt_record")
  ) as "data").selectExpr("data.number", "data._corrupt_record")

  jsonedDf.show()
```
Can anybody help me get `_corrupt_record` to be non-empty?

Thanks in advance.






Re: Unable to see completed application in Spark 2 history web UI

2018-08-15 Thread Fawze Abujaber
Thanks Manu. Do you know how I can see which user the UI is running as?
I'm using Cloudera Manager and I created a user for it called spark, but
this didn't solve my issue, so I'm trying to find out the user for the
Spark history UI.

On Wed, Aug 15, 2018 at 5:11 PM Manu Zhang  wrote:

> Hi Fawze,
>
> A) The file permission is currently hard coded to 770 (
> https://github.com/apache/spark/blob/branch-2.3/core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala#L287
> ).
> B) I think adding all users (including the UI user) to a group like spark will work.
>
>
> On Wed, Aug 15, 2018 at 6:38 PM Fawze Abujaber  wrote:
>
>> Hi Manu,
>>
>> Thanks for your response.
>>
>> Yes, I see, but I'm still interested to know how I can see these
>> applications from the Spark history UI.
>>
>> How can I know which user I'm logged in as when I'm navigating the
>> Spark history UI?
>>
>> The Spark process is running as cloudera-scm, and the event logs in the
>> spark2history folder on HDFS are written with the username of whoever runs
>> the application, with group spark (770 permissions).
>>
>> I'm interested to see if I can force these logs to be written with 774
>> or 775 permissions, or to find another solution that enables R&D or anyone
>> else to investigate their application logs using the UI.
>>
>> For example: can I use a Spark conf such as spark.eventLog.permissions=755?
>>
>> The 2 options I see here:
>>
>> A) Find a way to enforce these logs to be written with other permissions.
>>
>> B) Find the user that the UI runs as, and create LDAP groups and users
>> that can handle this.
>>
>> For example, create a group called spark, create the user that the UI
>> runs as, and add this user to the spark group.
>> I'm not sure this option will work, as I don't know whether these steps
>> authenticate against LDAP.
>>
>

-- 
Take Care
Fawze Abujaber


Re: Unable to see completed application in Spark 2 history web UI

2018-08-15 Thread Manu Zhang
Hi Fawze,

A) The file permission is currently hard coded to 770 (
https://github.com/apache/spark/blob/branch-2.3/core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala#L287
).
B) I think adding all users (including the UI user) to a group like spark will work.


On Wed, Aug 15, 2018 at 6:38 PM Fawze Abujaber  wrote:

> Hi Manu,
>
> Thanks for your response.
>
> Yes, I see, but I'm still interested to know how I can see these
> applications from the Spark history UI.
>
> How can I know which user I'm logged in as when I'm navigating the
> Spark history UI?
>
> The Spark process is running as cloudera-scm, and the event logs in the
> spark2history folder on HDFS are written with the username of whoever runs
> the application, with group spark (770 permissions).
>
> I'm interested to see if I can force these logs to be written with 774
> or 775 permissions, or to find another solution that enables R&D or anyone
> else to investigate their application logs using the UI.
>
> For example: can I use a Spark conf such as spark.eventLog.permissions=755?
>
> The 2 options I see here:
>
> A) Find a way to enforce these logs to be written with other permissions.
>
> B) Find the user that the UI runs as, and create LDAP groups and users
> that can handle this.
>
> For example, create a group called spark, create the user that the UI
> runs as, and add this user to the spark group.
> I'm not sure this option will work, as I don't know whether these steps
> authenticate against LDAP.
>


Java API for statistics of spark job running on yarn

2018-08-15 Thread Serkan TAS
Hi all,

I am facing an issue with a long-running Spark job on YARN. If a bottleneck
occurs on HDFS and/or Kafka, the active batch count increases immediately.

I am planning to check the active batch count with a Java client and create
alarms for the operations group.

So, is it possible to retrieve the active batch count via any Java API, as
we can see on the monitoring page below?

/proxy/application_1534314004365_0001/streaming/
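
One option, if the REST monitoring API is reachable through the same proxy
(a sketch; I have not verified the exact URL, and the host below is a
placeholder), is to poll the streaming statistics endpoint from JVM code.
The snippet is Scala, but the same call works from Java with any HTTP client:

import scala.io.Source

// Placeholder host/port; the app id is taken from the proxy URL above.
val appId = "application_1534314004365_0001"
val url = s"http://<rm-host>:8088/proxy/$appId/api/v1/applications/$appId/streaming/statistics"

// Returns a JSON document that includes fields such as numActiveBatches,
// which can be compared against a threshold to raise an alarm.
val json = Source.fromURL(url).mkString
println(json)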

Regards,

Serkan






ENERJİSA


serkan@enerjisa.com
www.enerjisa.com.tr



spark driver pod stuck in Waiting: PodInitializing state in Kubernetes

2018-08-15 Thread purna pradeep
I'm running Spark 2.3 jobs on a Kubernetes cluster.

kubectl version

Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.3",
GitCommit:"d2835416544f298c919e2ead3be3d0864b52323b", GitTreeState:"clean",
BuildDate:"2018-02-09T21:51:06Z", GoVersion:"go1.9.4", Compiler:"gc",
Platform:"darwin/amd64"}

Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.3",
GitCommit:"f0efb3cb883751c5ffdbe6d515f3cb4fbe7b7acd", GitTreeState:"clean",
BuildDate:"2017-11-08T18:27:48Z", GoVersion:"go1.8.3", Compiler:"gc",
Platform:"linux/amd64"}



When I run spark-submit against the k8s master, the driver pod gets stuck in
the Waiting: PodInitializing state. In this case I have to manually kill the
driver pod and submit a new job, which then works.

This happens if I submit the jobs almost in parallel, i.e., 5 jobs one
after the other in quick succession.

I'm running Spark jobs on 20 nodes, each with the configuration below.

I ran kubectl describe node on the node where the driver pod is running, and
this is what I got. I do see that resources on the node are overcommitted,
but I expected the Kubernetes scheduler not to schedule a pod if the node's
resources are overcommitted or the node is in Not Ready state. In this case
the node is in Ready state, but I observe the same behaviour when a node is
in "Not Ready" state.



Name:   **

Roles:  worker

Labels: beta.kubernetes.io/arch=amd64

beta.kubernetes.io/os=linux

kubernetes.io/hostname=

node-role.kubernetes.io/worker=true

Annotations:node.alpha.kubernetes.io/ttl=0


volumes.kubernetes.io/controller-managed-attach-detach=true

Taints: 

CreationTimestamp:  Tue, 31 Jul 2018 09:59:24 -0400

Conditions:

  Type            Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----            ------  -----------------                 ------------------                ------                       -------
  OutOfDisk       False   Tue, 14 Aug 2018 09:31:20 -0400   Tue, 31 Jul 2018 09:59:24 -0400   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure  False   Tue, 14 Aug 2018 09:31:20 -0400   Tue, 31 Jul 2018 09:59:24 -0400   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure    False   Tue, 14 Aug 2018 09:31:20 -0400   Tue, 31 Jul 2018 09:59:24 -0400   KubeletHasNoDiskPressure     kubelet has no disk pressure
  Ready           True    Tue, 14 Aug 2018 09:31:20 -0400   Sat, 11 Aug 2018 00:41:27 -0400   KubeletReady                 kubelet is posting ready status. AppArmor enabled

Addresses:

  InternalIP:  *

  Hostname:**

Capacity:

 cpu: 16

 memory:  125827288Ki

 pods:110

Allocatable:

 cpu: 16

 memory:  125724888Ki

 pods:110

System Info:

 Machine ID: *

 System UUID:**

 Boot ID:1493028d-0a80-4f2f-b0f1-48d9b8910e9f

 Kernel Version: 4.4.0-1062-aws

 OS Image:   Ubuntu 16.04.4 LTS

 Operating System:   linux

 Architecture:   amd64

 Container Runtime Version:  docker://Unknown

 Kubelet Version:v1.8.3

 Kube-Proxy Version: v1.8.3

PodCIDR: **

ExternalID:  **

Non-terminated Pods: (11 in total)

  Namespace    Name                                                            CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------    ----                                                            ------------  ----------  ---------------  -------------
  kube-system  calico-node-gj5mb                                               250m (1%)     0 (0%)      0 (0%)           0 (0%)
  kube-system  kube-proxy-                                                     100m (0%)     0 (0%)      0 (0%)           0 (0%)
  kube-system  prometheus-prometheus-node-exporter-9cntq                       100m (0%)     200m (1%)   30Mi (0%)        50Mi (0%)
  logging      elasticsearch-elasticsearch-data-69df997486-gqcwg               400m (2%)     1 (6%)      8Gi (6%)         16Gi (13%)
  logging      fluentd-fluentd-elasticsearch-tj7nd                             200m (1%)     0 (0%)      612Mi (0%)       0 (0%)
  rook         rook-agent-6jtzm                                                0 (0%)        0 (0%)      0 (0%)           0 (0%)
  rook         rook-ceph-osd-10-6-42-250.accel.aws-cardda.cb4good.com-gwb8j    0 (0%)        0 (0%)      0 (0%)           0 (0%)
  spark        accelerate-test-5-a3bfb8a597e83d459193a183e17f13b5-exec-1       2 (12%)       0 (0%)      10Gi (8%)        12Gi (10%)
  spark        accelerate-testing-1-8ed0482f3bfb3c0a83da30bb7d433dff-exec-5    2 (12%)       0 (0%)      10Gi (8%)        12Gi

Re: Unable to see completed application in Spark 2 history web UI

2018-08-15 Thread Fawze Abujaber
Hi Manu,

Thanks for your response.

Yes, I see, but I'm still interested to know how I can see these applications
from the Spark history UI.

How can I know which user I'm logged in as when I'm navigating the Spark
history UI?

The Spark process is running as cloudera-scm, and the event logs in the
spark2history folder on HDFS are written with the username of whoever runs
the application, with group spark (770 permissions).

I'm interested to see if I can force these logs to be written with 774 or
775 permissions, or to find another solution that enables R&D or anyone else
to investigate their application logs using the UI.

For example: can I use a Spark conf such as spark.eventLog.permissions=755?

The 2 options I see here:

A) Find a way to enforce these logs to be written with other permissions.

B) Find the user that the UI runs as, and create LDAP groups and users that
can handle this.

For example, create a group called spark, create the user that the UI runs
as, and add this user to the spark group.
I'm not sure this option will work, as I don't know whether these steps
authenticate against LDAP.


Re: Unable to see completed application in Spark 2 history web UI

2018-08-15 Thread Manu Zhang
Hi Fawze,

In Spark 2.3, the HistoryServer checks file permissions when reading event
logs written by your applications (see
https://issues.apache.org/jira/browse/SPARK-20172). With file permissions of
770, the HistoryServer is not permitted to read the event log. That's why
you were able to see applications once you changed the file permissions to 777.

Regards,
Manu Zhang

On Mon, Aug 13, 2018 at 4:53 PM Fawze Abujaber  wrote:

> Hi Guys,
>
> Any help here?
>
> On Wed, Aug 8, 2018 at 7:56 AM Fawze Abujaber  wrote:
>
>> Hello Community,
>>
>> I'm using Spark 2.3 and Spark 1.6.0 in my cluster with Cloudera
>> distribution 5.13.0.
>>
>> Both are configured to run on YARN, but I'm unable to see completed
>> applications in the Spark2 history server, while in Spark 1.6.0 I could.
>>
>> 1) I checked the HDFS permissions for both folders and both have the same
>> permissions.
>>
>> drwxrwxrwt   - cloudera-scm spark  0 2018-08-08 00:46
>> /user/spark/applicationHistory
>> drwxrwxrwt   - cloudera-scm spark  0 2018-08-08 00:46
>> /user/spark/spark2ApplicationHistory
>>
>> The application files themselves have permissions 770 in both.
>>
>> -rwxrwx---   3  fawzea spark 4743751 2018-08-07 23:32
>> /user/spark/spark2ApplicationHistory/application_1527404701551_672816_1
>> -rwxrwx---   3  fawzea spark   134315 2018-08-08 00:41
>> /user/spark/applicationHistory/application_1527404701551_673359_1
>>
>> 2) No error in the Spark2 history server log.
>>
>> 3) Compared the configurations between Spark 1.6 and Spark 2.3 (system
>> user, event log settings, etc.); all look the same.
>>
>> 4) Once I changed the permissions of the above Spark2 application logs to
>> 777, I was able to see the applications in the Spark2 history server UI.
>>
>> Tried to figure out whether the two Spark UIs run as different users, but
>> was unable to find out.
>>
>> Anyone who ran into this issue and solved it?
>>
>> Thanks in advance.
>>
>>
>> --
>> Take Care
>> Fawze Abujaber
>>
>
>
> --
> Take Care
> Fawze Abujaber
>