Spark job terminated without any errors

2018-05-18 Thread karthikjay
We have created multiple Spark jobs (as fat JARs) and run them with
spark-submit under nohup. Most of the jobs quit after a while. We combed
through the logs for failures, but the only message that gave us any clue
was "18/05/07 18:31:38 INFO Worker: Executor app-20180507180436-0016/0
finished with state KILLED exitStatus 143". Any help appreciated.

Worker and master logs -
https://gist.github.com/karthik-arris/57804a2b80a8a89754578ab308084b48#file-master-log







Re: OneHotEncoderEstimator - java.lang.NoSuchMethodError: org.apache.spark.sql.Dataset.withColumns

2018-05-18 Thread Bryan Cutler
The example works for me; please check your environment and ensure you are
using Spark 2.3.0, where OneHotEncoderEstimator was introduced.
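
A quick way to confirm which Spark is actually on the classpath at runtime (a
minimal sketch, assuming a spark-shell or an application with an existing
SparkSession named spark):

// Dataset.withColumns and OneHotEncoderEstimator were only added in Spark 2.3.0,
// so a NoSuchMethodError here usually means an older spark-sql jar is being picked up.
println(spark.version)  // should print 2.3.0 or later
// Which jar the Dataset class was loaded from (getCodeSource may be null for bootstrap classes)
println(classOf[org.apache.spark.sql.Dataset[_]]
  .getProtectionDomain.getCodeSource.getLocation)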

On Fri, May 18, 2018 at 12:57 AM, Matteo Cossu  wrote:

> Hi,
>
> are you sure Dataset has a method withColumns?
>
> On 15 May 2018 at 16:58, Mina Aslani  wrote:
>
>> Hi,
>>
>> I get the below error when I try to run the OneHotEncoderEstimator example:
>> https://github.com/apache/spark/blob/b74366481cc87490adf4e69d26389ec737548c15/examples/src/main/java/org/apache/spark/examples/ml/JavaOneHotEncoderEstimatorExample.java#L67
>>
>> which corresponds to this line of the code:
>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala#L348
>>
>> Exception in thread "streaming-job-executor-0" java.lang.NoSuchMethodError:
>> org.apache.spark.sql.Dataset.withColumns(Lscala/collection/Seq;Lscala/collection/Seq;)Lorg/apache/spark/sql/Dataset;
>>     at org.apache.spark.ml.feature.OneHotEncoderModel.transform(OneHotEncoderEstimator.scala:348)
>>
>>
>> Can you please let me know what the cause is? Is there any workaround?
>>
>> Judging by the example in the repo, it looks like it used to run fine at
>> some point, but now it is not working. Also, OneHotEncoder is deprecated.
>>
>> I really appreciate your quick response.
>>
>> Regards,
>> Mina
>>
>>
>


RDD does not have sc error

2018-05-18 Thread Chao Fang
Hello,

 

Today I used the SparkSession.read.format("HBASETABLE").option("zk",
"zkaddress").load() API to create a dataset from an HBase data source, and I
wrote code that extends BaseRelation and PrunedFilteredScan to provide the
logical plan for this HBase data source.

 

I use an InputFormat to create a NewHadoopRDD, each partition of which maps to
a region of this HBase table. Within each partition, I transform each HBase
Result into a Row.
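
For reference, a rough sketch of the kind of relation described above
(illustrative only, not the poster's actual code; the class name, the
zkAddress handling and the resultToRow helper are assumptions):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.StructType

class HBaseRelation(table: String, zkAddress: String)
                   (@transient val sqlContext: SQLContext)
  extends BaseRelation with PrunedFilteredScan {

  // Column-family-to-schema mapping omitted for brevity
  override def schema: StructType = ???

  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] = {
    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", zkAddress)
    conf.set(TableInputFormat.INPUT_TABLE, table)

    // One partition per HBase region
    val hbaseRDD = sqlContext.sparkContext.newAPIHadoopRDD(
      conf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    // Note: referencing another RDD, Dataset or the SparkContext inside this
    // closure would serialize it to the executors, where its _sc is null,
    // producing the SPARK-5063 error quoted further down.
    hbaseRDD.map { case (_, result) => resultToRow(result, requiredColumns) }
  }

  private def resultToRow(result: Result, cols: Array[String]): Row = ???
}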

 

So I think the lineage of this RDD is quite clear, but when I try to take a
peek at the dataset using dataset.first or dataset.count, I get an error.

 

The error tells me that this RDD does not have an sc, with the following message:

private def sc: SparkContext = {
  if (_sc == null) {
    throw new SparkException(
      "RDD transformations and actions can only be invoked by the driver, not inside of other " +
      "transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because " +
      "the values transformation and count action cannot be performed inside of the rdd1.map " +
      "transformation. For more information, see SPARK-5063.")
  }
  _sc
}

 

In addition, I submit my Spark app to a standalone cluster, and it seems that
I did not put a transformation/action inside another transformation, as the
above error message describes.

 

Does anyone have any suggestions? Many thanks.

 

Chao,



Spark on YARN in client-mode: do we need 1 vCore for the AM?

2018-05-18 Thread peay
Hello,

I run a Spark cluster on YARN, and we have a bunch of client-mode applications 
we use for interactive work. Whenever we start one of these, an application 
master container is started.

My understanding is that this is mostly an empty shell, used to request further 
containers or get status from YARN. Is that correct?

spark.yarn.am.cores is 1, and that AM gets one full vCore on the cluster. 
Because I am using DominantResourceCalculator to take vCores into account for 
scheduling, this results in a lot of unused CPU capacity overall because all 
those AMs each block one full vCore. With enough jobs, this adds up quickly.

I am trying to understand if we can work around that -- ideally, by allocating 
fractional vCores (e.g., give 100 millicores to the AM), or by allocating no 
vCores at all for the AM (I am fine with a bit of oversubscription because of 
that).

Any idea on how to avoid blocking so many YARN vCores just for the Spark AMs?

Thanks!

RE: How to Spark can solve this example

2018-05-18 Thread JUNG YOUSUN
How about Structured Streaming with Kafka? It can operate over time windows. 
For more information, see here: 
https://databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html
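
For example, a minimal sketch of reading a Kafka topic and counting events per
key over event-time windows (the bootstrap servers, topic name, schema and
field names below are placeholders, not something from your setup):

// requires the spark-sql-kafka-0-10 artifact on the classpath
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json, window}
import org.apache.spark.sql.types.{StringType, StructType, TimestampType}

val spark = SparkSession.builder.appName("KafkaWindowExample").getOrCreate()

// Placeholder schema for the JSON payload of each event
val eventSchema = new StructType()
  .add("key", StringType)
  .add("k1", StringType)
  .add("eventTime", TimestampType)

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")   // placeholder brokers
  .option("subscribe", "events")                     // placeholder topic
  .load()
  .select(from_json(col("value").cast("string"), eventSchema).as("e"))
  .select("e.*")

// Count events per key over 10-minute event-time windows,
// tolerating data that arrives up to 5 minutes late
val counts = events
  .withWatermark("eventTime", "5 minutes")
  .groupBy(window(col("eventTime"), "10 minutes"), col("key"))
  .count()

val query = counts.writeStream
  .outputMode("update")
  .format("console")
  .start()

query.awaitTermination()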

Sincerely,
Yousun Jeong

From: Matteo Cossu 
Sent: Friday, May 18, 2018 4:51 PM
To: Esa Heikkinen 
Cc: user@spark.apache.org
Subject: Re: How to Spark can solve this example

Hello Esa,
all the steps that you described can be performed with Spark. I don't know 
about CEP, but Spark Streaming should be enough.

Best,

Matteo

On 18 May 2018 at 09:20, Esa Heikkinen wrote:
Hi

I have attached a fictive example (PDF file) about processing event traces 
from data streams (or batch data). I hope the picture in the attachment is 
clear and understandable.

I would be very interested in how best to solve this with Spark, or whether 
it is possible at all. If it is possible, can it be solved, for example, with CEP?

A little explanation: the data processing reads three different, parallel 
streams (or batches): A, B and C. Each of them has events (records) with 
different keys and values (like K1-K4).

I want to find all event traces that have certain dependencies or patterns 
between the streams (or batches). To find a pattern there are three steps:

1)  Search for an event in stream A that has value "X" in K1; if found, store 
it in global data for later use and continue to the next step

2)  Search for an event in stream B that has value A(K1) in K2; if found, 
store it in global data for later use and continue to the next step

3)  Search for an event in stream C that has value A(K1) in K1 and value B(K3) 
in K2; if found, continue to the next step (back to step 1)

If that is not possible with Spark, do you have any idea of tools that can 
solve this?

Best, Esa






RE: How to Spark can solve this example

2018-05-18 Thread Esa Heikkinen
Hello

That is good to hear, but do there exist some good practical (Python or Scala) 
examples? That would help a lot.

I tried to do this with Apache Flink (and its CEP), and it was not such a piece 
of cake.

Best, Esa

From: Matteo Cossu 
Sent: Friday, May 18, 2018 10:51 AM
To: Esa Heikkinen 
Cc: user@spark.apache.org
Subject: Re: How to Spark can solve this example

Hello Esa,
all the steps that you described can be performed with Spark. I don't know 
about CEP, but Spark Streaming should be enough.

Best,

Matteo

On 18 May 2018 at 09:20, Esa Heikkinen wrote:
Hi

I have attached a fictive example (PDF file) about processing event traces 
from data streams (or batch data). I hope the picture in the attachment is 
clear and understandable.

I would be very interested in how best to solve this with Spark, or whether 
it is possible at all. If it is possible, can it be solved, for example, with CEP?

A little explanation: the data processing reads three different, parallel 
streams (or batches): A, B and C. Each of them has events (records) with 
different keys and values (like K1-K4).

I want to find all event traces that have certain dependencies or patterns 
between the streams (or batches). To find a pattern there are three steps:

1)  Search for an event in stream A that has value "X" in K1; if found, store 
it in global data for later use and continue to the next step

2)  Search for an event in stream B that has value A(K1) in K2; if found, 
store it in global data for later use and continue to the next step

3)  Search for an event in stream C that has value A(K1) in K1 and value B(K3) 
in K2; if found, continue to the next step (back to step 1)

If that is not possible with Spark, do you have any idea of tools that can 
solve this?

Best, Esa






Re: OneHotEncoderEstimator - java.lang.NoSuchMethodError: org.apache.spark.sql.Dataset.withColumns

2018-05-18 Thread Matteo Cossu
Hi,

are you sure Dataset has a method withColumns?

On 15 May 2018 at 16:58, Mina Aslani  wrote:

> Hi,
>
> I get the below error when I try to run the OneHotEncoderEstimator example:
> https://github.com/apache/spark/blob/b74366481cc87490adf4e69d26389ec737548c15/examples/src/main/java/org/apache/spark/examples/ml/JavaOneHotEncoderEstimatorExample.java#L67
>
> which corresponds to this line of the code:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala#L348
>
> Exception in thread "streaming-job-executor-0" java.lang.NoSuchMethodError:
> org.apache.spark.sql.Dataset.withColumns(Lscala/collection/Seq;Lscala/collection/Seq;)Lorg/apache/spark/sql/Dataset;
>     at org.apache.spark.ml.feature.OneHotEncoderModel.transform(OneHotEncoderEstimator.scala:348)
>
>
> Can you please let me know what the cause is? Is there any workaround?
>
> Judging by the example in the repo, it looks like it used to run fine at
> some point, but now it is not working. Also, OneHotEncoder is deprecated.
>
> I really appreciate your quick response.
>
> Regards,
> Mina
>
>


Understanding the results from Spark's KMeans clustering object

2018-05-18 Thread shubham
Hello Everyone,

I am performing clustering on a dataset using PySpark. To find the number of
clusters I performed clustering over a range of values (2,20) and computed the
WSSE (within-cluster sum of squares) for each value of k. This is where I found
something unusual. According to my understanding, when you increase the number
of clusters the WSSE decreases monotonically, but the results I got say
otherwise. I'm displaying the WSSE for the first few values of k only:

Results from Spark

For k = 002 WSSE is 255318.793358
For k = 003 WSSE is 209788.479560
For k = 004 WSSE is 208498.351074
For k = 005 WSSE is 142573.272672
For k = 006 WSSE is 154419.027612
For k = 007 WSSE is 115092.404604
For k = 008 WSSE is 104753.205635
For k = 009 WSSE is 98000.985547
For k = 010 WSSE is 95134.137071
If you look at the WSSE values for k=5 and k=6, you'll see the WSSE has
increased. I turned to sklearn to see if I get similar results. The code I
used for Spark and sklearn is in the appendix section towards the end of
the post. I have tried to use the same values for the parameters in the Spark
and sklearn KMeans models. The following are the results from sklearn, and
they are as I expected them to be - monotonically decreasing.

Results from sklearn

For k = 002 WSSE is 245090.224247
For k = 003 WSSE is 201329.888159
For k = 004 WSSE is 166889.044195
For k = 005 WSSE is 142576.895154
For k = 006 WSSE is 123882.070776
For k = 007 WSSE is 112496.692455
For k = 008 WSSE is 102806.001664
For k = 009 WSSE is 95279.837212
For k = 010 WSSE is 89303.574467
I am not sure why the WSSE values increase in Spark. I tried using
different datasets and found similar behavior there as well. Is there
somewhere I am going wrong? Any clues would be great.

APPENDIX
The dataset is located here.

Read the data and declare variables

# get data
import pandas as pd
url = "https://raw.githubusercontent.com/vectosaurus/bb_lite/master/3.0%20data/adult_comp_cont.csv"

df_pandas = pd.read_csv(url)
df_spark = sqlContext.createDataFrame(df_pandas)  # assumes an existing sqlContext (e.g. in the PySpark shell)
target_col = 'high_income'
numeric_cols = [i for i in df_pandas.columns if i != target_col]

k_min = 2   # 2 is inclusive
k_max = 21  # 21 is exclusive; will fit up to k = 20

max_iter = 1000
seed = 42
This is the code I am using for getting the sklearn results:

from sklearn.cluster import KMeans as KMeans_SKL
from sklearn.preprocessing import StandardScaler as StandardScaler_SKL

ss = StandardScaler_SKL(with_std=True, with_mean=True)
ss.fit(df_pandas.loc[:, numeric_cols])
df_pandas_scaled = pd.DataFrame(ss.transform(df_pandas.loc[:, numeric_cols]))

wsse_collect = []

for i in range(k_min, k_max):
    km = KMeans_SKL(random_state=seed, max_iter=max_iter, n_clusters=i)
    _ = km.fit(df_pandas_scaled)
    wsse = km.inertia_
    print('For k = {i:03d} WSSE is {wsse:10f}'.format(i=i, wsse=wsse))
    wsse_collect.append(wsse)
This is the code I am using for getting the Spark results:

from pyspark.ml.feature import StandardScaler, VectorAssembler
from pyspark.ml.clustering import KMeans

standard_scaler_inpt_features = 'ss_features'
kmeans_input_features = 'features'
kmeans_prediction_features = 'prediction'

assembler = VectorAssembler(inputCols=numeric_cols,
                            outputCol=standard_scaler_inpt_features)
assembled_df = assembler.transform(df_spark)

scaler = StandardScaler(inputCol=standard_scaler_inpt_features,
                        outputCol=kmeans_input_features,
                        withStd=True, withMean=True)
scaler_model = scaler.fit(assembled_df)
scaled_data = scaler_model.transform(assembled_df)

wsse_collect_spark = []

for i in range(k_min, k_max):
    km = KMeans(featuresCol=kmeans_input_features,
                predictionCol=kmeans_prediction_features,
                k=i, maxIter=max_iter, seed=seed)
    km_fit = km.fit(scaled_data)
    wsse_spark = km_fit.computeCost(scaled_data)
    wsse_collect_spark.append(wsse_spark)
    print('For k = {i:03d} WSSE is {wsse:10f}'.format(i=i, wsse=wsse_spark))







Re: How to Spark can solve this example

2018-05-18 Thread Matteo Cossu
Hello Esa,
all the steps that you described can be performed with Spark. I don't know
about CEP, but Spark Streaming should be enough.
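
As a rough batch-mode illustration of the three steps in the quoted mail below
(the DataFrame names, column names and input paths are placeholders; in
streaming, the same joins would need event-time columns and watermarks, or a
stateful operator such as mapGroupsWithState, to express the ordering between
the steps):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("EventTraces").getOrCreate()
import spark.implicits._

// dfA, dfB, dfC stand for the three inputs, each with columns K1..K4
val dfA = spark.read.parquet("/data/A")   // placeholder paths
val dfB = spark.read.parquet("/data/B")
val dfC = spark.read.parquet("/data/C")

// Step 1: events in A with K1 = "X"
val a = dfA.filter($"K1" === "X").select($"K1".as("a_k1"))

// Step 2: events in B whose K2 matches A(K1)
val ab = a.join(dfB, dfB("K2") === $"a_k1")
          .select($"a_k1", dfB("K3").as("b_k3"))

// Step 3: events in C whose K1 matches A(K1) and K2 matches B(K3)
val traces = ab.join(dfC, dfC("K1") === $"a_k1" && dfC("K2") === $"b_k3")

traces.show()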

Best,

Matteo

On 18 May 2018 at 09:20, Esa Heikkinen  wrote:

> Hi
>
> I have attached a fictive example (PDF file) about processing event
> traces from data streams (or batch data). I hope the picture in the
> attachment is clear and understandable.
>
> I would be very interested in how best to solve this with Spark, or whether
> it is possible at all. If it is possible, can it be solved, for example,
> with CEP?
>
> A little explanation: the data processing reads three different, parallel
> streams (or batches): A, B and C. Each of them has events (records) with
> different keys and values (like K1-K4).
>
> I want to find all event traces that have certain dependencies or patterns
> between the streams (or batches). To find a pattern there are three
> steps:
>
> 1)  Search for an event in stream A that has value "X" in K1; if found,
> store it in global data for later use and continue to the next step
>
> 2)  Search for an event in stream B that has value A(K1) in K2; if found,
> store it in global data for later use and continue to the next step
>
> 3)  Search for an event in stream C that has value A(K1) in K1 and value
> B(K3) in K2; if found, continue to the next step (back to step 1)
>
> If that is not possible with Spark, do you have any idea of tools that can
> solve this?
>
> Best, Esa
>
>
>
>
>


Exception thrown in awaitResult during application launch in yarn cluster

2018-05-18 Thread Shiyuan
Hi Spark-users,
   I am using PySpark on a YARN cluster. One of my Spark application launches
failed: only the driver container had started before the application failed in
the ACCEPTED state. The error message is very short and I cannot make sense of
it; the message is attached below. Any possible causes for this error? Thank you!

18/05/18 06:33:09 INFO ApplicationMaster: Starting the user application in a separate Thread
18/05/18 06:33:09 INFO ApplicationMaster: Waiting for spark context initialization...
18/05/18 06:33:14 ERROR ApplicationMaster: User application exited with status 2
18/05/18 06:33:14 INFO ApplicationMaster: Final app status: FAILED, exitCode: 2, (reason: User application exited with status 2)
18/05/18 06:33:14 ERROR ApplicationMaster: Uncaught exception:
org.apache.spark.SparkException: Exception thrown in awaitResult:
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194)
    at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:428)
    at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:281)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:783)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:67)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
    at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:781)
    at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
Caused by: org.apache.spark.SparkUserAppException: User application exited with 2
    at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:104)
    at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:654)
18/05/18 06:33:14 INFO ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: User application exited with status 2)
18/05/18 06:33:14 INFO ApplicationMaster: Deleting staging directory