Spark Structured Streaming handles compressed files

2018-10-31 Thread Lian Jiang
We have jsonl files, each of which is compressed as a gz file. Is it possible
to make SSS handle such files? Appreciate any help!
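
A minimal sketch of what this could look like (the path and schema are made up): Spark's file
sources, including the streaming json source, decompress .gz files transparently based on the
file extension, so pointing readStream at the gzipped jsonl files should be enough. Note that a
schema is required for streaming file sources.

import org.apache.spark.sql.types._

// Hypothetical schema and path; each line of the .jsonl.gz files is one JSON record.
val schema = new StructType()
  .add("id", LongType)
  .add("payload", StringType)

val events = spark.readStream
  .schema(schema)                              // mandatory for streaming file sources
  .json("hdfs:///data/events/*.jsonl.gz")      // .gz is decompressed transparently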


Rack Awareness in Spark

2018-10-31 Thread RuiyangChen
Hello everyone,
Is there a way to specify rack awareness in Spark? For example, if I want
to use aggregateByKey, is there a way to let Spark aggregate within the same
rack first, then aggregate between racks? I'm interested in this because I am
trying to figure out whether there is a way to deal with a slow inter-rack
network.
I have searched through the mailing list and StackOverflow, but all of them
are talking about rack awareness in HDFS instead of Spark.
Thanks a lot!

Ruiyang
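
For reference, aggregateByKey already aggregates in two levels: a map-side combine within each
partition before the shuffle, then a merge of the partial results afterwards. The standard API
does not expose a rack-level stage in between, but the per-partition step is the part that stays
local. A minimal sketch with made-up data:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))

val sums = pairs.aggregateByKey(0)(
  (acc, v) => acc + v,   // seqOp: runs within a partition, before any network traffic
  (x, y)   => x + y      // combOp: merges partial results across partitions after the shuffle
)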






Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation regarding how to run spark-shell on k8s cluster?

2018-10-31 Thread Zhang, Yuqi
Hi Li,

Thank you very much for your reply!

> Did you make the headless service that reflects the driver pod name?
I am not sure, but I used “app” as the selector in the headless service, which is
the same app name as in the StatefulSet that creates the Spark driver pod.
For your reference, I attached the yaml files for the headless service and the
StatefulSet. Could you please take a look at them when you have time?

I appreciate your help & have a good day!

Best Regards,
--
Yuqi Zhang
Software Engineer
m: 090-6725-6573



2 Chome-2-23-1 Akasaka
Minato, Tokyo 107-0052
teradata.com

This e-mail is from Teradata Corporation and may contain information that is 
confidential or proprietary. If you are not the intended recipient, do not 
read, copy or distribute the e-mail or any attachments. Instead, please notify 
the sender and delete the e-mail and any attachments. Thank you.

Please consider the environment before printing.



From: Li Gao 
Date: Thursday, November 1, 2018 4:56
To: "Zhang, Yuqi" 
Cc: Gourav Sengupta , "user@spark.apache.org" 
, "Nogami, Masatsugu" 
Subject: Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation 
regarding how to run spark-shell on k8s cluster?

Hi Yuqi,

Yes we are running Jupyter Gateway and kernels on k8s and using Spark 2.4's 
client mode to launch pyspark. In client mode your driver is running on the 
same pod where your kernel runs.

I am planning to write some blog post on this on some future date. Did you make 
the headless service that reflects the driver pod name? Thats one of critical 
pieces we automated in our custom code that makes the client mode works.

-Li


On Wed, Oct 31, 2018 at 8:13 AM Zhang, Yuqi 
mailto:yuqi.zh...@teradata.com>> wrote:
Hi Li,

Thank you for your reply.
Do you mean running Jupyter client on k8s cluster to use spark 2.4? Actually I 
am also trying to set up JupyterHub on k8s to use spark, that’s why I would 
like to know how to run spark client mode on k8s cluster. If there is any 
related documentation on how to set up the Jupyter on k8s to use spark, could 
you please share with me?

Thank you for your help!

Best Regards,
--
Yuqi Zhang
Software Engineer
m: 090-6725-6573



2 Chome-2-23-1 Akasaka
Minato, Tokyo 107-0052
teradata.com
This e-mail is from Teradata Corporation and may contain information that is 
confidential or proprietary. If you are not the intended recipient, do not 
read, copy or distribute the e-mail or any attachments. Instead, please notify 
the sender and delete the e-mail and any attachments. Thank you.

Please consider the environment before printing.



From: Li Gao mailto:ligao...@gmail.com>>
Date: Thursday, November 1, 2018 0:07
To: "Zhang, Yuqi" 
Cc: "gourav.sengu...@gmail.com" 
mailto:gourav.sengu...@gmail.com>>, 
"user@spark.apache.org" 
mailto:user@spark.apache.org>>, "Nogami, Masatsugu" 

Subject: Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation 
regarding how to run spark-shell on k8s cluster?

Yuqi,

Your error seems unrelated to headless service config you need to enable. For 
headless service you need to create a headless service that matches to your 
driver pod name exactly in order for spark 2.4 RC to work under client mode. We 
have this running for a while now using Jupyter kernel as the driver client.

-Li


On Wed, Oct 31, 2018 at 7:30 AM Zhang, Yuqi 
mailto:yuqi.zh...@teradata.com>> wrote:
Hi Gourav,

Thank you for your reply.

I haven’t try glue or EMK, but I guess it’s integrating kubernetes on aws 
instances?
I could set up the k8s cluster on AWS, but my problem is don’t know how to run 
spark-shell on kubernetes…
Since spark only support client mode on k8s from 2.4 version which is not 
officially released yet, I would like to ask if there is more detailed 
documentation regarding the way to run spark-shell on k8s cluster?

Thank you in advance & best regards!

--
Yuqi Zhang
Software Engineer
m: 090-6725-6573



2 Chome-2-23-1 Akasaka
Minato, Tokyo 107-0052
teradata.com
This e-mail is from Teradata Corporation and may contain information that is 
confidential or proprietary. If you are not the intended recipient, do not 
read, copy or distribute the e-mail or any attachments. Instead, please notify 
the sender and delete the e-mail and any attachments. Thank you.

Please consider the environment before printing.



From: Gourav Sengupta 
mailto:gourav.sengu...@gmail.com>>
Date: Wednesday, October 31, 2018 18:34
To: "Zhang, Yuqi" 
Cc: user mailto:user@spark.apache.org>>, "Nogami, 
Masatsugu" 
Subject: Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation 
regarding how to run spark-shell on k8s cluster?

[External Email]


Re: Iterator of KeyValueGroupedDataset.flatMapGroupsWithState function

2018-10-31 Thread Tathagata Das
It is okay to collect the iterator. That will not break Spark. However,
collecting it requires memory in the executor, so you may cause OOMs if a
group has a LOT of new data.
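
A sketch of what materializing the iterator looks like (the types and names are invented).
Functionally this is safe, as noted above; the cost is that the group's entire batch of new
values sits in executor memory at once:

import org.apache.spark.sql.streaming.GroupState

case class Event(user: String, amount: Long)
case class RunningTotal(total: Long)

// Hypothetical update function passed to flatMapGroupsWithState.
def updateGroup(user: String,
                events: Iterator[Event],
                state: GroupState[RunningTotal]): Iterator[(String, Long)] = {
  val buffered = events.toList   // collects the lazy iterator into memory
  val total = state.getOption.map(_.total).getOrElse(0L) + buffered.map(_.amount).sum
  state.update(RunningTotal(total))
  Iterator((user, total))
}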

On Wed, Oct 31, 2018 at 3:44 AM Antonio Murgia -
antonio.murg...@studio.unibo.it  wrote:

> Hi all,
>
> I'm currently developing a Spark Structured Streaming job and I'm
> performing flatMapGroupsWithState.
>
> I'm concerned about the laziness of the Iterator[V] that is passed to my
> custom function (func: (K, Iterator[V], GroupState[S]) => Iterator[U]).
>
> Is it ok to collect that iterator (with a toList, for example)? I have a
> logic that is practically impossible to perform on a Iterator, but I do not
> want to break Spark lazy chain, obviously.
>
>
> Thank you in advance.
>
>
> #A.M.
>


Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation regarding how to run spark-shell on k8s cluster?

2018-10-31 Thread Li Gao
Hi Yuqi,

Yes we are running Jupyter Gateway and kernels on k8s and using Spark 2.4's
client mode to launch pyspark. In client mode your driver is running on the
same pod where your kernel runs.

I am planning to write a blog post on this at some future date. Did you
make the headless service that reflects the driver pod name? That's one of the
critical pieces we automated in our custom code to make client mode work.

-Li
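
For illustration, a headless service along these lines is what that typically means (all names
below are placeholders): clusterIP is None and the selector matches the labels on the pod
running the driver, so the service DNS name resolves straight to the driver and can be used as
spark.driver.host:

apiVersion: v1
kind: Service
metadata:
  name: spark-driver-svc        # placeholder; spark.driver.host would point at this name
spec:
  clusterIP: None               # headless: DNS resolves directly to the pod IP
  selector:
    app: my-driver-pod          # must match the labels on the driver pod / StatefulSet
  ports:
    - name: driver-rpc
      port: 7077
    - name: blockmanager
      port: 7078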


On Wed, Oct 31, 2018 at 8:13 AM Zhang, Yuqi  wrote:

> Hi Li,
>
>
>
> Thank you for your reply.
>
> Do you mean running Jupyter client on k8s cluster to use spark 2.4?
> Actually I am also trying to set up JupyterHub on k8s to use spark, that’s
> why I would like to know how to run spark client mode on k8s cluster. If
> there is any related documentation on how to set up the Jupyter on k8s to
> use spark, could you please share with me?
>
>
>
> Thank you for your help!
>
>
>
> Best Regards,
>
> --
>
> Yuqi Zhang
>
> Software Engineer
>
> m: 090-6725-6573
>
>
> [image: signature_147554612] 
>
> 2 Chome-2-23-1 Akasaka
>
> Minato, Tokyo 107-0052
> teradata.com 
>
> This e-mail is from Teradata Corporation and may contain information that
> is confidential or proprietary. If you are not the intended recipient, do
> not read, copy or distribute the e-mail or any attachments. Instead, please
> notify the sender and delete the e-mail and any attachments. Thank you.
>
> Please consider the environment before printing.
>
>
>
>
>
>
>
> *From: *Li Gao 
> *Date: *Thursday, November 1, 2018 0:07
> *To: *"Zhang, Yuqi" 
> *Cc: *"gourav.sengu...@gmail.com" , "
> user@spark.apache.org" , "Nogami, Masatsugu"
> 
> *Subject: *Re: [Spark Shell on AWS K8s Cluster]: Is there more
> documentation regarding how to run spark-shell on k8s cluster?
>
>
>
> Yuqi,
>
>
>
> Your error seems unrelated to headless service config you need to enable.
> For headless service you need to create a headless service that matches to
> your driver pod name exactly in order for spark 2.4 RC to work under client
> mode. We have this running for a while now using Jupyter kernel as the
> driver client.
>
>
>
> -Li
>
>
>
>
>
> On Wed, Oct 31, 2018 at 7:30 AM Zhang, Yuqi 
> wrote:
>
> Hi Gourav,
>
>
>
> Thank you for your reply.
>
>
>
> I haven’t try glue or EMK, but I guess it’s integrating kubernetes on aws
> instances?
>
> I could set up the k8s cluster on AWS, but my problem is don’t know how to
> run spark-shell on kubernetes…
>
> Since spark only support client mode on k8s from 2.4 version which is not
> officially released yet, I would like to ask if there is more detailed
> documentation regarding the way to run spark-shell on k8s cluster?
>
>
>
> Thank you in advance & best regards!
>
>
>
> --
>
> Yuqi Zhang
>
> Software Engineer
>
> m: 090-6725-6573
>
>
> [image: signature_147554612] 
>
> 2 Chome-2-23-1 Akasaka
>
> Minato, Tokyo 107-0052
> teradata.com 
>
> This e-mail is from Teradata Corporation and may contain information that
> is confidential or proprietary. If you are not the intended recipient, do
> not read, copy or distribute the e-mail or any attachments. Instead, please
> notify the sender and delete the e-mail and any attachments. Thank you.
>
> Please consider the environment before printing.
>
>
>
>
>
>
>
> *From: *Gourav Sengupta 
> *Date: *Wednesday, October 31, 2018 18:34
> *To: *"Zhang, Yuqi" 
> *Cc: *user , "Nogami, Masatsugu"
> 
> *Subject: *Re: [Spark Shell on AWS K8s Cluster]: Is there more
> documentation regarding how to run spark-shell on k8s cluster?
>
>
>
> [External Email]
> --
>
> Just out of curiosity why would you not use Glue (which is Spark on
> kubernetes) or EMR?
>
>
>
> Regards,
>
> Gourav Sengupta
>
>
>
> On Mon, Oct 29, 2018 at 1:29 AM Zhang, Yuqi 
> wrote:
>
> Hello guys,
>
>
>
> I am Yuqi from Teradata Tokyo. Sorry to disturb but I have some problem
> regarding using spark 2.4 client mode function on kubernetes cluster, so I
> would like to ask if there is some solution to my problem.
>
>
>
> The problem is when I am trying to run spark-shell on kubernetes v1.11.3
> cluster on AWS environment, I couldn’t successfully run stateful set using
> the docker image built from spark 2.4. The error message is showing below.
> The version I am using is spark v2.4.0-rc3.
>
>
>
> Also, I wonder if there is more documentation on how to use client-mode or
> integrate spark-shell on kubernetes cluster. From the documentation on
> https://github.com/apache/spark/blob/v2.4.0-rc3/docs/running-on-kubernetes.md
> there is only a brief description. I understand it’s not the official
> released version yet, but If there is some more documentation, could you
> please share with me?
>
>
>
> Thank you very much for your help!
>
>
>
>
>
> Error msg:
>
> + env
>
> + sed 's/[^=]*=\(.*\)/\1/g'
>
> + sort -t_ -k4 -n
>
> + grep SPARK_JAVA_OPT_
>
> + readarray -t 

Re: Apache Spark orc read performance when reading large number of small files

2018-10-31 Thread gpatcham
spark version 2.2.0
Hive version 1.1.0

There are a lot of small files.

Spark code:

"spark.sql.orc.enabled": "true",
"spark.sql.orc.filterPushdown": "true"

val logs = spark.read.schema(schema).orc("hdfs://test/date=201810")
  .filter("date > 20181003")

Hive:

"spark.sql.orc.enabled": "true",
"spark.sql.orc.filterPushdown": "true"

The test table in Hive points to hdfs://test/ and is partitioned on date.

val sqlStr = s"select * from test where date > 20181001"
val logs = spark.sql(sqlStr)

With the Hive query I don't see filter pushdown happening. I tried setting
these configs both in hive-site.xml and via spark.sqlContext.setConf:

"hive.optimize.ppd": "true",
"hive.optimize.ppd.storage": "true"






Re: Apache Spark orc read performance when reading large number of small files

2018-10-31 Thread Jörn Franke
How large are they? A lot of (small) files will cause significant delays in 
processing - try to merge as many as possible into one file.

Can you please share full source code in Hive and Spark as well as the versions 
you are using?
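
If merging is an option, a one-off compaction job is usually the simplest route. A sketch with
made-up paths and a made-up target file count:

// Read the many small ORC files once and rewrite them as a few larger ones.
spark.read.schema(schema).orc("hdfs://test/date=201810")
  .coalesce(16)                                    // aim for a file count that yields ~128-256 MB files
  .write.mode("overwrite").orc("hdfs://compacted/date=201810")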

> Am 31.10.2018 um 18:23 schrieb gpatcham :
> 
> 
> 
> When reading large number of orc files from HDFS under a directory spark
> doesn't launch any tasks until some amount of time and I don't see any tasks
> running during that time. I'm using below command to read orc and spark.sql
> configs.
> 
> What spark is doing under hoods when spark.read.orc is issued?
> 
> spark.read.schema(schame1).orc("hdfs://test1").filter("date >= 20181001")
> "spark.sql.orc.enabled": "true",
> "spark.sql.orc.filterPushdown": "true
> 
> Also instead of directly reading orc files I tried running Hive query on
> same dataset. But I was not able to push filter predicate. Where should I
> set the below config's "hive.optimize.ppd":"true",
> "hive.optimize.ppd.storage":"true"
> 
> Suggest what is the best way to read orc files from HDFS and tuning
> parameters ?
> 
> 
> 
> 
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 




Apache Spark orc read performance when reading large number of small files

2018-10-31 Thread gpatcham



When reading a large number of ORC files from HDFS under a directory, Spark
doesn't launch any tasks for some amount of time, and I don't see any tasks
running during that time. I'm using the command below to read ORC, plus these
spark.sql configs.

What is Spark doing under the hood when spark.read.orc is issued?

spark.read.schema(schame1).orc("hdfs://test1").filter("date >= 20181001")
"spark.sql.orc.enabled": "true",
"spark.sql.orc.filterPushdown": "true"

Also, instead of directly reading the ORC files, I tried running a Hive query on
the same dataset, but I was not able to push the filter predicate down. Where
should I set these configs: "hive.optimize.ppd": "true",
"hive.optimize.ppd.storage": "true"

What is the best way to read ORC files from HDFS, and which parameters should
be tuned?







Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation regarding how to run spark-shell on k8s cluster?

2018-10-31 Thread Zhang, Yuqi
Hi Li,

Thank you for your reply.
Do you mean running the Jupyter client on the k8s cluster to use Spark 2.4? Actually I
am also trying to set up JupyterHub on k8s to use Spark; that's why I would
like to know how to run Spark client mode on a k8s cluster. If there is any
related documentation on how to set up Jupyter on k8s to use Spark, could
you please share it with me?

Thank you for your help!

Best Regards,
--
Yuqi Zhang
Software Engineer
m: 090-6725-6573



2 Chome-2-23-1 Akasaka
Minato, Tokyo 107-0052
teradata.com

This e-mail is from Teradata Corporation and may contain information that is 
confidential or proprietary. If you are not the intended recipient, do not 
read, copy or distribute the e-mail or any attachments. Instead, please notify 
the sender and delete the e-mail and any attachments. Thank you.

Please consider the environment before printing.



From: Li Gao 
Date: Thursday, November 1, 2018 0:07
To: "Zhang, Yuqi" 
Cc: "gourav.sengu...@gmail.com" , 
"user@spark.apache.org" , "Nogami, Masatsugu" 

Subject: Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation 
regarding how to run spark-shell on k8s cluster?

Yuqi,

Your error seems unrelated to headless service config you need to enable. For 
headless service you need to create a headless service that matches to your 
driver pod name exactly in order for spark 2.4 RC to work under client mode. We 
have this running for a while now using Jupyter kernel as the driver client.

-Li


On Wed, Oct 31, 2018 at 7:30 AM Zhang, Yuqi 
mailto:yuqi.zh...@teradata.com>> wrote:
Hi Gourav,

Thank you for your reply.

I haven’t try glue or EMK, but I guess it’s integrating kubernetes on aws 
instances?
I could set up the k8s cluster on AWS, but my problem is don’t know how to run 
spark-shell on kubernetes…
Since spark only support client mode on k8s from 2.4 version which is not 
officially released yet, I would like to ask if there is more detailed 
documentation regarding the way to run spark-shell on k8s cluster?

Thank you in advance & best regards!

--
Yuqi Zhang
Software Engineer
m: 090-6725-6573



2 Chome-2-23-1 Akasaka
Minato, Tokyo 107-0052
teradata.com
This e-mail is from Teradata Corporation and may contain information that is 
confidential or proprietary. If you are not the intended recipient, do not 
read, copy or distribute the e-mail or any attachments. Instead, please notify 
the sender and delete the e-mail and any attachments. Thank you.

Please consider the environment before printing.



From: Gourav Sengupta 
mailto:gourav.sengu...@gmail.com>>
Date: Wednesday, October 31, 2018 18:34
To: "Zhang, Yuqi" 
Cc: user mailto:user@spark.apache.org>>, "Nogami, 
Masatsugu" 
Subject: Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation 
regarding how to run spark-shell on k8s cluster?

[External Email]

Just out of curiosity why would you not use Glue (which is Spark on kubernetes) 
or EMR?

Regards,
Gourav Sengupta

On Mon, Oct 29, 2018 at 1:29 AM Zhang, Yuqi 
mailto:yuqi.zh...@teradata.com>> wrote:
Hello guys,

I am Yuqi from Teradata Tokyo. Sorry to disturb but I have some problem 
regarding using spark 2.4 client mode function on kubernetes cluster, so I 
would like to ask if there is some solution to my problem.

The problem is when I am trying to run spark-shell on kubernetes v1.11.3 
cluster on AWS environment, I couldn’t successfully run stateful set using the 
docker image built from spark 2.4. The error message is showing below. The 
version I am using is spark v2.4.0-rc3.

Also, I wonder if there is more documentation on how to use client-mode or 
integrate spark-shell on kubernetes cluster. From the documentation on 
https://github.com/apache/spark/blob/v2.4.0-rc3/docs/running-on-kubernetes.md 
there is only a brief description. I understand it’s not the official released 
version yet, but If there is some more documentation, could you please share 
with me?

Thank you very much for your help!


Error msg:
+ env
+ sed 's/[^=]*=\(.*\)/\1/g'
+ sort -t_ -k4 -n
+ grep SPARK_JAVA_OPT_
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' -n '' ']'
+ PYSPARK_ARGS=
+ '[' -n '' ']'
+ R_ARGS=
+ '[' -n '' ']'
+ '[' '' == 2 ']'
+ '[' '' == 3 ']'
+ case "$SPARK_K8S_CMD" in
+ CMD=("$SPARK_HOME/bin/spark-submit" --conf 
"spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@")
+ exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf 
spark.driver.bindAddress= --deploy-mode client
Error: Missing application resource.
Usage: spark-submit [options]  [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]


--
Yuqi Zhang
Software Engineer
m: 090-6725-6573


Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation regarding how to run spark-shell on k8s cluster?

2018-10-31 Thread Li Gao
Yuqi,

Your error seems unrelated to the headless service config you need to enable.
For the headless service, you need to create one that matches your driver pod
name exactly in order for the Spark 2.4 RC to work in client mode. We have had
this running for a while now, using a Jupyter kernel as the driver client.

-Li


On Wed, Oct 31, 2018 at 7:30 AM Zhang, Yuqi  wrote:

> Hi Gourav,
>
>
>
> Thank you for your reply.
>
>
>
> I haven’t try glue or EMK, but I guess it’s integrating kubernetes on aws
> instances?
>
> I could set up the k8s cluster on AWS, but my problem is don’t know how to
> run spark-shell on kubernetes…
>
> Since spark only support client mode on k8s from 2.4 version which is not
> officially released yet, I would like to ask if there is more detailed
> documentation regarding the way to run spark-shell on k8s cluster?
>
>
>
> Thank you in advance & best regards!
>
>
>
> --
>
> Yuqi Zhang
>
> Software Engineer
>
> m: 090-6725-6573
>
>
> [image: signature_147554612] 
>
> 2 Chome-2-23-1 Akasaka
>
> Minato, Tokyo 107-0052
> teradata.com 
>
> This e-mail is from Teradata Corporation and may contain information that
> is confidential or proprietary. If you are not the intended recipient, do
> not read, copy or distribute the e-mail or any attachments. Instead, please
> notify the sender and delete the e-mail and any attachments. Thank you.
>
> Please consider the environment before printing.
>
>
>
>
>
>
>
> *From: *Gourav Sengupta 
> *Date: *Wednesday, October 31, 2018 18:34
> *To: *"Zhang, Yuqi" 
> *Cc: *user , "Nogami, Masatsugu"
> 
> *Subject: *Re: [Spark Shell on AWS K8s Cluster]: Is there more
> documentation regarding how to run spark-shell on k8s cluster?
>
>
>
> [External Email]
> --
>
> Just out of curiosity why would you not use Glue (which is Spark on
> kubernetes) or EMR?
>
>
>
> Regards,
>
> Gourav Sengupta
>
>
>
> On Mon, Oct 29, 2018 at 1:29 AM Zhang, Yuqi 
> wrote:
>
> Hello guys,
>
>
>
> I am Yuqi from Teradata Tokyo. Sorry to disturb but I have some problem
> regarding using spark 2.4 client mode function on kubernetes cluster, so I
> would like to ask if there is some solution to my problem.
>
>
>
> The problem is when I am trying to run spark-shell on kubernetes v1.11.3
> cluster on AWS environment, I couldn’t successfully run stateful set using
> the docker image built from spark 2.4. The error message is showing below.
> The version I am using is spark v2.4.0-rc3.
>
>
>
> Also, I wonder if there is more documentation on how to use client-mode or
> integrate spark-shell on kubernetes cluster. From the documentation on
> https://github.com/apache/spark/blob/v2.4.0-rc3/docs/running-on-kubernetes.md
> there is only a brief description. I understand it’s not the official
> released version yet, but If there is some more documentation, could you
> please share with me?
>
>
>
> Thank you very much for your help!
>
>
>
>
>
> Error msg:
>
> + env
>
> + sed 's/[^=]*=\(.*\)/\1/g'
>
> + sort -t_ -k4 -n
>
> + grep SPARK_JAVA_OPT_
>
> + readarray -t SPARK_EXECUTOR_JAVA_OPTS
>
> + '[' -n '' ']'
>
> + '[' -n '' ']'
>
> + PYSPARK_ARGS=
>
> + '[' -n '' ']'
>
> + R_ARGS=
>
> + '[' -n '' ']'
>
> + '[' '' == 2 ']'
>
> + '[' '' == 3 ']'
>
> + case "$SPARK_K8S_CMD" in
>
> + CMD=("$SPARK_HOME/bin/spark-submit" --conf
> "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client
> "$@")
>
> + exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf
> spark.driver.bindAddress= --deploy-mode client
>
> Error: Missing application resource.
>
> Usage: spark-submit [options]  [app
> arguments]
>
> Usage: spark-submit --kill [submission ID] --master [spark://...]
>
> Usage: spark-submit --status [submission ID] --master [spark://...]
>
> Usage: spark-submit run-example [options] example-class [example args]
>
>
>
>
>
> --
>
> Yuqi Zhang
>
> Software Engineer
>
> m: 090-6725-6573
>
>
> [image: signature_147554612] 
>
> 2 Chome-2-23-1 Akasaka
>
> Minato, Tokyo 107-0052
> teradata.com 
>
> This e-mail is from Teradata Corporation and may contain information that
> is confidential or proprietary. If you are not the intended recipient, do
> not read, copy or distribute the e-mail or any attachments. Instead, please
> notify the sender and delete the e-mail and any attachments. Thank you.
>
> Please consider the environment before printing.
>
>
>
>
>
>


Re: I want run deep neural network on Spark

2018-10-31 Thread hager
Many thanks.
Has anyone here run this on PySpark?






Re: I want run deep neural network on Spark

2018-10-31 Thread Kunkel, Michael C.
Greetings,

There are libraries for deep neural nets that can be used with Spark. DL4J is 
one, and it’s as simple as changing a constructor and the Maven dependency.

BR
MK

Michael C. Kunkel, USMC, PhD
Forschungszentrum Jülich
Nuclear Physics Institute and Juelich Center for Hadron Physics
Experimental Hadron Structure (IKP-1)
www.fz-juelich.de/ikp
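
Alongside DL4J, Spark MLlib itself ships a basic feedforward network,
MultilayerPerceptronClassifier, which covers simple cases without an extra dependency. A minimal
Scala sketch (the layer sizes and the training DataFrame are placeholders):

import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

// `train` is assumed to be a DataFrame with "features" (Vector) and "label" columns.
val layers = Array(4, 8, 8, 3)   // input size, two hidden layers, number of classes
val mlp = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setMaxIter(100)

val model = mlp.fit(train)
val predictions = model.transform(train)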

On Oct 31, 2018, at 15:09, hager 
mailto:loveallah1...@yahoo.com>> wrote:

There are any libraries in spark to support deep neural network








Forschungszentrum Juelich GmbH
52425 Juelich
Sitz der Gesellschaft: Juelich
Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
Vorsitzender des Aufsichtsrats: MinDir Dr. Karl Eugen Huthmacher
Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender),
Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
Prof. Dr. Sebastian M. Schmidt





Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation regarding how to run spark-shell on k8s cluster?

2018-10-31 Thread Zhang, Yuqi
Hi Gourav,

Thank you for your reply.

I haven’t tried Glue or EMR, but I guess it’s integrating Kubernetes on AWS
instances?
I could set up the k8s cluster on AWS, but my problem is that I don’t know how to
run spark-shell on Kubernetes…
Since Spark only supports client mode on k8s from version 2.4, which is not
officially released yet, I would like to ask if there is more detailed
documentation on how to run spark-shell on a k8s cluster?

Thank you in advance & best regards!

--
Yuqi Zhang
Software Engineer
m: 090-6725-6573



2 Chome-2-23-1 Akasaka
Minato, Tokyo 107-0052
teradata.com

This e-mail is from Teradata Corporation and may contain information that is 
confidential or proprietary. If you are not the intended recipient, do not 
read, copy or distribute the e-mail or any attachments. Instead, please notify 
the sender and delete the e-mail and any attachments. Thank you.

Please consider the environment before printing.



From: Gourav Sengupta 
Date: Wednesday, October 31, 2018 18:34
To: "Zhang, Yuqi" 
Cc: user , "Nogami, Masatsugu" 

Subject: Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation 
regarding how to run spark-shell on k8s cluster?

[External Email]

Just out of curiosity why would you not use Glue (which is Spark on kubernetes) 
or EMR?

Regards,
Gourav Sengupta

On Mon, Oct 29, 2018 at 1:29 AM Zhang, Yuqi 
mailto:yuqi.zh...@teradata.com>> wrote:
Hello guys,

I am Yuqi from Teradata Tokyo. Sorry to disturb but I have some problem 
regarding using spark 2.4 client mode function on kubernetes cluster, so I 
would like to ask if there is some solution to my problem.

The problem is when I am trying to run spark-shell on kubernetes v1.11.3 
cluster on AWS environment, I couldn’t successfully run stateful set using the 
docker image built from spark 2.4. The error message is showing below. The 
version I am using is spark v2.4.0-rc3.

Also, I wonder if there is more documentation on how to use client-mode or 
integrate spark-shell on kubernetes cluster. From the documentation on 
https://github.com/apache/spark/blob/v2.4.0-rc3/docs/running-on-kubernetes.md 
there is only a brief description. I understand it’s not the official released 
version yet, but If there is some more documentation, could you please share 
with me?

Thank you very much for your help!


Error msg:
+ env
+ sed 's/[^=]*=\(.*\)/\1/g'
+ sort -t_ -k4 -n
+ grep SPARK_JAVA_OPT_
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' -n '' ']'
+ PYSPARK_ARGS=
+ '[' -n '' ']'
+ R_ARGS=
+ '[' -n '' ']'
+ '[' '' == 2 ']'
+ '[' '' == 3 ']'
+ case "$SPARK_K8S_CMD" in
+ CMD=("$SPARK_HOME/bin/spark-submit" --conf 
"spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@")
+ exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf 
spark.driver.bindAddress= --deploy-mode client
Error: Missing application resource.
Usage: spark-submit [options]  [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]


--
Yuqi Zhang
Software Engineer
m: 090-6725-6573



2 Chome-2-23-1 Akasaka
Minato, Tokyo 107-0052
teradata.com

This e-mail is from Teradata Corporation and may contain information that is 
confidential or proprietary. If you are not the intended recipient, do not 
read, copy or distribute the e-mail or any attachments. Instead, please notify 
the sender and delete the e-mail and any attachments. Thank you.

Please consider the environment before printing.




I want run deep neural network on Spark

2018-10-31 Thread hager
Are there any libraries in Spark to support deep neural networks?






Re: dremel paper example schema

2018-10-31 Thread lchorbadjiev
Hi Jorn,

Thanks for the help. I switched to using Apache Parquet 1.8.3 and now Spark
successfully loads the parquet file.

Do you have any hint for the other part of my question? What is the correct
way to reproduce this schema:

message Document {
  required int64 DocId;
  optional group Links {
repeated int64 Backward;
repeated int64 Forward;
  }
  repeated group Name {
repeated group Language {
  required binary Code;
  optional binary Country;
}
optional binary Url;
  }
}

using Apache Spark SQL types?

Thanks,
Lubomir Chorbadjiev
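
For what it's worth, a rough Spark SQL rendering of that Document schema in Scala could look
like the following, with nullable = false standing in for required. Note that Spark's Parquet
writer stores ArrayType fields with the standard LIST wrapper (list/element groups), so the
physical Parquet schema will still differ from the paper's bare repeated fields:

import org.apache.spark.sql.types._

val document = StructType(Seq(
  StructField("DocId", LongType, nullable = false),
  StructField("Links", StructType(Seq(
    StructField("Backward", ArrayType(LongType, containsNull = false)),
    StructField("Forward",  ArrayType(LongType, containsNull = false))
  )), nullable = true),
  StructField("Name", ArrayType(StructType(Seq(
    StructField("Language", ArrayType(StructType(Seq(
      StructField("Code",    StringType, nullable = false),   // binary in the paper
      StructField("Country", StringType, nullable = true)
    ))), nullable = true),
    StructField("Url", StringType, nullable = true)
  ))), nullable = true)
))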






Re: dremel paper example schema

2018-10-31 Thread Jörn Franke
I would try with the same version as Spark uses first. I don’t have the 
changelog of parquet in my head (but you can find it on the Internet), but the 
version mismatch could be the cause of your issues.

> Am 31.10.2018 um 12:26 schrieb lchorbadjiev :
> 
> Hi Jorn,
> 
> I am using Apache Spark 2.3.1.
> 
> For creating the parquet file I have used Apache Parquet (parquet-mr) 1.10.
> This does not match the version of parquet used in Apache Spark 2.3.1 and if
> you think that this could be the problem I could try to use Apache Parquet
> version 1.8.3.
> 
> I created a parquet file using Apache Spark SQL types, but can not make the
> resulting schema to match the schema described in the paper.
> 
> What I do is to use Spark SQL array type for repeated values. For example,
> where papers says
> 
>repeated int64 Backward;
> 
> I use array type:
> 
>StructField("Backward", ArrayType(IntegerType(), containsNull=False),
> nullable=False)
> 
> The resulting schema, reported by parquet-tools is:
> 
>optional group backward (LIST) {
>  repeated group list {
>required int32 element;
>  }
>}
> 
> Thanks,
> Lubomir Chorbadjiev
> 
> 
> 
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 




Re: dremel paper example schema

2018-10-31 Thread lchorbadjiev
Hi Jorn,

I am using Apache Spark 2.3.1.

For creating the parquet file I have used Apache Parquet (parquet-mr) 1.10.
This does not match the version of parquet used in Apache Spark 2.3.1 and if
you think that this could be the problem I could try to use Apache Parquet
version 1.8.3.

I created a parquet file using Apache Spark SQL types, but cannot make the
resulting schema match the schema described in the paper.

What I do is use the Spark SQL array type for repeated values. For example,
where the paper says

repeated int64 Backward;

I use array type:

StructField("Backward", ArrayType(IntegerType(), containsNull=False),
nullable=False)

The resulting schema, reported by parquet-tools is:

optional group backward (LIST) {
  repeated group list {
required int32 element;
  }
}

Thanks,
Lubomir Chorbadjiev






Iterator of KeyValueGroupedDataset.flatMapGroupsWithState function

2018-10-31 Thread Antonio Murgia - antonio.murg...@studio.unibo.it
Hi all,

I'm currently developing a Spark Structured Streaming job and I'm performing 
flatMapGroupsWithState.

I'm concerned about the laziness of the Iterator[V] that is passed to my custom 
function (func: (K, Iterator[V], GroupState[S]) => Iterator[U]).

Is it ok to collect that iterator (with a toList, for example)? I have logic 
that is practically impossible to perform on an Iterator, but I do not want to 
break Spark's lazy chain, obviously.


Thank you in advance.


#A.M.


Event Hubs properties kvp-value adds " to strings

2018-10-31 Thread Magnus Nilsson
Hello all,

I have this peculiar problem where quote " characters are added to the
beginning and end of my string values.

I get data using Structured Streaming from an Azure Event Hub using a Scala
Notebook in Azure Databricks.

The Dataframe schema received contains a property of type Map named
"properties", containing string/string key value pairs. Only one pair in my
case.

I input data from a C# program where the "properties" property in the
EventData schema is a Dictionary where I input a
string/string property pair.

If I input a string value of _Hello_ in C# I get back a string value of
_"Hello"_ in my Dataframe.

In the body tag of the Schema I input a UTF-8 byte[] and get back a UTF-8
byte[] which I cast to string, this turns out correct, no " are added to
the string value.

If I try to pass in a UTF-8 string as a byte[] as the value of the KvP I
get a serialization exception so that's a no go.

Any idea on why or where the quote characters "" are added to the value of
the string when I read them in Spark?

Any ideas would greatly be appreciated.

//Magnus
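
If the extra quotes turn out to be the property values arriving JSON-encoded, one blunt
workaround on the Spark side is to strip a leading and trailing quote from the map value. A
sketch (the column and key names are invented):

import org.apache.spark.sql.functions._

// `df` is the DataFrame from the Event Hubs source; "myKey" is a placeholder property name.
val cleaned = df.withColumn(
  "myValue",
  regexp_replace(col("properties").getItem("myKey"), "^\"|\"$", "")
)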


Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation regarding how to run spark-shell on k8s cluster?

2018-10-31 Thread Biplob Biswas
Hi Yuqi,

Just curious, can you share the spark-submit script and what you are passing
as the --master argument?

Thanks & Regards
Biplob Biswas


On Wed, Oct 31, 2018 at 10:34 AM Gourav Sengupta 
wrote:

> Just out of curiosity why would you not use Glue (which is Spark on
> kubernetes) or EMR?
>
> Regards,
> Gourav Sengupta
>
> On Mon, Oct 29, 2018 at 1:29 AM Zhang, Yuqi 
> wrote:
>
>> Hello guys,
>>
>>
>>
>> I am Yuqi from Teradata Tokyo. Sorry to disturb but I have some problem
>> regarding using spark 2.4 client mode function on kubernetes cluster, so I
>> would like to ask if there is some solution to my problem.
>>
>>
>>
>> The problem is when I am trying to run spark-shell on kubernetes v1.11.3
>> cluster on AWS environment, I couldn’t successfully run stateful set using
>> the docker image built from spark 2.4. The error message is showing below.
>> The version I am using is spark v2.4.0-rc3.
>>
>>
>>
>> Also, I wonder if there is more documentation on how to use client-mode
>> or integrate spark-shell on kubernetes cluster. From the documentation on
>> https://github.com/apache/spark/blob/v2.4.0-rc3/docs/running-on-kubernetes.md
>> there is only a brief description. I understand it’s not the official
>> released version yet, but If there is some more documentation, could you
>> please share with me?
>>
>>
>>
>> Thank you very much for your help!
>>
>>
>>
>>
>>
>> Error msg:
>>
>> + env
>>
>> + sed 's/[^=]*=\(.*\)/\1/g'
>>
>> + sort -t_ -k4 -n
>>
>> + grep SPARK_JAVA_OPT_
>>
>> + readarray -t SPARK_EXECUTOR_JAVA_OPTS
>>
>> + '[' -n '' ']'
>>
>> + '[' -n '' ']'
>>
>> + PYSPARK_ARGS=
>>
>> + '[' -n '' ']'
>>
>> + R_ARGS=
>>
>> + '[' -n '' ']'
>>
>> + '[' '' == 2 ']'
>>
>> + '[' '' == 3 ']'
>>
>> + case "$SPARK_K8S_CMD" in
>>
>> + CMD=("$SPARK_HOME/bin/spark-submit" --conf
>> "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client
>> "$@")
>>
>> + exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf
>> spark.driver.bindAddress= --deploy-mode client
>>
>> Error: Missing application resource.
>>
>> Usage: spark-submit [options]  [app
>> arguments]
>>
>> Usage: spark-submit --kill [submission ID] --master [spark://...]
>>
>> Usage: spark-submit --status [submission ID] --master [spark://...]
>>
>> Usage: spark-submit run-example [options] example-class [example args]
>>
>>
>>
>>
>>
>> --
>>
>> Yuqi Zhang
>>
>> Software Engineer
>>
>> m: 090-6725-6573
>>
>>
>> [image: signature_147554612] 
>>
>> 2 Chome-2-23-1 Akasaka
>>
>> Minato, Tokyo 107-0052
>> teradata.com 
>>
>>
>> This e-mail is from Teradata Corporation and may contain information that
>> is confidential or proprietary. If you are not the intended recipient, do
>> not read, copy or distribute the e-mail or any attachments. Instead, please
>> notify the sender and delete the e-mail and any attachments. Thank you.
>>
>> Please consider the environment before printing.
>>
>>
>>
>>
>>
>


Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation regarding how to run spark-shell on k8s cluster?

2018-10-31 Thread Gourav Sengupta
Just out of curiosity why would you not use Glue (which is Spark on
kubernetes) or EMR?

Regards,
Gourav Sengupta

On Mon, Oct 29, 2018 at 1:29 AM Zhang, Yuqi  wrote:

> Hello guys,
>
>
>
> I am Yuqi from Teradata Tokyo. Sorry to disturb but I have some problem
> regarding using spark 2.4 client mode function on kubernetes cluster, so I
> would like to ask if there is some solution to my problem.
>
>
>
> The problem is when I am trying to run spark-shell on kubernetes v1.11.3
> cluster on AWS environment, I couldn’t successfully run stateful set using
> the docker image built from spark 2.4. The error message is showing below.
> The version I am using is spark v2.4.0-rc3.
>
>
>
> Also, I wonder if there is more documentation on how to use client-mode or
> integrate spark-shell on kubernetes cluster. From the documentation on
> https://github.com/apache/spark/blob/v2.4.0-rc3/docs/running-on-kubernetes.md
> there is only a brief description. I understand it’s not the official
> released version yet, but If there is some more documentation, could you
> please share with me?
>
>
>
> Thank you very much for your help!
>
>
>
>
>
> Error msg:
>
> + env
>
> + sed 's/[^=]*=\(.*\)/\1/g'
>
> + sort -t_ -k4 -n
>
> + grep SPARK_JAVA_OPT_
>
> + readarray -t SPARK_EXECUTOR_JAVA_OPTS
>
> + '[' -n '' ']'
>
> + '[' -n '' ']'
>
> + PYSPARK_ARGS=
>
> + '[' -n '' ']'
>
> + R_ARGS=
>
> + '[' -n '' ']'
>
> + '[' '' == 2 ']'
>
> + '[' '' == 3 ']'
>
> + case "$SPARK_K8S_CMD" in
>
> + CMD=("$SPARK_HOME/bin/spark-submit" --conf
> "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client
> "$@")
>
> + exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf
> spark.driver.bindAddress= --deploy-mode client
>
> Error: Missing application resource.
>
> Usage: spark-submit [options]  [app
> arguments]
>
> Usage: spark-submit --kill [submission ID] --master [spark://...]
>
> Usage: spark-submit --status [submission ID] --master [spark://...]
>
> Usage: spark-submit run-example [options] example-class [example args]
>
>
>
>
>
> --
>
> Yuqi Zhang
>
> Software Engineer
>
> m: 090-6725-6573
>
>
> [image: signature_147554612] 
>
> 2 Chome-2-23-1 Akasaka
>
> Minato, Tokyo 107-0052
> teradata.com 
>
>
> This e-mail is from Teradata Corporation and may contain information that
> is confidential or proprietary. If you are not the intended recipient, do
> not read, copy or distribute the e-mail or any attachments. Instead, please
> notify the sender and delete the e-mail and any attachments. Thank you.
>
> Please consider the environment before printing.
>
>
>
>
>