Spark billing on shared Clusters

2017-08-23 Thread Jorge Machado
Hi everyone, I was wondering how it is possible to do Spark / YARN accounting on a shared cluster based on resource usage. I found out that there is no way to do that, so I developed hbilling to deal with this. Is anyone interested in a quick demo? More info at: www.hbilling.io

RE: A bug in spark or hadoop RPC with kerberos authentication?

2017-08-23 Thread Sun, Keith
Finally found the root cause and raised a bug issue at https://issues.apache.org/jira/browse/SPARK-21819 Thanks very much. Keith From: Sun, Keith Sent: 22 August 2017 8:48 To: user@spark.apache.org Subject: A bug in spark or hadoop RPC with kerberos authentication? Hello, I met this very weird

RE: A bug in spark or hadoop RPC with kerberos authentication?

2017-08-23 Thread Sun, Keith
Thanks for the reply. I filed an issue in JIRA: https://issues.apache.org/jira/browse/SPARK-21819 I submitted the job from the Java API, not via the spark-submit command line, as we want to offer Spark processing as a service. Configuration hc = new Configuration(false);

Training A ML Model on a Huge Dataframe

2017-08-23 Thread Sea aj
Hi, I am trying to feed a huge dataframe to an ML algorithm in Spark, but it crashes due to a shortage of memory. Is there a way to train the model on a subset of the data in multiple steps? Thanks

Re: Training A ML Model on a Huge Dataframe

2017-08-23 Thread Suzen, Mehmet
It depends on what model you would like to train, but models requiring optimisation could use SGD with mini-batches. See: https://spark.apache.org/docs/latest/mllib-optimization.html#stochastic-gradient-descent-sgd
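
For reference, the RDD-based MLlib wrappers do expose the mini-batch fraction in PySpark; a minimal sketch, where the toy data, the active SparkContext `sc`, and the parameter values are all illustrative:

    from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

    # Hypothetical toy data; in practice this is the large training set.
    training_data = sc.parallelize([
        LabeledPoint(1.0, [1.0, 0.5]),
        LabeledPoint(2.0, [2.0, 1.0]),
    ])

    # Each of the 100 SGD iterations samples only 10% of the rows per
    # gradient step instead of scanning the whole dataset.
    model = LinearRegressionWithSGD.train(
        training_data, iterations=100, step=0.01, miniBatchFraction=0.1)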

filter function works incorrectly (Python)

2017-08-23 Thread AlexanderModestov
Hello all! I'm trying to filter some rows in my DataFrame. I created a list with ids and use the construction: df_new = df.filter(df.user.isin(list_users)) The original DataFrame (df) consists of 29,711,562 rows, but the new one only 5,394,805. OK, I've decided to use another method: df_new =
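
One frequent cause of a silently shrinking row count is a type mismatch between the column and the list (e.g. integer user ids in df but strings in list_users); a sketch of one way to rule that out, reusing the names from the message above:

    from pyspark.sql import functions as F

    # Cast both sides to string before comparing, so int-vs-str ids
    # cannot make isin() match fewer rows than expected.
    df_new = df.filter(
        F.col("user").cast("string").isin([str(u) for u in list_users]))
    print(df_new.count())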

Re: Training A ML Model on a Huge Dataframe

2017-08-23 Thread Sea aj
Thanks for the reply. As far as I understood, mini-batch is not yet supported in the ML library. As for the MLlib mini-batch, I could not find any PySpark API.

Re: Chaining Spark Streaming Jobs

2017-08-23 Thread Michael Armbrust
If you use Structured Streaming and the file sink, you can have a subsequent stream read using the file source. This will maintain exactly-once processing even if there are hiccups or failures.
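
A minimal PySpark sketch of that chaining pattern, assuming an active SparkSession `spark`, an input streaming DataFrame `stage1_df`, its schema `stage1_schema`, and hypothetical paths:

    # Stage 1: write the stream to files; the checkpoint is what gives
    # the file sink its exactly-once guarantee.
    (stage1_df.writeStream
        .format("parquet")
        .option("path", "/data/stage1")
        .option("checkpointLocation", "/chk/stage1")
        .start())

    # Stage 2: a second streaming query reads the same directory as a
    # file source (file sources require an explicit schema).
    stage2_df = (spark.readStream
        .schema(stage1_schema)
        .parquet("/data/stage1"))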

Re: Joining 2 dataframes, getting result as nested list/structure in dataframe

2017-08-23 Thread Michael Armbrust
You can create a nested struct that contains multiple columns using struct(). Here's a pretty complete guide on working with nested data: https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html
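
A short PySpark sketch of the struct() approach, assuming both frames share a key column "id"; the right-hand column names "colA"/"colB" are placeholders:

    from pyspark.sql import functions as F

    # struct() packs several columns into one row-like value;
    # collect_list() then gathers those values per join key, so each
    # left row ends up with a nested list of right rows.
    nested = (cDf.join(dDf, "id")
        .groupBy("id")
        .agg(F.collect_list(F.struct("colA", "colB")).alias("d_rows")))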

Re: PySpark, Structured Streaming and Kafka

2017-08-23 Thread Riccardo Ferrari
Hi Brian, Very nice work you have done! WRT your issue: can you clarify how you are adding the Kafka dependency when using Jupyter? The ClassNotFoundException really tells you about the missing dependency. The IllegalArgumentException error is a bit different; that is simply because you are not

Re: PySpark, Structured Streaming and Kafka

2017-08-23 Thread Shixiong(Ryan) Zhu
You can use `bin/pyspark --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0` to start "pyspark". If you want to use "spark-submit", you also need to provide your Python file.
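
Once the package is on the classpath, the streaming read itself is plain PySpark; a sketch with placeholder broker address and topic name:

    # Kafka source for Structured Streaming.
    df = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "mytopic")
        .load())

    # Kafka records arrive as binary key/value columns.
    values = df.selectExpr("CAST(value AS STRING)")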

Re: PySpark, Structured Streaming and Kafka

2017-08-23 Thread Brian Wylie
Shixiong, Your suggestion works if I use the pyspark shell directly. In this case I want to set up a SparkSession from within my Jupyter notebook. My question/issue is related to this SO question: https://stackoverflow.com/questions/35762459/add-jar-to-standalone-pyspark so basically I want to
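
The approach suggested in that SO thread is to set PYSPARK_SUBMIT_ARGS before the first SparkSession is created in the kernel; a sketch (the app name is illustrative):

    import os
    from pyspark.sql import SparkSession

    # Must run before the JVM starts, i.e. before any SparkSession
    # exists in this notebook kernel.
    os.environ["PYSPARK_SUBMIT_ARGS"] = (
        "--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 "
        "pyspark-shell")

    spark = SparkSession.builder.appName("kafka-notebook").getOrCreate()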

Re: Training A ML Model on a Huge Dataframe

2017-08-23 Thread Suzen, Mehmet
SGD is supported. I see, I assumed you were using Scala. It looks like you can do streaming regression, though I'm not sure of the PySpark API: https://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression
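
The streaming variant does have a PySpark wrapper; a minimal sketch, where the StreamingContext `ssc` and the DStreams of LabeledPoint `training_stream` / `test_stream` are assumed to exist:

    from pyspark.mllib.regression import StreamingLinearRegressionWithSGD

    model = StreamingLinearRegressionWithSGD(stepSize=0.01, numIterations=50)
    model.setInitialWeights([0.0, 0.0])  # one initial weight per feature

    # The model is updated incrementally as each batch arrives, so the
    # full dataset never has to fit in memory at once.
    model.trainOn(training_stream)
    predictions = model.predictOnValues(
        test_stream.map(lambda lp: (lp.label, lp.features)))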

Re: Livy with Spark package

2017-08-23 Thread Saisai Shao
You could set "spark.jars.packages" in the `conf` field of the session POST API (https://github.com/apache/incubator-livy/blob/master/docs/rest-api.md#post-sessions). This is equal to --packages in spark-submit. BTW, you'd better ask Livy questions on u...@livy.incubator.apache.org. Thanks, Jerry
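
For a file-based job like the one in this thread, the equivalent Livy call is a POST carrying the same `conf` field; a sketch in Python, where the endpoint, the HDFS path, and the use of the batches API are assumptions based on the linked REST docs:

    import requests

    payload = {
        "file": "hdfs:///user/me/somefile.py",
        "args": ["2017-08-23 02:00:00"],
        # Carries the dependency exactly like --packages on spark-submit.
        "conf": {"spark.jars.packages": "com.databricks:spark-avro_2.11:3.2.0"},
    }
    r = requests.post("http://livy-host:8998/batches", json=payload)
    print(r.json())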

Livy with Spark package

2017-08-23 Thread ayan guha
Hi, I have a Python program which I invoke as spark-submit --packages com.databricks:spark-avro_2.11:3.2.0 somefile.py "2017-08-23 02:00:00" and it works. Now I want to submit this file using Livy. I could work out most of the stuff (like putting files on HDFS etc.) but am not able to

Re: Livy with Spark package

2017-08-23 Thread ayan guha
Thanks and agreed :)

Joining 2 dataframes, getting result as nested list/structure in dataframe

2017-08-23 Thread JG Perrin
Hi folks, I am trying to join 2 dataframes, but I would like to have the result as a list of rows of the right dataframe (dDf in the example) in a column of the left dataframe (cDf in the example). I made it work with one column, but I am having issues adding more columns / creating a row(?). Seq

ReduceByKeyAndWindow checkpoint recovery issues in Spark Streaming

2017-08-23 Thread SRK
Hi, ReduceByKeyAndWindow checkpoint recovery has issues when trying to recover for the second time. Basically, it loses the reduced value of the previous window, even though that value is present in the old values that need to be inverse-reduced, resulting in the following error. Does anyone have any idea as to
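
For context, a minimal PySpark sketch of the pattern whose recovery is failing (the source, durations, and checkpoint directory are hypothetical):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="windowed-counts")
    ssc = StreamingContext(sc, 30)          # 30s batches
    ssc.checkpoint("hdfs:///chk/windowed")  # required for windowed state

    pairs = ssc.socketTextStream("localhost", 9999).map(lambda w: (w, 1))

    # On recovery, the inverse function (subtraction) is applied to the
    # old values leaving the 10-minute window.
    counts = pairs.reduceByKeyAndWindow(
        lambda a, b: a + b,   # reduce new values entering the window
        lambda a, b: a - b,   # inverse-reduce old values leaving it
        600, 30)              # windowDuration, slideDuration (seconds)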

PySpark, Structured Streaming and Kafka

2017-08-23 Thread Brian Wylie
Hi All, I'm trying the new hotness of using Kafka and Structured Streaming. Resources that I've looked at: - https://spark.apache.org/docs/latest/streaming-programming-guide.html - https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html -