RE: Writing DataFrame filter results to separate files

2016-12-05 Thread Mendelson, Assaf
If you write to Parquet you can use the partitionBy option, which writes a directory for each value of the column (assuming you have a column with the month). From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Tuesday, December 06, 2016 3:33 AM To: Everett Anderson Cc: user
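A minimal sketch of that suggestion, assuming a DataFrame df that already has a month column (column and path names are illustrative, not from the thread):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partitioned-write").getOrCreate()
val df = spark.read.parquet("/data/events")        // assumed input path
df.write
  .partitionBy("month")                            // one month=<value>/ directory per distinct value
  .parquet("/data/events_by_month")                // assumed output path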

driver in queued state and not started

2016-12-05 Thread Yu Wei
Hi Guys, I tried to run Spark on a Mesos cluster. However, when I submit jobs via spark-submit, the driver stays in "Queued" state and is not started. What should I check? Thanks, Jared Software developer Interested in open source software, big data, Linux

Re: Monitoring the User Metrics for a long running Spark Job

2016-12-05 Thread Chawla,Sumit
An example implementation I found is https://github.com/groupon/spark-metrics. Does anyone have experience using this? I am more interested in something for PySpark specifically. The above link pointed to https://github.com/apache/spark/blob/master/conf/metrics.properties.template. I need to

Re: Monitoring the User Metrics for a long running Spark Job

2016-12-05 Thread Miguel Morales
One thing I've done before is to install Datadog's statsd agent on the nodes. Then you can emit metrics and stats to it and build dashboards on Datadog. Sent from my iPhone > On Dec 5, 2016, at 8:17 PM, Chawla,Sumit wrote: > > Hi Manish > > I am specifically looking

Re: Monitoring the User Metrics for a long running Spark Job

2016-12-05 Thread Chawla,Sumit
Hi Manish I am specifically looking for something similar to the following: https://ci.apache.org/projects/flink/flink-docs-release-1.1/apis/common/index.html#accumulators--counters. Flink has this concept of Accumulators, where a user can keep custom counters etc. While the application is
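For comparison, a minimal sketch of Spark's closest built-in equivalent, a named accumulator, which is visible in the web UI while the job runs (names and the toy job are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("accumulator-demo").getOrCreate()
val sc = spark.sparkContext

val badRecords = sc.longAccumulator("badRecords")   // named, so it shows up in the stage UI

sc.parallelize(1 to 1000).foreach { x =>
  if (x % 7 == 0) badRecords.add(1)                 // incremented on the executors
}
println(s"bad records so far: ${badRecords.value}") // read back on the driver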

Re: Unsubscribe

2016-12-05 Thread mehak soni
unsubscribe On Sat, Dec 3, 2016 at 2:55 PM, kote rao wrote: > unsubscribe > -- > *From:* S Malligarjunan > *Sent:* Saturday, December 3, 2016 11:55:41 AM > *To:* user@spark.apache.org > *Subject:* Re: Unsubscribe >

Re: Can spark support exactly once based kafka ? Due to these following question?

2016-12-05 Thread Cody Koeninger
Have you read / watched the materials linked from https://github.com/koeninger/kafka-exactly-once On Mon, Dec 5, 2016 at 4:17 AM, Jörn Franke wrote: > You need to do the book keeping of what has been processed yourself. This > may mean roughly the following (of course the

Re: Back-pressure to Spark Kafka Streaming?

2016-12-05 Thread Cody Koeninger
If you want finer-grained max rate setting, SPARK-17510 got merged a while ago. There's also SPARK-18580 which might help address the issue of starting backpressure rate for the first batch. On Mon, Dec 5, 2016 at 4:18 PM, Liren Ding wrote: > Hey all, > > Does

Re: Monitoring the User Metrics for a long running Spark Job

2016-12-05 Thread manish ranjan
http://spark.apache.org/docs/latest/monitoring.html You can even install tools like dstat, iostat, and iotop; collectd can provide fine-grained profiling on individual nodes. If

Re: custom generate spark application id

2016-12-05 Thread Jakob Odersky
The app ID is assigned internally by spark's task scheduler https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskScheduler.scala#L35. You could probably change the naming, however I'm pretty sure that the ID will always have to be unique for a context on a

RE: [Spark Streaming] How to join two messages in Spark Streaming (probably the messages are in different RDDs)?

2016-12-05 Thread Sanchuan Cheng (sancheng)
smime.p7m Description: S/MIME encrypted message

[Spark Streaming] How to join two messages in Spark Streaming (probably the messages are in different RDDs)?

2016-12-05 Thread sancheng
Hello, we are trying to use Spark Streaming for a billing-related application. Our case is that we need to correlate two different messages and calculate the time interval between the two messages. The two messages should be in the same partition but probably not in the same RDD, it seems

Re: Writing DataFrame filter results to separate files

2016-12-05 Thread Michael Armbrust
> > 1. In my case, I'd need to first explode my data by ~12x to assign each > record to multiple 12-month rolling output windows. I'm not sure Spark SQL > would be able to optimize this away, combining it with the output writing > to do it incrementally. > You are right, but I wouldn't worry

Re: Writing DataFrame filter results to separate files

2016-12-05 Thread Everett Anderson
Hi, Thanks for the reply! On Mon, Dec 5, 2016 at 1:30 PM, Michael Armbrust wrote: > If you repartition($"column") and then do .write.partitionBy("column") you > should end up with a single file for each value of the partition column. > I have two concerns there: 1. In

Re: Spark-shell doesn't see changes coming from Kafka topic

2016-12-05 Thread Otávio Carvalho
In the end, the mistake I made was that I forgot to set up the proper export AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY on the machine where I was running the spark-shell. Nevertheless, thanks for answering, Tathagata Das. Otávio. 2016-12-01 17:36 GMT-02:00 Tathagata Das

custom generate spark application id

2016-12-05 Thread rtijoriwala
Hi, We would like to control how Spark generates its application id. Currently, it changes every time we restart the job and is also hard to correlate. For example, it looks like this: app-20161129054045-0096. I would like to control how this id gets generated so it's easier to track when JMX metrics

Please unsubscribe me

2016-12-05 Thread Srinivas Potluri

Monitoring the User Metrics for a long running Spark Job

2016-12-05 Thread Chawla,Sumit
Hi All I have a long running job which takes hours and hours to process data. How can I monitor the operational efficiency of this job? I am interested in something like Storm/Flink style user metrics/aggregators, which I can monitor while my job is running. Using these metrics I want to

Re: Back-pressure to Spark Kafka Streaming?

2016-12-05 Thread Richard Startin
I've seen the feature work very well. For tuning, you've got: spark.streaming.backpressure.pid.proportional (defaults to 1, non-negative) - weight for response to "error" (change between last batch and this batch) spark.streaming.backpressure.pid.integral (defaults to 0.2, non-negative) -
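For reference, a sketch of how these settings are typically supplied; the values below are illustrative defaults, not tuning advice:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("backpressure-demo")
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.backpressure.pid.proportional", "1.0")
  .set("spark.streaming.backpressure.pid.integral", "0.2")
  .set("spark.streaming.backpressure.pid.derived", "0.0")
  // per-partition ceiling for direct Kafka streams; also bounds the first batch
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")

val ssc = new StreamingContext(conf, Seconds(10))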

Streaming audio files

2016-12-05 Thread habibbaluwala
I have a HDFS folder that keeps getting new audio files every few minutes. My objective is to detect new files that have been added to the folder and then process the files in parallel without splitting them into multiple blocks. Basically, if there are 4 new audio files added, I want the Spark
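One common way to get whole-file processing (a hedged sketch, not from the thread) is sc.binaryFiles, which yields one record per file so nothing is split across tasks; detecting newly arrived files is left out here:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("audio-batch").getOrCreate()
val sc = spark.sparkContext

// Each element is (path, PortableDataStream): one whole file per record.
val audioFiles = sc.binaryFiles("hdfs:///incoming/audio/*.wav")   // illustrative path

audioFiles.foreach { case (path, stream) =>
  val bytes = stream.toArray()   // full file contents
  // decode / process the audio here
  println(s"$path -> ${bytes.length} bytes")
}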

Re: How do I flatten JSON blobs into a Data Frame using Spark/Spark SQL

2016-12-05 Thread Hyukjin Kwon
Hi Kant, Ah, I thought you wanted to find a workaround to do so. Then wouldn't this easily reach the same goal with the workaround, without such a new API? Thanks. On 6 Dec 2016 4:11 a.m., "kant kodali" wrote: > Hi Kwon, > > Thanks for this but Isn't this

Back-pressure to Spark Kafka Streaming?

2016-12-05 Thread Liren Ding
Hey all, Does backpressure actually work on Spark Kafka streaming? According to the latest Spark Streaming document: http://spark.apache.org/docs/latest/streaming-programming-guide.html "In Spark 1.5, we have introduced a

Re: Pretrained Word2Vec models

2016-12-05 Thread Robin East
There is a JIRA and a pull request for this - https://issues.apache.org/jira/browse/SPARK-15328 - however there has been no movement on this for a few months. As you’ll see from the pull request review I was able to load the Google News Model but could not get it to run.

Re: Collaborative Filtering Implicit Feedback Impl.

2016-12-05 Thread Sean Owen
As I recall, it is in there in the math, but doesn't appear as an explicit term in the computation. You don't actually materialize the 0 input or the "c=1" corresponding to them. Or: do you have a computation that agrees with the paper but not this code? Put another way, none of this would

Re: Collaborative Filtering Implicit Feedback Impl.

2016-12-05 Thread Jerry Lam
Hi Sean, I agree there is no need for that if the implementation actually assigns c=1 for all missing ratings but from the current implementation of ALS, I don't think it is doing that. The idea is that for missing ratings, they are assigned to c=1 (in the paper) and they do contribute to the

Pretrained Word2Vec models

2016-12-05 Thread Lee Becker
Hi all, Is there a way for Spark to load Word2Vec models trained using gensim or the original C implementation of Word2Vec? Specifically I'd like to play with the Google News model

Re: Writing DataFrame filter results to separate files

2016-12-05 Thread Michael Armbrust
If you repartition($"column") and then do .write.partitionBy("column") you should end up with a single file for each value of the partition column. On Mon, Dec 5, 2016 at 10:59 AM, Everett Anderson wrote: > Hi, > > I have a DataFrame of records with dates, and I'd like
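A minimal sketch of that suggestion (column and path names are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("one-file-per-value").getOrCreate()
import spark.implicits._

val allData = spark.read.parquet("/data/records")   // assumed input

allData
  .repartition($"window")              // co-locate each value in a single partition
  .write
  .partitionBy("window")               // one window=<value>/ directory per value
  .parquet("/data/records_by_window")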

Re: Collaborative Filtering Implicit Feedback Impl.

2016-12-05 Thread Sean Owen
That doesn't mean this 0 value is literally included in the input. There's no need for that. On Tue, Dec 6, 2016 at 4:24 AM Jerry Lam wrote: > Hi Sean, > > I'm referring to the paper (http://yifanhu.net/PUB/cf.pdf) Section 2: > " However, with implicit feedback it would be

Re: Collaborative Filtering Implicit Feedback Impl.

2016-12-05 Thread Jerry Lam
Hi Sean, I'm referring to the paper (http://yifanhu.net/PUB/cf.pdf) Section 2: " However, with implicit feedback it would be natural to assign values to all rui variables. If no action was observed rui is set to zero, thus meaning in our examples zero watching time, or zero purchases on record."

Re: Livy with Spark

2016-12-05 Thread Vadim Semenov
You mean share a single spark context across multiple jobs? https://github.com/spark-jobserver/spark-jobserver does the same On Mon, Dec 5, 2016 at 9:33 AM, Mich Talebzadeh wrote: > Hi, > > Has there been any experience using Livy with Spark to share multiple > Spark

Re: Collaborative Filtering Implicit Feedback Impl.

2016-12-05 Thread Sean Owen
What are you referring to in what paper? implicit input would never materialize 0s for missing values. On Tue, Dec 6, 2016 at 3:42 AM Jerry Lam wrote: > Hello spark users and developers, > > I read the paper from Yahoo about CF with implicit feedback and other > papers

Collaborative Filtering Implicit Feedback Impl.

2016-12-05 Thread Jerry Lam
Hello spark users and developers, I read the paper from Yahoo about CF with implicit feedback and other papers using implicit feedback. Their implementation requires setting the missing ratings to 0. That is, for unobserved ratings, the confidence is set to 1 (c=1). Therefore, the matrix
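For context, a hedged sketch of how implicit-feedback ALS is invoked in spark.ml (toy data; whether the confidence term matches the paper exactly is the question being debated in this thread):

import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("implicit-als").getOrCreate()
import spark.implicits._

// Only observed interactions are materialized; "rating" is the implicit count r_ui.
val interactions = Seq(
  (0, 10, 3.0),   // user 0 interacted with item 10 three times
  (0, 11, 1.0),
  (1, 10, 5.0)
).toDF("user", "item", "rating")

val als = new ALS()
  .setImplicitPrefs(true)   // confidence-weighted objective from the implicit-feedback paper
  .setAlpha(40.0)
  .setRank(10)
  .setUserCol("user")
  .setItemCol("item")
  .setRatingCol("rating")

val model = als.fit(interactions)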

Re: How do I flatten JSON blobs into a Data Frame using Spark/Spark SQL

2016-12-05 Thread kant kodali
Hi Kwon, Thanks for this but Isn't this what Michael suggested? Thanks, kant On Mon, Dec 5, 2016 at 4:45 AM, Hyukjin Kwon wrote: > Hi Kant, > > How about doing something like this? > > import org.apache.spark.sql.functions._ > > // val df2 =

Writing DataFrame filter results to separate files

2016-12-05 Thread Everett Anderson
Hi, I have a DataFrame of records with dates, and I'd like to write all 12-month (with overlap) windows to separate outputs. Currently, I have a loop equivalent to: for ((windowStart, windowEnd) <- windows) { val windowData = allData.filter( getFilterCriteria(windowStart,

Re: Kerberos and YARN - functions in spark-shell and spark submit local but not cluster mode

2016-12-05 Thread Marcelo Vanzin
That's not the error, that's just telling you the application failed. You have to look at the YARN logs for application_1479877553404_0041 to see why it failed. On Mon, Dec 5, 2016 at 10:44 AM, Gerard Casey wrote: > Thanks Marcelo, > > My understanding from a few

Re: Kerberos and YARN - functions in spark-shell and spark submit local but not cluster mode

2016-12-05 Thread Gerard Casey
Thanks Marcelo, My understanding from a few pointers is that this may be due to insufficient read permissions on the keytab or a corrupt keytab. I have checked the read permissions and they are ok. I can see that it is initially configuring correctly: INFO

Re: Kerberos and YARN - functions in spark-shell and spark submit local but not cluster mode

2016-12-05 Thread Marcelo Vanzin
There's generally an exception in these cases, and you haven't posted it, so it's hard to tell you what's wrong. The most probable cause, without the extra information the exception provides, is that you're using the wrong Hadoop configuration when submitting the job to YARN. On Mon, Dec 5, 2016

Re: Kerberos and YARN - functions in spark-shell and spark submit local but not cluster mode

2016-12-05 Thread Jorge Sánchez
Hi Gerard, have you tried running in yarn-client mode? If so, do you still get that same error? Regards. 2016-12-05 12:49 GMT+00:00 Gerard Casey : > Edit. From here > > I > read that

Re: Would spark dataframe/rdd read from external source on every action?

2016-12-05 Thread neil90
Yes it would.

streaming deployment on yarn -emr

2016-12-05 Thread Saurabh Malviya (samalviy)
Hi, We are using EMR and using Oozie right now to deploy the streaming job (workflow). I just want to know the best practice for deploying a streaming job. (In Mesos we deploy using Marathon, but what should be the best approach in YARN that enforces only one instance and restarts it if it fails for any reason?)

How could one specify a Docker image for each job to be used by executors?

2016-12-05 Thread Enno Shioji
Hi, Suppose I have a job that uses some native libraries. I can launch executors using a Docker container and everything is fine. Now suppose I have some other job that uses some other native libraries (and let's assume they just can't co-exist in the same docker image), but I want to execute

Re: Livy with Spark

2016-12-05 Thread Mich Talebzadeh
Thanks Richard for the link. Also its interaction with Zeppelin is great. I believe it is at a very early stage for now Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Spark streaming completed batches statistics

2016-12-05 Thread Richard Startin
Is there any way to get a more computer-friendly version of the completed batches section of the streaming page of the application master? I am very interested in the statistics and am currently screen-scraping... https://richardstartin.com
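One programmatic alternative (not mentioned in the thread, so treat it as a hedged suggestion) is to register a StreamingListener and collect the per-batch statistics yourself:

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Assumes an existing StreamingContext named ssc.
ssc.addStreamingListener(new StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val info = batch.batchInfo
    println(s"batchTime=${info.batchTime} records=${info.numRecords} " +
      s"schedulingDelayMs=${info.schedulingDelay.getOrElse(-1L)} " +
      s"processingTimeMs=${info.processingDelay.getOrElse(-1L)}")
  }
})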

Re: Livy with Spark

2016-12-05 Thread Richard Startin
There is a great write-up on Livy at http://henning.kropponline.de/2016/11/06/ On 5 Dec 2016, at 14:34, Mich Talebzadeh wrote: Hi, Has there been any experience using Livy with Spark to share multiple Spark contexts? thanks Dr

Livy with Spark

2016-12-05 Thread Mich Talebzadeh
Hi, Has there been any experience using Livy with Spark to share multiple Spark contexts? thanks Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Kerberos and YARN - functions in spark-shell and spark submit local but not cluster mode

2016-12-05 Thread Gerard Casey
Edit. From here I read that you can pass a keytab option to spark-submit. I thus tried spark-submit --class "graphx_sp" --master yarn --keytab /path/to/keytab --deploy-mode cluster --executor-memory 13G

Re: How do I flatten JSON blobs into a Data Frame using Spark/Spark SQL

2016-12-05 Thread Hyukjin Kwon
Hi Kant, How about doing something like this? import org.apache.spark.sql.functions._ // val df2 = df.select(df("body").cast(StringType).as("body")) val df2 = Seq("""{"a": 1}""").toDF("body") val schema = spark.read.json(df2.as[String].rdd).schema df2.select(from_json(col("body"),
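For readability, a self-contained sketch of the same workaround (requires a Spark version where from_json is available; names follow the snippet above):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}

val spark = SparkSession.builder().appName("flatten-json").getOrCreate()
import spark.implicits._

val df2 = Seq("""{"a": 1}""").toDF("body")

// Infer the schema from the JSON strings, then parse each blob into a struct column.
val schema = spark.read.json(df2.as[String].rdd).schema
val parsed = df2.select(from_json(col("body"), schema).as("parsed"))
parsed.select("parsed.a").show()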

Kerberos and YARN - functions in spark-shell and spark submit local but not cluster mode

2016-12-05 Thread Gerard Casey
Hello all, I am using Spark with Kerberos authentication. I can run my code using `spark-shell` fine and I can also use `spark-submit` in local mode (e.g. --master local[16]). Both function as expected. local mode - spark-submit --class "graphx_sp" --master local[16] --driver-memory

Re: Access multiple cluster

2016-12-05 Thread Steve Loughran
if the remote filesystem is visible from the other, then a different HDFS value, e.g. hdfs://analytics:8000/historical/, can be used for reads & writes, even if your defaultFS (the one where you get max performance) is, say, hdfs://processing:8000/ - performance will be slower, in both directions
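A hedged sketch of what that looks like in practice (cluster names and paths are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cross-cluster").getOrCreate()

// Fully qualified URIs override fs.defaultFS, so one job can read from one
// cluster and write to another (subject to network visibility and permissions).
val historical = spark.read.parquet("hdfs://analytics:8000/historical/events")
historical.write.parquet("hdfs://processing:8000/staging/events")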

Re: How do I flatten JSON blobs into a Data Frame using Spark/Spark SQL

2016-12-05 Thread kant kodali
Hi Michael, " Personally, I usually take a small sample of data and use schema inference on that. I then hardcode that schema into my program. This makes your spark jobs much faster and removes the possibility of the schema changing underneath the covers." This may or may not work for us. Not

Re: Can spark support exactly once based kafka ? Due to these following question?

2016-12-05 Thread Jörn Franke
You need to do the bookkeeping of what has been processed yourself. This may mean roughly the following (of course the devil is in the details): Write down in ZooKeeper which part of the processing job has been done and for which dataset all the data has been created (do not keep the data

Re: Can spark support exactly once based kafka ? Due to these following question?

2016-12-05 Thread Piotr Smoliński
The boundary is a bit flexible. In terms of observed DStream effective state the direct stream semantics is exactly-once. In terms of external system observations (like message emission), Spark Streaming semantics is at-least-once. Regards, Piotr On Mon, Dec 5, 2016 at 8:59 AM, Michal Šenkýř

Re: How to convert a unix timestamp column into date format(yyyy-MM-dd) ?

2016-12-05 Thread Devi P.V
Thanks. It works. On Mon, Dec 5, 2016 at 2:03 PM, Michal Šenkýř wrote: > Yet another approach: > scala> val df1 = df.selectExpr("client_id", > "from_unixtime(ts/1000,'yyyy-MM-dd') > as ts") > > Mgr. Michal Šenkýř mike.sen...@gmail.com > +420 605 071 818 > > On 5.12.2016

Re: How to convert a unix timestamp column into date format(yyyy-MM-dd) ?

2016-12-05 Thread Michal Šenkýř
Yet another approach: scala> val df1 = df.selectExpr("client_id", "from_unixtime(ts/1000,'yyyy-MM-dd') as ts") Mgr. Michal Šenkýř mike.sen...@gmail.com +420 605 071 818 On 5.12.2016 09:22, Deepak Sharma wrote: Another simpler approach will be: scala> val findf = sqlContext.sql("select

Re: How to convert a unix timestamp column into date format(yyyy-MM-dd) ?

2016-12-05 Thread Deepak Sharma
Another simpler approach will be: scala> val findf = sqlContext.sql("select client_id,from_unixtime(ts/1000,'yyyy-MM-dd') ts from ts") findf: org.apache.spark.sql.DataFrame = [client_id: string, ts: string] scala> findf.show ++--+ | client_id|ts|

Re: How to convert a unix timestamp column into date format(yyyy-MM-dd) ?

2016-12-05 Thread Deepak Sharma
This is the correct way to do it. The timestamp that you mentioned was not correct: scala> val ts1 = from_unixtime($"ts"/1000, "yyyy-MM-dd") ts1: org.apache.spark.sql.Column = fromunixtime((ts / 1000),yyyy-MM-dd) scala> val finaldf = df.withColumn("ts1",ts1) finaldf:

Re: How to convert a unix timestamp column into date format(yyyy-MM-dd) ?

2016-12-05 Thread Deepak Sharma
This is how you can do it in Scala: scala> val ts1 = from_unixtime($"ts", "yyyy-MM-dd") ts1: org.apache.spark.sql.Column = fromunixtime(ts,yyyy-MM-dd) scala> val finaldf = df.withColumn("ts1",ts1) finaldf: org.apache.spark.sql.DataFrame = [client_id: string, ts: string, ts1: string] scala>

Re: How to convert a unix timestamp column into date format(yyyy-MM-dd) ?

2016-12-05 Thread Devi P.V
Hi, Thanks for replying to my question. I am using Scala. On Mon, Dec 5, 2016 at 1:20 PM, Marco Mistroni wrote: > Hi > In Python you can use datetime.fromtimestamp(..). > strftime('%Y%m%d') > Which Spark API are you using? > Kr > > On 5 Dec 2016 7:38 am, "Devi