Questions regarding Jobs, Stages and Caching

2017-05-24 Thread ramnavan
Hi, I’m new to Spark and trying to understand its inner workings in the scenarios mentioned below. I’m using PySpark and Spark 2.1.1. spark.read.json(): I am executing the line “spark.read.json(‘s3a:///*.json’)” on a cluster with three worker nodes (AWS M4.xlarge
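A minimal PySpark sketch of the scenario described above (the bucket and prefix are placeholders, since the path in the post is redacted; the caching call is only there to illustrate the jobs/stages question):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("read-json-example").getOrCreate()

    # Placeholder bucket/prefix; schema inference over the JSON files runs a job of its own.
    df = spark.read.json("s3a://some-bucket/some-prefix/*.json")

    df.cache()                          # mark for caching; materialized on the first action
    print(df.rdd.getNumPartitions())    # how many partitions the input was split into
    print(df.count())                   # an action: runs a job and fills the cache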

Re: Spark 2 Kafka Direct Stream Consumer Issue

2017-05-24 Thread Jayadeep J
Could any of the experts kindly advise? On Fri, May 19, 2017 at 6:00 PM, Jayadeep J wrote: > Hi, > > I would appreciate some advice regarding an issue we are facing in > the Streaming Kafka Direct Consumer. > > We have recently upgraded our application with Kafka Direct

Re: Running into the same problem as JIRA SPARK-19268

2017-05-24 Thread kant kodali
Hi all, I specified hdfsCheckPointDir = /usr/local/hadoop/checkpoint as you can see below; however, I don't see a checkpoint directory under my hadoop_home = /usr/local/hadoop on either the datanodes or the namenode. However, on the datanode machine there seems to be some data under
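For reference, a minimal sketch of pointing a Structured Streaming query at an explicit checkpoint location (the namenode address, sink, and query itself are assumptions, not taken from the post). Using a fully qualified hdfs:// URI avoids any ambiguity about whether the path is resolved against the local filesystem or HDFS:

    # Hypothetical streaming aggregation query; the checkpointLocation option is the relevant part.
    query = (df.writeStream
               .outputMode("complete")
               .option("checkpointLocation", "hdfs://namenode:8020/usr/local/hadoop/checkpoint")
               .format("console")
               .start())
    query.awaitTermination()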

Re: Running into the same problem as JIRA SPARK-19268

2017-05-24 Thread Shixiong(Ryan) Zhu
What's the value of "hdfsCheckPointDir"? Could you list this directory on HDFS and report the files there? On Wed, May 24, 2017 at 3:50 PM, Michael Armbrust wrote: > -dev > > Have you tried clearing out the checkpoint directory? Can you also give > the full stack trace?

Re: Running into the same problem as JIRA SPARK-19268

2017-05-24 Thread Michael Armbrust
-dev Have you tried clearing out the checkpoint directory? Can you also give the full stack trace? On Wed, May 24, 2017 at 3:45 PM, kant kodali wrote: > Even if I do a simple count aggregation like the one below, I get the same error as >

Re: Running into the same problem as JIRA SPARK-19268

2017-05-24 Thread kant kodali
Even if I do a simple count aggregation like the one below, I get the same error as https://issues.apache.org/jira/browse/SPARK-19268 Dataset df2 = df1.groupBy(functions.window(df1.col("Timestamp5"), "24 hours", "24 hours"), df1.col("AppName")).count(); On Wed, May 24, 2017 at 3:35 PM, kant kodali
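For readers following the PySpark threads, roughly the same aggregation expressed with the Python API (column names taken from the Java snippet above) would be:

    from pyspark.sql import functions as F

    # 24-hour tumbling window keyed by AppName, mirroring the Java snippet above.
    df2 = (df1.groupBy(F.window(F.col("Timestamp5"), "24 hours", "24 hours"),
                       F.col("AppName"))
              .count())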

Re: One question / kerberos, yarn-cluster -> connection to hbase

2017-05-24 Thread Michael Gummelt
What version of Spark are you using? Can you provide your logs with DEBUG logging enabled? You should see these logs: https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L475 On Wed, May 24, 2017 at 10:07 AM, Sudhir Jangir

Re: KMeans Clustering is not Reproducible

2017-05-24 Thread Christoph Brücke
Hi Ankur, thank you for answering, but my problem is not that I’m stuck in a local extremum but rather the reproducibility of k-means. What I’m trying to achieve is: when the input data and all the parameters stay the same, especially the seed, I want to get exactly the same results. Even though the
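A minimal PySpark sketch of the setup being described, with k and the seed pinned (the value of k, the column name, and features_df are placeholders):

    from pyspark.ml.clustering import KMeans

    # Same data, same parameters, same seed -- the expectation is identical cluster centers.
    kmeans = KMeans(featuresCol="features", k=10, seed=42, initMode="k-means||")
    model = kmeans.fit(features_df)   # features_df assumed to hold a vector column "features"
    print(model.clusterCenters())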

Re: KMeans Clustering is not Reproducible

2017-05-24 Thread Yu Zhang
I agree with what Ankur said. The k-means seeding routine (the 'takeSample' method) runs in parallel, so each partition draws its sample points from its local data, which makes the result depend on the partitioning rather than being partition-agnostic. The seeding method is based on the Bahmani et al. k-means|| algorithm, which gives an approximation
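A small sketch of the partition dependence described above: with a fixed seed, RDD.takeSample can return different elements when the same data is spread over a different number of partitions (the numbers are arbitrary; sc is an existing SparkContext):

    data = list(range(100000))

    # Same data, same seed, different partitioning -- the samples may differ.
    s1 = sc.parallelize(data, 4).takeSample(False, 5, seed=1)
    s2 = sc.parallelize(data, 8).takeSample(False, 5, seed=1)
    print(s1, s2, s1 == s2)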

[PySpark] - Broadcast Variable Pickle Registry Usage?

2017-05-24 Thread Michael Mansour (CS)
Hi all, I’m poking around the pyspark Broadcast class, and I notice that one can pass in a `pickle_registry` and a `path`. The documentation does not describe what the pickle registry is for, and I’m curious how to use it and whether there are any advantages to it. Thanks, Michael Mansour
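For context, those constructor arguments appear to be internal plumbing; in ordinary use a broadcast variable is created through SparkContext.broadcast, which handles the pickling itself. A minimal usage sketch (sc is an existing SparkContext):

    # Create a read-only lookup table shared with the executors.
    lookup = sc.broadcast({"a": 1, "b": 2})

    rdd = sc.parallelize(["a", "b", "a"])
    print(rdd.map(lambda k: lookup.value[k]).collect())   # [1, 2, 1]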

Re: KMeans Clustering is not Reproducible

2017-05-24 Thread Ankur Srivastava
Hi Christoph, I am not an expert in ML and have not used Spark KMeans, but your problem seems to be an issue of a local minimum vs. the global minimum. You should run K-means multiple times with random starting points and also try multiple values of K (unless you are already sure). Hope this helps.

One question / kerberos, yarn-cluster -> connection to hbase

2017-05-24 Thread Sudhir Jangir
We are facing an issue with a Kerberos-enabled Hadoop/CDH cluster. We are trying to run a streaming job on yarn-cluster which interacts with Kafka (direct stream) and HBase. Somehow, we are not able to connect to HBase in cluster mode. We use a keytab to log in to HBase. This is what we

Re: scalastyle violation on mvn install but not on mvn package

2017-05-24 Thread Xiangyu Li
I downloaded a source code distribution of spark-2.1.0 and did the install again, and this time I did not see any warnings. I must have used some modified code before. Thank you for the help! On Tue, May 23, 2017 at 11:19 AM, Mark Hamstra wrote: > > > On Tue, May 23,

Dependencies for starting Master / Worker in maven

2017-05-24 Thread Jens Teglhus Møller
Hi, I just joined a project that runs on spark-1.6.1, and I have no prior Spark experience. The project build is quite fragile when it comes to runtime dependencies: often the project builds fine, but after deployment we end up with ClassNotFoundExceptions or NoSuchMethodErrors when submitting a

Re: KMeans Clustering is not Reproducible

2017-05-24 Thread Christoph Bruecke
Hi Anastasios, thanks for the reply, but caching doesn’t seem to change anything. After further investigation, it really seems that the RDD#takeSample method is the cause of the non-reproducibility. Is this considered a bug, and should I open an issue for that? BTW: my example script contains a