Hi,
I'm new to Spark and trying to understand its inner workings in the
scenarios mentioned below. I'm using PySpark and Spark 2.1.1.
spark.read.json():
I am executing the line
“spark.read.json(‘s3a:///*.json’)” on a cluster with three
worker nodes (AWS M4.xlarge
Could any of the experts kindly advise?
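For context, here is a minimal PySpark sketch of the call under discussion; the
bucket name, app name, and session setup are placeholders rather than details
from the original message:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-json-example").getOrCreate()

# Spark lists every key matching the glob, samples the JSON to infer a
# schema, and returns a DataFrame backed by those files.
df = spark.read.json("s3a://my-bucket/*.json")  # bucket name is a placeholder
df.printSchema()
print(df.count())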
On Fri, May 19, 2017 at 6:00 PM, Jayadeep J wrote:
> Hi,
>
> I would appreciate some advice regarding an issue we are facing in
> Streaming Kafka Direct Consumer.
>
> We have recently upgraded our application with Kafka Direct
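For readers following the thread, the Kafka direct stream being referred to is
typically created along these lines in PySpark (a rough sketch; the broker
address, topic name, and batch interval are placeholders):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="kafka-direct-example")
ssc = StreamingContext(sc, batchDuration=10)

# Direct approach: no receivers, one RDD partition per Kafka partition,
# and offsets tracked by Spark itself instead of ZooKeeper.
stream = KafkaUtils.createDirectStream(
    ssc,
    topics=["events"],                                    # placeholder topic
    kafkaParams={"metadata.broker.list": "broker1:9092"}  # placeholder broker
)
stream.map(lambda kv: kv[1]).count().pprint()

ssc.start()
ssc.awaitTermination()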
Hi All,
I specified hdfsCheckPointDir = /usr/local/hadoop/checkpoint as you can see
below; however, I don't see a checkpoint directory under my hadoop_home =
/usr/local/hadoop on either the datanodes or the namenode. However, on the
datanode machine there seems to be some data under
What's the value of "hdfsCheckPointDir"? Could you list this directory on
HDFS and report the files there?
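For what it's worth, in Structured Streaming the checkpoint directory is the
path handed to the sink, and it is typically resolved against the default
filesystem (HDFS here) rather than the local Hadoop install directory, so
`hdfs dfs -ls /usr/local/hadoop/checkpoint` is the place to look. A rough
sketch of how it is usually wired up ('df' stands in for some streaming
DataFrame; the sink and output mode are illustrative):

# Illustrative only: 'df' is assumed to be a streaming DataFrame.
hdfsCheckPointDir = "/usr/local/hadoop/checkpoint"  # resolved against HDFS, not the local disk

query = (df.writeStream
           .outputMode("complete")
           .format("console")
           .option("checkpointLocation", hdfsCheckPointDir)
           .start())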
On Wed, May 24, 2017 at 3:50 PM, Michael Armbrust
wrote:
> -dev
>
> Have you tried clearing out the checkpoint directory? Can you also give
> the full stack trace?
-dev
Have you tried clearing out the checkpoint directory? Can you also give
the full stack trace?
On Wed, May 24, 2017 at 3:45 PM, kant kodali wrote:
> Even if I do simple count aggregation like below I get the same error as
>
Even if I do a simple count aggregation like the one below, I get the same error as
https://issues.apache.org/jira/browse/SPARK-19268
Dataset<Row> df2 = df1.groupBy(functions.window(df1.col("Timestamp5"),
"24 hours", "24 hours"), df1.col("AppName")).count();
On Wed, May 24, 2017 at 3:35 PM, kant kodali
What version of Spark are you using? Can you provide your logs with DEBUG
logging enabled? You should see these logs:
https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L475
On Wed, May 24, 2017 at 10:07 AM, Sudhir Jangir
Hi Ankur,
thank you for answering. But my problem is not that I'm stuck in a local
extremum, but rather the reproducibility of kmeans. What I'm trying to
achieve is: when the input data and all the parameters stay the same,
especially the seed, I want to get the exact same results. Even though the
I agree with what Ankur said. The kmeans seeding routine (the 'takeSample'
method) runs in parallel, so each partition draws its sample points from its
local data, which makes the result depend on the partitioning (it is not
partition-agnostic). The seeding method is based on the Bahmani et al.
k-means|| algorithm, which gives an approximation
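As a concrete illustration of the seed discussion, a minimal pyspark.ml sketch
(the k, seed, and 'features_df' DataFrame are illustrative placeholders):

from pyspark.ml.clustering import KMeans

# Fixing the seed pins the random sampling, but the k-means|| seeding still
# draws candidate centers per partition, so getting identical results also
# requires identical input partitioning.
kmeans = KMeans(k=10, seed=42, initMode="k-means||", maxIter=20)
model = kmeans.fit(features_df)  # features_df: DataFrame with a 'features' vector column
print(model.clusterCenters())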
Hi all,
I'm poking around the Pyspark.Broadcast module, and I notice that one can pass
in a `pickle_registry` and a `path`. The documentation does not explain what
the pickle registry is for, and I'm curious how to use it and whether there are
any advantages to it.
Thanks,
Michael Mansour
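Not an answer to the pickle_registry question, but for context: the class is
normally constructed indirectly through SparkContext.broadcast, which supplies
those constructor arguments itself. A minimal usage sketch (values are
illustrative):

from pyspark import SparkContext

sc = SparkContext(appName="broadcast-example")

# SparkContext.broadcast() builds the pyspark Broadcast object for you; the
# pickle_registry / path constructor arguments are filled in internally.
lookup = sc.broadcast({"a": 1, "b": 2})

rdd = sc.parallelize(["a", "b", "a"])
total = rdd.map(lambda key: lookup.value[key]).sum()
print(total)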
Hi Christoph,
I am not an expert in ML and have not used Spark KMeans, but your problem
seems to be an issue of local minima vs. the global minimum. You should run
k-means multiple times with random starting points, and also try multiple
values of K (unless you are already sure of it).
Hope this helps.
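A rough sketch of what "run it multiple times" can look like with pyspark.ml
(the k values, seeds, and 'features_df' are illustrative, not from this
thread):

from pyspark.ml.clustering import KMeans

# For each candidate k, run with a few random seeds and keep the lowest
# within-set sum of squared errors; then compare the costs across k by hand
# (e.g. look for an "elbow") rather than just taking the overall minimum.
for k in (2, 5, 10, 20):
    costs = []
    for seed in (1, 17, 42):
        model = KMeans(k=k, seed=seed).fit(features_df)
        costs.append(model.computeCost(features_df))
    print("k=%d best cost=%.3f" % (k, min(costs)))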
We are facing an issue with a Kerberos-enabled Hadoop/CDH cluster.
We are trying to run a streaming job on yarn-cluster, which interacts with
Kafka (direct stream) and HBase.
Somehow, we are not able to connect to HBase in cluster mode. We use a keytab
to log in to HBase.
This is what we
I downloaded a source code distribution of spark-2.1.0 and did the install
again, and this time I did not see any warnings. I must have used some
modified code before. Thank you for the help!
On Tue, May 23, 2017 at 11:19 AM, Mark Hamstra
wrote:
>
>
> On Tue, May 23,
Hi
I just joined a project that runs on spark-1.6.1 and I have no prior spark
experience.
The project build is quite fragile when it comes to runtime dependencies.
Often the project builds fine, but after deployment we end up with
ClassNotFoundExceptions or NoSuchMethodErrors when submitting a
Hi Anastasios,
thanks for the reply, but caching doesn't seem to change anything.
After further investigation it really seems that the RDD#takeSample method is
the cause of the non-reproducibility.
Is this considered a bug, and should I open an issue for it?
BTW: my example script contains a
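To make the suspicion concrete, a small check along these lines shows the
partition dependence (a sketch, not the original example script; 'sc' is an
existing SparkContext):

rdd = sc.parallelize(range(1000), 8)

# Same seed, same data, same partitioning: the sample is repeatable.
print(rdd.takeSample(False, 5, seed=42))
print(rdd.takeSample(False, 5, seed=42))

# Same seed and data but a different partitioning: the sample can differ,
# because the sampling is driven by per-partition random streams.
print(rdd.repartition(3).takeSample(False, 5, seed=42))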