[ANNOUNCE] Apache Spark 2.1.1

2017-05-02 Thread Michael Armbrust
We are happy to announce the availability of Spark 2.1.1! Apache Spark 2.1.1 is a maintenance release, based on the branch-2.1 maintenance branch of Spark. We strongly recommend that all 2.1.x users upgrade to this stable release. To download Apache Spark 2.1.1, visit

Re: Joins in Spark

2017-05-02 Thread Angel Francisco Orta
Sorry, I had a typo; I meant repartitionby("fieldofjoin"). On 2 May 2017, 9:44 p.m., "KhajaAsmath Mohammed" wrote: Hi Angel, I am trying the code below, but I don't see the partition on the DataFrame. val iftaGPSLocation_df = sqlContext.sql(iftaGPSLocQry)
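
[Editor's note: the DataFrame API has no repartitionby(String) method; the equivalent in the Scala API is repartition on a column expression. A minimal sketch of the intended call — the DataFrame and field name are placeholders from the thread:

    import org.apache.spark.sql.functions.col

    // Repartition the join side on the join key before joining.
    val repartitioned = iftaGPSLocation_df.repartition(col("fieldofjoin"))
]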

Re: Joins in Spark

2017-05-02 Thread KhajaAsmath Mohammed
Hi Angel, I am trying the code below, but I don't see the partition on the DataFrame. val iftaGPSLocation_df = sqlContext.sql(iftaGPSLocQry) import sqlContext._ import sqlContext.implicits._ datapoint_prq_df.join(geoCacheLoc_df) val tableA =

Re: [Spark Streaming] Dynamic Broadcast Variable Update

2017-05-02 Thread Tim Smith
One, I think you should take this to the Spark developer list. Two, I suspect broadcast variables aren't the best solution for the use case you describe. Maybe an in-memory data/object/file store like Tachyon is a better fit. Thanks, Tim On Tue, May 2, 2017 at 11:56 AM, Nipun Arora

[Spark Streaming] Dynamic Broadcast Variable Update

2017-05-02 Thread Nipun Arora
Hi All, To support our Spark Streaming based anomaly detection tool, we have made a patch in Spark 1.6.2 to dynamically update broadcast variables. I'll first explain our use case, which I believe should be common to many people using Spark Streaming applications. Broadcast variables are
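
[Editor's note: for readers on stock Spark without such a patch, a common workaround is to re-create the broadcast variable on the driver between batches rather than mutate it in place. A minimal sketch, assuming a hypothetical loadReferenceData() loader and a 60-second refresh interval (neither is from the thread):

    import org.apache.spark.SparkContext
    import org.apache.spark.broadcast.Broadcast

    // Periodically replace the broadcast from the driver side.
    object RefreshableBroadcast {
      @volatile private var instance: Broadcast[Map[String, String]] = _
      private var lastRefresh = 0L

      def get(sc: SparkContext, ttlMs: Long = 60000L): Broadcast[Map[String, String]] =
        synchronized {
          val now = System.currentTimeMillis()
          if (instance == null || now - lastRefresh > ttlMs) {
            if (instance != null) instance.unpersist(blocking = false)
            instance = sc.broadcast(loadReferenceData()) // hypothetical loader
            lastRefresh = now
          }
          instance
        }

      private def loadReferenceData(): Map[String, String] = Map.empty // placeholder
    }

    // Inside the streaming job, fetch the (possibly refreshed) broadcast per batch:
    // stream.foreachRDD { rdd =>
    //   val ref = RefreshableBroadcast.get(rdd.sparkContext)
    //   rdd.foreach(rec => ref.value.get(rec) /* ... */)
    // }
]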

Re: do we need to enable writeAheadLogs for DirectStream as well or is it only for indirect stream?

2017-05-02 Thread Cody Koeninger
You don't need write-ahead logs for the direct stream. On Tue, May 2, 2017 at 11:32 AM, kant kodali wrote: > Hi All, > > I need some fault tolerance for my stateful computations and I am wondering > why we need to enable writeAheadLogs for DirectStream like Kafka (for > Indirect
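
[Editor's note: with the direct approach, recovery comes from checkpointed offsets plus Kafka's own retention, not from a receiver write-ahead log. A minimal sketch against the spark-streaming-kafka-0-10 integration; the checkpoint path, broker, group id, and topic are placeholders:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val checkpointDir = "hdfs:///tmp/checkpoints" // placeholder path

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("direct-stream-no-wal")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir) // checkpointing, not a write-ahead log

      val kafkaParams = Map[String, Object](
        "bootstrap.servers" -> "broker:9092", // placeholder
        "key.deserializer" -> classOf[StringDeserializer],
        "value.deserializer" -> classOf[StringDeserializer],
        "group.id" -> "example-group",
        "auto.offset.reset" -> "latest")

      val stream = KafkaUtils.createDirectStream[String, String](
        ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

      stream.map(_.value).print()
      ssc
    }

    // On driver restart, offsets are recovered from the checkpoint and the
    // messages are re-pulled from Kafka itself.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
]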

Re: Initialize Gaussian Mixture Model using Spark ML dataframe API

2017-05-02 Thread Tim Smith
Yes, I noticed these open issues, both with KMeans and GMM: https://issues.apache.org/jira/browse/SPARK-13025 Thanks, Tim On Mon, May 1, 2017 at 9:01 PM, Yanbo Liang wrote: > Hi Tim, > > The Spark ML API doesn't support setting an initial model for GMM currently. I wish > we can get
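
[Editor's note: for anyone needing a warm start today, the RDD-based mllib API (unlike spark.ml as of 2.1) does expose an initial model. A minimal sketch with toy data; `sc` is a SparkContext:

    import org.apache.spark.mllib.clustering.{GaussianMixture, GaussianMixtureModel}
    import org.apache.spark.mllib.linalg.Vectors

    val data = sc.parallelize(Seq(
      Vectors.dense(1.0, 2.0), Vectors.dense(1.1, 2.1),
      Vectors.dense(8.0, 9.0), Vectors.dense(8.1, 9.1)))

    // e.g. a model obtained from a previous run
    val warmStart: GaussianMixtureModel = new GaussianMixture().setK(2).run(data)

    val refined = new GaussianMixture()
      .setK(2)
      .setInitialModel(warmStart) // available in mllib, not in ml as of 2.1
      .run(data)
]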

Re: Driver spins hours in query plan optimization

2017-05-02 Thread Everett Anderson
It seems https://issues.apache.org/jira/browse/SPARK-13346 is likely the same issue. For some people persist() doesn't work, and they have to convert to RDDs and back. On Fri, Apr 14, 2017 at 1:39 PM, Everett Anderson wrote: > Hi, > > We keep hitting a
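
[Editor's note: the RDD round-trip workaround mentioned above rebuilds the DataFrame from its RDD and schema, which discards the accumulated logical plan so the optimizer starts from scratch. A minimal sketch, assuming a DataFrame `df` and a SQLContext:

    // Truncate the lineage of an iteratively built DataFrame; per SPARK-13346,
    // persist() alone reportedly does not always help here.
    val truncated = sqlContext.createDataFrame(df.rdd, df.schema)
]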

Re: Schema Evolution for nested Dataset[T]

2017-05-02 Thread Mike Wheeler
Hi Michael, Thank you for the suggestions. I am wondering how I can make `withColumn` handle a nested structure. For example, below is my code to generate the data. I basically add the `age` field to `Person2`, which is nested in an Array for `Course2`. Then I want to fill in 0 for age with age
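
[Editor's note: since withColumn can only replace top-level columns, one way to set the nested field is a typed map that rebuilds the array. A sketch under assumed case classes mirroring the thread's Person2/Course2 shape; field names other than `age` are guesses:

    import org.apache.spark.sql.{Dataset, SparkSession}

    case class Person2(name: String, age: Option[Int])
    case class Course2(title: String, students: Array[Person2])

    val spark = SparkSession.builder.getOrCreate()
    import spark.implicits._

    val courses: Dataset[Course2] = Seq(
      Course2("algebra", Array(Person2("ann", None), Person2("bob", Some(12))))).toDS()

    // withColumn cannot reach inside the array, so rebuild it element-wise,
    // defaulting a missing age to 0.
    val filled = courses.map { c =>
      c.copy(students = c.students.map(p => p.copy(age = Some(p.age.getOrElse(0)))))
    }
]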

Re: Joins in Spark

2017-05-02 Thread Angel Francisco Orta
Have you tried partitioning by the join field and running it in segments, filtering both tables on the same segment of data? Example: val tableA = dfA.repartition($"joinField").filter("firstSegment") val tableB = dfB.repartition($"joinField").filter("firstSegment") tableA.join(tableB) On 2
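
[Editor's note: spelled out, the suggestion looks roughly like the sketch below (Spark 2.x); dfA, dfB, the segment column, and its values are all placeholders:

    import org.apache.spark.sql.functions.col

    // Repartition both sides on the join key, then join one segment at a
    // time and union the pieces back together.
    val segments = Seq("seg1", "seg2") // hypothetical segment values

    val joined = segments.map { seg =>
      val tableA = dfA.repartition(col("joinField")).filter(col("segment") === seg)
      val tableB = dfB.repartition(col("joinField")).filter(col("segment") === seg)
      tableA.join(tableB, "joinField")
    }.reduce(_ union _)
]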

Re: Joins in Spark

2017-05-02 Thread KhajaAsmath Mohammed
Table 1 (192 GB) is partitioned by year and month; the 192 GB of data is for one month, i.e. April. Table 2 (92 GB) is not partitioned. I have to perform a join on these tables now. On Tue, May 2, 2017 at 1:27 PM, Angel Francisco Orta < angel.francisco.o...@gmail.com> wrote: > Hello, > > Are the

Re: Joins in Spark

2017-05-02 Thread Angel Francisco Orta
Hello, Are the tables partitioned? If yes, what is the partition field? Thanks. On 2 May 2017, 8:22 p.m., "KhajaAsmath Mohammed" wrote: Hi, I am trying to join two big tables in Spark and the job is running for quite a long time without any results. Table 1:

Joins in Spark

2017-05-02 Thread KhajaAsmath Mohammed
Hi, I am trying to join two big tables in Spark and the job is running for quite a long time without any results. Table 1: 192 GB. Table 2: 92 GB. Does anyone have a better solution to get the results faster? Thanks, Asmath

do we need to enable writeAheadLogs for DirectStream as well or is it only for indirect stream?

2017-05-02 Thread kant kodali
Hi All, I need some fault tolerance for my stateful computations, and I am wondering why we need to enable writeAheadLogs for a direct stream like Kafka (for an indirect stream it makes sense). In the case of driver failure, a direct stream such as Kafka's can pull the messages again from the last committed

Re: --jars does not take remote jar?

2017-05-02 Thread Nan Zhu
I see. Thanks! On Tue, May 2, 2017 at 9:12 AM, Marcelo Vanzin wrote: > On Tue, May 2, 2017 at 9:07 AM, Nan Zhu wrote: > > I have no easy way to pass the jar path to those forked Spark > > applications? (except that I download the jar from a remote path

Re: --jars does not take remote jar?

2017-05-02 Thread Marcelo Vanzin
On Tue, May 2, 2017 at 9:07 AM, Nan Zhu wrote: > I have no easy way to pass the jar path to those forked Spark > applications? (except that I download the jar from a remote path to a local temp > dir after resolving some permission issues, etc.?) Yes, that's the only way
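
[Editor's note: for what it's worth, that download step can be scripted with the Hadoop FileSystem API before forking spark-submit. A hedged sketch; the HDFS path and jar name are placeholders:

    import java.nio.file.Files
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Copy a remote jar to a local temp dir so the forked yarn-client
    // application can reference it via --jars as a local path.
    val remoteJar = new Path("hdfs:///libs/app.jar") // placeholder
    val localJar = new Path(
      Files.createTempDirectory("spark-jars").resolve("app.jar").toString)

    val fs = FileSystem.get(remoteJar.toUri, new Configuration())
    fs.copyToLocalFile(remoteJar, localJar)
    // Pass localJar.toString to the forked spark-submit's --jars.
]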

Re: --jars does not take remote jar?

2017-05-02 Thread Nan Zhu
Thanks for the reply! If I have an application master which starts some Spark applications by forking processes (in yarn-client mode), essentially I have no easy way to pass the jar path to those forked Spark applications? (except that I download the jar from a remote path to a local temp dir after

Re: --jars does not take remote jar?

2017-05-02 Thread Marcelo Vanzin
Remote jars are added to executors' classpaths, but not the driver's. In YARN cluster mode, they would also be added to the driver's classpath. On Tue, May 2, 2017 at 8:43 AM, Nan Zhu wrote: > Hi, all > > For some reason, I tried to pass an HDFS path to the --jars
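
[Editor's note: for the forked-application scenario above, switching the children to cluster mode sidesteps the problem, since the driver then runs where remote jars are distributed. A sketch using the SparkLauncher API rather than a raw spark-submit command; all paths and class names are placeholders:

    import org.apache.spark.launcher.SparkLauncher

    // In yarn-cluster mode the driver runs inside the cluster, so a remote
    // --jars entry reaches its classpath too.
    val handle = new SparkLauncher()
      .setAppResource("hdfs:///apps/my-app.jar")
      .setMainClass("com.example.Main")
      .setMaster("yarn")
      .setDeployMode("cluster")
      .addJar("hdfs:///libs/dep.jar")
      .startApplication()
]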

--jars does not take remote jar?

2017-05-02 Thread Nan Zhu
Hi, all For some reason, I tried to pass an HDFS path to the --jars option in spark-submit. According to the documentation, http://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management, --jars should accept a remote path. However, in the implementation,

OutOfMemoryError

2017-05-02 Thread TwUxTLi51Nus
Hi Spark Users, I have a dataset with ~5M rows x 20 columns, containing a groupID and a rowID. My goal is to check whether (some) columns contain more than a fixed fraction (say, 50%) of missing (null) values within a group. If this is found, the entire column is set to missing (null), for
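
[Editor's note: one way to express that check without collecting to the driver is a per-group aggregate joined back onto the data. A minimal sketch for a single column; `df`, "groupID", and "value" are placeholder names, with the 50% threshold from the message:

    import org.apache.spark.sql.functions.{avg, col, lit, when}

    // Fraction of nulls in `value` per group, joined back so the column can
    // be nulled out wherever the fraction exceeds the threshold.
    val threshold = 0.5

    val fractions = df.groupBy("groupID")
      .agg(avg(when(col("value").isNull, 1.0).otherwise(0.0)).as("nullFrac"))

    val result = df.join(fractions, "groupID")
      .withColumn("value",
        when(col("nullFrac") > threshold, lit(null)).otherwise(col("value")))
      .drop("nullFrac")
]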