[ANNOUNCE] Apache Spark 2.1.1

2017-05-02 Thread Michael Armbrust
We are happy to announce the availability of Spark 2.1.1! Apache Spark 2.1.1 is a maintenance release, based on the branch-2.1 maintenance branch of Spark. We strongly recommend that all 2.1.x users upgrade to this stable release. To download Apache Spark 2.1.1, visit

Re: Joins in Spark

2017-05-02 Thread Angel Francisco Orta
Sorry, I had a typo; I meant repartitionby("fieldofjoin"). On 2 May 2017, 9:44 p.m., "KhajaAsmath Mohammed" wrote: Hi Angel, I am trying the code below, but I don't see the partition on the DataFrame. val iftaGPSLocation_df = sqlContext.sql(iftaGPSLocQry)
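
[Editor's note: the DataFrame API has no repartitionby(String) method; the equivalent in the Scala API is repartition on a column expression. A minimal sketch of the intended call — the DataFrame and field name are placeholders from the thread:

    import org.apache.spark.sql.functions.col

    // Repartition the join side on the join key before joining.
    val repartitioned = iftaGPSLocation_df.repartition(col("fieldofjoin"))
]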

Re: Joins in Spark

2017-05-02 Thread KhajaAsmath Mohammed
Hi Angel, I am trying the code below, but I don't see the partition on the DataFrame. val iftaGPSLocation_df = sqlContext.sql(iftaGPSLocQry) import sqlContext._ import sqlContext.implicits._ datapoint_prq_df.join(geoCacheLoc_df) val tableA =

Re: [Spark Streaming] Dynamic Broadcast Variable Update

2017-05-02 Thread Tim Smith
One, I think you should take this to the Spark developer list. Two, I suspect broadcast variables aren't the best solution for the use case you describe. Maybe an in-memory data/object/file store like Tachyon is a better fit. Thanks, Tim On Tue, May 2, 2017 at 11:56 AM, Nipun Arora

[Spark Streaming] Dynamic Broadcast Variable Update

2017-05-02 Thread Nipun Arora
Hi All, To support our Spark Streaming based anomaly detection tool, we have made a patch in Spark 1.6.2 to dynamically update broadcast variables. I'll first explain our use case, which I believe should be common to many people using Spark Streaming applications. Broadcast variables are
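
[Editor's note: for readers on stock Spark without such a patch, a common workaround is to re-create the broadcast variable on the driver between batches rather than mutate it in place. A minimal sketch, assuming a hypothetical loadReferenceData() loader and a 60-second refresh interval (neither is from the thread):

    import org.apache.spark.SparkContext
    import org.apache.spark.broadcast.Broadcast

    // Periodically replace the broadcast from the driver side.
    object RefreshableBroadcast {
      @volatile private var instance: Broadcast[Map[String, String]] = _
      private var lastRefresh = 0L

      def get(sc: SparkContext, ttlMs: Long = 60000L): Broadcast[Map[String, String]] =
        synchronized {
          val now = System.currentTimeMillis()
          if (instance == null || now - lastRefresh > ttlMs) {
            if (instance != null) instance.unpersist(blocking = false)
            instance = sc.broadcast(loadReferenceData()) // hypothetical loader
            lastRefresh = now
          }
          instance
        }

      private def loadReferenceData(): Map[String, String] = Map.empty // placeholder
    }

    // Inside the streaming job, fetch the (possibly refreshed) broadcast per batch:
    // stream.foreachRDD { rdd =>
    //   val ref = RefreshableBroadcast.get(rdd.sparkContext)
    //   rdd.foreach(rec => ref.value.get(rec) /* ... */)
    // }
]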

Re: do we need to enable writeAheadLogs for DirectStream as well or is it only for indirect stream?

2017-05-02 Thread Cody Koeninger
You don't need write-ahead logs for the direct stream. On Tue, May 2, 2017 at 11:32 AM, kant kodali wrote: > Hi All, > > I need some fault tolerance for my stateful computations and I am wondering > why we need to enable writeAheadLogs for DirectStream like Kafka (for > Indirect
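
[Editor's note: with the direct approach, recovery comes from checkpointed offsets plus Kafka's own retention, not from a receiver write-ahead log. A minimal sketch against the spark-streaming-kafka-0-10 integration; the checkpoint path, broker, group id, and topic are placeholders:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val checkpointDir = "hdfs:///tmp/checkpoints" // placeholder path

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("direct-stream-no-wal")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir) // checkpointing, not a write-ahead log

      val kafkaParams = Map[String, Object](
        "bootstrap.servers" -> "broker:9092", // placeholder
        "key.deserializer" -> classOf[StringDeserializer],
        "value.deserializer" -> classOf[StringDeserializer],
        "group.id" -> "example-group",
        "auto.offset.reset" -> "latest")

      val stream = KafkaUtils.createDirectStream[String, String](
        ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

      stream.map(_.value).print()
      ssc
    }

    // On driver restart, offsets are recovered from the checkpoint and the
    // messages are re-pulled from Kafka itself.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
]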

Re: Initialize Gaussian Mixture Model using Spark ML dataframe API

2017-05-02 Thread Tim Smith
Yes, I noticed these open issues, both with KMeans and GMM: https://issues.apache.org/jira/browse/SPARK-13025 Thanks, Tim On Mon, May 1, 2017 at 9:01 PM, Yanbo Liang wrote: > Hi Tim, > > The Spark ML API doesn't support setting an initial model for GMM currently. I wish > we can get
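
[Editor's note: for anyone needing a warm start today, the RDD-based mllib API (unlike spark.ml as of 2.1) does expose an initial model. A minimal sketch with toy data; `sc` is a SparkContext:

    import org.apache.spark.mllib.clustering.{GaussianMixture, GaussianMixtureModel}
    import org.apache.spark.mllib.linalg.Vectors

    val data = sc.parallelize(Seq(
      Vectors.dense(1.0, 2.0), Vectors.dense(1.1, 2.1),
      Vectors.dense(8.0, 9.0), Vectors.dense(8.1, 9.1)))

    // e.g. a model obtained from a previous run
    val warmStart: GaussianMixtureModel = new GaussianMixture().setK(2).run(data)

    val refined = new GaussianMixture()
      .setK(2)
      .setInitialModel(warmStart) // available in mllib, not in ml as of 2.1
      .run(data)
]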

Re: Driver spins hours in query plan optimization

2017-05-02 Thread Everett Anderson
It seems https://issues.apache.org/jira/browse/SPARK-13346 is likely the same issue. For some people persist() doesn't work, and they have to convert to RDDs and back. On Fri, Apr 14, 2017 at 1:39 PM, Everett Anderson wrote: > Hi, > > We keep hitting a
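
[Editor's note: the RDD round-trip workaround mentioned above rebuilds the DataFrame from its RDD and schema, which discards the accumulated logical plan so the optimizer starts from scratch. A minimal sketch, assuming a DataFrame `df` and a SQLContext:

    // Truncate the lineage of an iteratively built DataFrame; per SPARK-13346,
    // persist() alone reportedly does not always help here.
    val truncated = sqlContext.createDataFrame(df.rdd, df.schema)
]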

Re: Schema Evolution for nested Dataset[T]

2017-05-02 Thread Mike Wheeler
Hi Michael, Thank you for the suggestions. I am wondering how I can make `withColumn` handle a nested structure. For example, below is my code to generate the data. I basically add the `age` field to `Person2`, which is nested in an Array for `Course2`. Then I want to fill in 0 for age with age
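
[Editor's note: since withColumn can only replace top-level columns, one way to set the nested field is a typed map that rebuilds the array. A sketch under assumed case classes mirroring the thread's Person2/Course2 shape; field names other than `age` are guesses:

    import org.apache.spark.sql.{Dataset, SparkSession}

    case class Person2(name: String, age: Option[Int])
    case class Course2(title: String, students: Array[Person2])

    val spark = SparkSession.builder.getOrCreate()
    import spark.implicits._

    val courses: Dataset[Course2] = Seq(
      Course2("algebra", Array(Person2("ann", None), Person2("bob", Some(12))))).toDS()

    // withColumn cannot reach inside the array, so rebuild it element-wise,
    // defaulting a missing age to 0.
    val filled = courses.map { c =>
      c.copy(students = c.students.map(p => p.copy(age = Some(p.age.getOrElse(0)))))
    }
]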

Re: Joins in Spark

2017-05-02 Thread Angel Francisco Orta
Have you tried partitioning by the join field and running it in segments, filtering both tables on the same segment of data? Example: val tableA = dfA.repartition($"joinField").filter("firstSegment") val tableB = dfB.repartition($"joinField").filter("firstSegment") tableA.join(tableB) On 2
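
[Editor's note: spelled out, the suggestion looks roughly like the sketch below (Spark 2.x); dfA, dfB, the segment column, and its values are all placeholders:

    import org.apache.spark.sql.functions.col

    // Repartition both sides on the join key, then join one segment at a
    // time and union the pieces back together.
    val segments = Seq("seg1", "seg2") // hypothetical segment values

    val joined = segments.map { seg =>
      val tableA = dfA.repartition(col("joinField")).filter(col("segment") === seg)
      val tableB = dfB.repartition(col("joinField")).filter(col("segment") === seg)
      tableA.join(tableB, "joinField")
    }.reduce(_ union _)
]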

Re: Joins in Spark

2017-05-02 Thread KhajaAsmath Mohammed
Table 1 (192 GB) is partitioned by year and month; the 192 GB of data is for one month, i.e. April. Table 2 (92 GB) is not partitioned. I have to perform a join on these tables now. On Tue, May 2, 2017 at 1:27 PM, Angel Francisco Orta < angel.francisco.o...@gmail.com> wrote: > Hello, > > Are the

Re: Joins in Spark

2017-05-02 Thread Angel Francisco Orta
Hello, Are the tables partitioned? If yes, what is the partition field? Thanks. On 2 May 2017, 8:22 p.m., "KhajaAsmath Mohammed" wrote: Hi, I am trying to join two big tables in Spark and the job is running for quite a long time without any results. Table 1:

Joins in Spark

2017-05-02 Thread KhajaAsmath Mohammed
Hi, I am trying to join two big tables in Spark and the job is running for quite a long time without any results. Table 1: 192 GB. Table 2: 92 GB. Does anyone have a better solution to get the results faster? Thanks, Asmath

do we need to enable writeAheadLogs for DirectStream as well or is it only for indirect stream?

2017-05-02 Thread kant kodali
Hi All, I need some fault tolerance for my stateful computations, and I am wondering why we need to enable writeAheadLogs for a direct stream like Kafka (for an indirect stream it makes sense). In the case of driver failure, a direct stream such as Kafka's can pull the messages again from the last committed

Re: --jars does not take remote jar?

2017-05-02 Thread Nan Zhu
I see. Thanks! On Tue, May 2, 2017 at 9:12 AM, Marcelo Vanzin wrote: > On Tue, May 2, 2017 at 9:07 AM, Nan Zhu wrote: > > I have no easy way to pass the jar path to those forked Spark > > applications? (except that I download the jar from a remote path

Re: --jars does not take remote jar?

2017-05-02 Thread Marcelo Vanzin
On Tue, May 2, 2017 at 9:07 AM, Nan Zhu wrote: > I have no easy way to pass the jar path to those forked Spark > applications? (except that I download the jar from a remote path to a local temp > dir after resolving some permission issues, etc.?) Yes, that's the only way
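
[Editor's note: for what it's worth, that download step can be scripted with the Hadoop FileSystem API before forking spark-submit. A hedged sketch; the HDFS path and jar name are placeholders:

    import java.nio.file.Files
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Copy a remote jar to a local temp dir so the forked yarn-client
    // application can reference it via --jars as a local path.
    val remoteJar = new Path("hdfs:///libs/app.jar") // placeholder
    val localJar = new Path(
      Files.createTempDirectory("spark-jars").resolve("app.jar").toString)

    val fs = FileSystem.get(remoteJar.toUri, new Configuration())
    fs.copyToLocalFile(remoteJar, localJar)
    // Pass localJar.toString to the forked spark-submit's --jars.
]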

Re: --jars does not take remote jar?

2017-05-02 Thread Nan Zhu
Thanks for the reply! If I have an application master which starts some Spark applications by forking processes (in yarn-client mode), essentially I have no easy way to pass the jar path to those forked Spark applications? (except that I download the jar from a remote path to a local temp dir after

Re: --jars does not take remote jar?

2017-05-02 Thread Marcelo Vanzin
Remote jars are added to executors' classpaths, but not the driver's. In YARN cluster mode, they would also be added to the driver's classpath. On Tue, May 2, 2017 at 8:43 AM, Nan Zhu wrote: > Hi, all > > For some reason, I tried to pass an HDFS path to the --jars
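
[Editor's note: for the forked-application scenario above, switching the children to cluster mode sidesteps the problem, since the driver then runs where remote jars are distributed. A sketch using the SparkLauncher API rather than a raw spark-submit command; all paths and class names are placeholders:

    import org.apache.spark.launcher.SparkLauncher

    // In yarn-cluster mode the driver runs inside the cluster, so a remote
    // --jars entry reaches its classpath too.
    val handle = new SparkLauncher()
      .setAppResource("hdfs:///apps/my-app.jar")
      .setMainClass("com.example.Main")
      .setMaster("yarn")
      .setDeployMode("cluster")
      .addJar("hdfs:///libs/dep.jar")
      .startApplication()
]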

--jars does not take remote jar?

2017-05-02 Thread Nan Zhu
Hi, all For some reason, I tried to pass an HDFS path to the --jars option in spark-submit. According to the documentation, http://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management, --jars should accept a remote path. However, in the implementation,

OutOfMemoryError

2017-05-02 Thread TwUxTLi51Nus
Hi Spark Users, I have a dataset with ~5M rows x 20 columns, containing a groupID and a rowID. My goal is to check whether (some) columns contain more than a fixed fraction (say, 50%) of missing (null) values within a group. If this is found, the entire column is set to missing (null), for
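
[Editor's note: one way to express that check without collecting to the driver is a per-group aggregate joined back onto the data. A minimal sketch for a single column; `df`, "groupID", and "value" are placeholder names, with the 50% threshold from the message:

    import org.apache.spark.sql.functions.{avg, col, lit, when}

    // Fraction of nulls in `value` per group, joined back so the column can
    // be nulled out wherever the fraction exceeds the threshold.
    val threshold = 0.5

    val fractions = df.groupBy("groupID")
      .agg(avg(when(col("value").isNull, 1.0).otherwise(0.0)).as("nullFrac"))

    val result = df.join(fractions, "groupID")
      .withColumn("value",
        when(col("nullFrac") > threshold, lit(null)).otherwise(col("value")))
      .drop("nullFrac")
]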