Poor performance reading Hive table made of sequence files

2018-05-01 Thread Patrick McCarthy
I recently ran a query with the following form: select a.*, b.* from some_small_table a inner join ( select things from someother table lateral view explode(s) ss as sss where a_key in (x,y,z) ) b on a.key = b.key where someothercriterion. On Hive, this query took about five minutes. In

keep getting empty table while using saveAsTable() to save DataFrame as table

2018-05-01 Thread nicholasl
Hi, I am using Spark SQL in a cluster and trying to use the CBO (cost-based optimizer) supported in Spark SQL. The dataset I am using is TPC-DS. In order to collect statistics on these data, I first load the data from HDFS to create DataFrames, then use saveAsTable() to save the DataFrames as tables. All the
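For reference, the CBO only uses statistics that have been collected explicitly with ANALYZE TABLE after the tables are saved. A minimal sketch of that step, using a made-up table rather than the actual TPC-DS schema:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cbo-stats-sketch")
  .master("local[1]")
  .config("spark.sql.cbo.enabled", "true") // CBO is off by default in Spark 2.x
  .getOrCreate()

// Hypothetical stand-in for a table loaded from HDFS
val df = spark.range(0, 1000).selectExpr("id", "id % 10 as sk")
df.write.mode("overwrite").saveAsTable("store_sales_demo")

// Collect table- and column-level statistics so the CBO can use them
spark.sql("ANALYZE TABLE store_sales_demo COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE store_sales_demo COMPUTE STATISTICS FOR COLUMNS sk")

// The collected stats are visible under DESCRIBE EXTENDED
spark.sql("DESCRIBE EXTENDED store_sales_demo").show(truncate = false)
```

If the table still comes back empty afterwards, checking `spark.table("store_sales_demo").count()` separates a write problem from a statistics problem.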

Re: org.apache.spark.shuffle.FetchFailedException: Too large frame:

2018-05-01 Thread Ryan Blue
This is usually caused by skew. Sometimes you can work around it by increasing the number of partitions like you tried, but when that doesn’t work you need to change the partitioning that you’re using. If you’re aggregating, try adding an intermediate aggregation. For example, if your query is
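One common reading of “adding an intermediate aggregation” is the salting pattern: aggregate on (key, random salt) first so a hot key is spread over many shuffle partitions, then collapse the partials. A minimal sketch (the data and the salt width of 16 are made up):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[2]").appName("salted-agg").getOrCreate()
import spark.implicits._

// Skewed input: one "hot" key carries almost all of the rows
val df = (Seq.fill(1000)(("hot", 1)) ++ Seq(("cold", 1))).toDF("key", "value")

// Stage 1: aggregate on (key, salt) so the hot key is spread over
// several shuffle partitions instead of landing in a single one
val partial = df
  .withColumn("salt", (rand() * 16).cast("int"))
  .groupBy("key", "salt")
  .agg(sum("value").as("partial_sum"))

// Stage 2: collapse the salted partial sums into the final totals
val result = partial.groupBy("key").agg(sum("partial_sum").as("total"))

result.show()
spark.stop()
```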

Re: Filter one dataset based on values from another

2018-05-01 Thread lsn24
I don't think an inner join will solve my problem. *For each row in* paramsDataset, I need to filter myDataset, and then I need to run a bunch of calculations on the filtered myDataset. Say, for example, paramsDataset has three employee age ranges, e.g. 20-30, 30-50, 50-60, and regions USA, Canada.
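For the range case described above, a non-equi (range) join can still express “for each param row, filter myDataset”: every employee row is matched to each param row whose range contains it, and the per-range calculations can then be grouped by the param columns. A sketch with made-up data:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("range-join").getOrCreate()
import spark.implicits._

// Hypothetical stand-ins for paramsDataset and myDataset
val params = Seq((20, 30, "USA"), (30, 50, "USA"), (50, 60, "Canada"))
  .toDF("minAge", "maxAge", "region")

val employees = Seq(("ann", 25, "USA"), ("bob", 55, "Canada"), ("eve", 40, "USA"))
  .toDF("name", "age", "region")

// Each employee row is matched to every param row whose range contains it
val matched = employees.join(
  params,
  employees("region") === params("region") &&
    employees("age") >= params("minAge") &&
    employees("age") < params("maxAge"))

matched.show()
spark.stop()
```

Grouping `matched` by (minAge, maxAge, region) then gives one calculation per param row without a driver-side loop.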

RE: Fast Unit Tests

2018-05-01 Thread Yeikel Santana
Can you share a sample test case? How are you doing the unit tests? Are you creating the session in a beforeAll block or similar? As far as I know, if you use Spark you will end up with light integration tests rather than “real” unit tests (please correct me if I am wrong). From:
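A sketch of the beforeAll pattern referred to above, assuming ScalaTest (the class, test name, and config values are made up). Sharing one SparkSession across the suite avoids paying session startup per test, which is usually the dominant cost:

```scala
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class TransformSpec extends FunSuite with BeforeAndAfterAll {
  @transient private var spark: SparkSession = _

  override def beforeAll(): Unit = {
    spark = SparkSession.builder()
      .master("local[2]")
      .appName("fast-tests")
      // fewer shuffle partitions = much faster tests on tiny data
      .config("spark.sql.shuffle.partitions", "2")
      .getOrCreate()
  }

  override def afterAll(): Unit = spark.stop()

  test("uppercases names") {
    import spark.implicits._
    val out = Seq("ann", "bob").toDF("name")
      .selectExpr("upper(name) as name")
      .as[String].collect().toSeq
    assert(out == Seq("ANN", "BOB"))
  }
}
```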

Re: all calculations finished, but "VCores Used" value remains at its max

2018-05-01 Thread Felix Cheung
Zeppelin keeps the Spark job alive. This is likely a better question for the Zeppelin project. From: Valery Khamenya Sent: Tuesday, May 1, 2018 4:30:24 AM To: user@spark.apache.org Subject: all calculations finished, but "VCores Used" value

Re: [EXT] [Spark 2.x Core] .collect() size limit

2018-05-01 Thread klrmowse
Okay, I may have found an alternative/workaround to using .collect() for what I am trying to achieve... Initially, for the Spark application I am working on, I would call .collect() on two separate RDDs into a couple of ArrayLists (which was the reason I was asking what the size limit on the
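One alternative that is often suggested when a .collect() result may not fit in driver memory is RDD.toLocalIterator, which pulls one partition at a time to the driver instead of materializing everything at once. A tiny sketch (the data is made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("iter-sketch").getOrCreate()
val rdd = spark.sparkContext.parallelize(1 to 100, numSlices = 4)

// toLocalIterator streams partitions to the driver one at a time,
// so peak driver memory is bounded by the largest partition,
// not by the whole dataset as with collect()
val total = rdd.toLocalIterator.sum

println(total)
spark.stop()
```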

Re: Fast Unit Tests

2018-05-01 Thread Geoff Von Allmen
I am pretty new to Spark/Scala myself, but I just recently implemented unit tests for my transformations/aggregations and such. I’m using the mrpowers spark-fast-tests and spark-daria libraries. I

Fast Unit Tests

2018-05-01 Thread marcos rebelo
Hey all, we are using Scala and SQL heavily, but I have a problem with VERY SLOW unit tests. Is there a way to do fast unit tests on Spark? How are you guys getting around it? Best Regards, Marcos Rebelo

Re: Dataframe vs dataset

2018-05-01 Thread Michael Artz
I get your point haha and I also think of it as DataFrame being a specific kind of Dataset. Mike On Tue, May 1, 2018, 7:27 AM Lalwani, Jayesh wrote: > Neither. > > > > All women are humans. Not all humans are women. You wouldn’t say that a > woman is a subset of a

smarter way to "forget" DataFrame definition and stick to its values

2018-05-01 Thread Valery Khamenya
Hi all, a short example before the long story: var accumulatedDataFrame = ... // initialize for (i <- 1 to 100) { val myTinyNewData = ... // my slowly calculated new data portion in tiny amounts accumulatedDataFrame = accumulatedDataFrame.union(myTinyNewData) // how to stick
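One common way to “forget” the query plan that such a union loop accumulates is to checkpoint periodically; localCheckpoint (available since Spark 2.3) materializes the data and truncates the lineage. A sketch of a loop like the one above with that added (the data and the every-20-iterations interval are made up):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().master("local[2]").appName("lineage-sketch").getOrCreate()
import spark.implicits._

var accumulated: DataFrame = Seq(0).toDF("v") // initialize
for (i <- 1 to 100) {
  val tiny = Seq(i).toDF("v") // stand-in for the slowly calculated portion
  accumulated = accumulated.union(tiny)
  if (i % 20 == 0) {
    // Materialize and drop the accumulated plan so each iteration's
    // plan stays small instead of growing with every union
    accumulated = accumulated.localCheckpoint()
  }
}
println(accumulated.count())
spark.stop()
```

spark.sparkContext.setCheckpointDir plus the reliable .checkpoint() variant does the same thing with the data written to stable storage.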

ApacheCon North America 2018 schedule is now live.

2018-05-01 Thread Rich Bowen
Dear Apache Enthusiast, We are pleased to announce our schedule for ApacheCon North America 2018. ApacheCon will be held September 23-27 at the Montreal Marriott Chateau Champlain in Montreal, Canada. Registration is open! The early bird rate of $575 lasts until July 21, at which time it

Re: Dataframe vs dataset

2018-05-01 Thread Lalwani, Jayesh
Neither. All women are humans. Not all humans are women. You wouldn’t say that a woman is a subset of a human. All DataFrames are Datasets. Not all Datasets are DataFrames. The “subset” relationship doesn’t apply here. A DataFrame is a specialized type of Dataset. From: Michael Artz
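In the Spark 2.x Scala API this relationship is literal: DataFrame is defined as a type alias, `type DataFrame = Dataset[Row]`. A small sketch (the data is made up):

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

val spark = SparkSession.builder().master("local[1]").appName("df-vs-ds").getOrCreate()
import spark.implicits._

// A typed Dataset of tuples
val ds: Dataset[(String, Int)] = Seq(("ann", 30), ("bob", 41)).toDS()

// Converting to a DataFrame just changes the element type to Row
val df: DataFrame = ds.toDF("name", "age")

// Compiles because DataFrame and Dataset[Row] are the same type
val sameThing: Dataset[Row] = df

println(sameThing.head.getAs[String]("name"))
spark.stop()
```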

Re: Filter one dataset based on values from another

2018-05-01 Thread Lalwani, Jayesh
What columns do you want to filter myDataSet on? What are the corresponding columns in paramsDataSet? You can easily do what you want using an inner join. For example, if tempview and paramsview both have a column, say employeeID, you can do this with the SQL sparkSession.sql("Select * from
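The SQL in the snippet is cut off; a self-contained sketch of the join-as-filter idea it describes, with hypothetical table contents:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("join-filter").getOrCreate()
import spark.implicits._

Seq((1, "ann"), (2, "bob"), (3, "eve")).toDF("employeeID", "name")
  .createOrReplaceTempView("tempview")
Seq(1, 3).toDF("employeeID").createOrReplaceTempView("paramsview")

// The inner join keeps only the rows of tempview whose employeeID
// also appears in paramsview -- i.e. it acts as a filter
val filtered = spark.sql(
  """SELECT t.*
    |FROM tempview t
    |INNER JOIN paramsview p ON t.employeeID = p.employeeID""".stripMargin)

filtered.show()
spark.stop()
```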

all calculations finished, but "VCores Used" value remains at its max

2018-05-01 Thread Valery Khamenya
Hi all, I am experiencing a strange thing: when Spark 2.3.0 calculations started from Zeppelin 0.7.3 are finished, the "VCores Used" value in the resource manager stays at its maximum, although nothing should be running anymore. How come? If relevant, I have experienced this issue since AWS EMR 5.13.0

org.apache.spark.shuffle.FetchFailedException: Too large frame:

2018-05-01 Thread Pralabh Kumar
Hi, I am getting the above error in Spark SQL. I have increased the number of partitions (to 5000) but am still getting the same error. My data is most probably skewed. org.apache.spark.shuffle.FetchFailedException: Too large frame: 4247124829 at

PySpark.sql.filter not performing as it should

2018-05-01 Thread 880f0464
Hi Everyone, I wonder if someone could be so kind and shed some light on this problem: [PySpark.sql.filter not performing as it should](https://stackoverflow.com/q/49995538) Cheers, A. Sent with [ProtonMail](https://protonmail.com) Secure Email.

spark.python.worker.reuse not working as expected

2018-05-01 Thread 880f0464
Hi Everyone, I wonder if someone could be so kind and shed some light on this problem: [spark.python.worker.reuse not working as expected](https://stackoverflow.com/q/50043684) Cheers, A. Sent with [ProtonMail](https://protonmail.com) Secure Email.

UnresolvedException: Invalid call to dataType on unresolved object

2018-05-01 Thread 880f0464
Hi Everyone, I wonder if someone could be so kind and shed some light on this problem: [UnresolvedException: Invalid call to dataType on unresolved object when using DataSet constructed from Seq.empty (since Spark 2.3.0)](https://stackoverflow.com/q/49757487) Cheers, A. Sent with