Poor performance reading Hive table made of sequence files

2018-05-01 Thread Patrick McCarthy
I recently ran a query with the following form: select a.*, b.* from some_small_table a inner join ( select things from someother table lateral view explode(s) ss as sss where a_key in (x,y,z) ) b on a.key = b.key where someothercriterion. On Hive, this query took about five minutes. In

keep getting empty table while using saveAsTable() to save DataFrame as table

2018-05-01 Thread nicholasl
Hi, I am using Spark SQL in a cluster and trying to use the CBO (cost-based optimizer) supported in Spark SQL. The dataset I am using is TPC-DS. In order to collect statistics on these data, I first load the data from HDFS to create DataFrames, then use saveAsTable() to save the DataFrames as tables. All the
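For reference, the CBO only uses statistics that have been collected explicitly with ANALYZE TABLE after the tables are saved. A minimal sketch of that step, using a made-up table rather than the actual TPC-DS schema:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cbo-stats-sketch")
  .master("local[1]")
  .config("spark.sql.cbo.enabled", "true") // CBO is off by default in Spark 2.x
  .getOrCreate()

// Hypothetical stand-in for a table loaded from HDFS
val df = spark.range(0, 1000).selectExpr("id", "id % 10 as sk")
df.write.mode("overwrite").saveAsTable("store_sales_demo")

// Collect table- and column-level statistics so the CBO can use them
spark.sql("ANALYZE TABLE store_sales_demo COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE store_sales_demo COMPUTE STATISTICS FOR COLUMNS sk")

// The collected stats are visible under DESCRIBE EXTENDED
spark.sql("DESCRIBE EXTENDED store_sales_demo").show(truncate = false)
```

If the table still comes back empty afterwards, checking `spark.table("store_sales_demo").count()` separates a write problem from a statistics problem.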

Re: org.apache.spark.shuffle.FetchFailedException: Too large frame:

2018-05-01 Thread Ryan Blue
This is usually caused by skew. Sometimes you can work around it by increasing the number of partitions like you tried, but when that doesn’t work you need to change the partitioning that you’re using. If you’re aggregating, try adding an intermediate aggregation. For example, if your query is
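One common reading of “adding an intermediate aggregation” is the salting pattern: aggregate on (key, random salt) first so a hot key is spread over many shuffle partitions, then collapse the partials. A minimal sketch (the data and the salt width of 16 are made up):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[2]").appName("salted-agg").getOrCreate()
import spark.implicits._

// Skewed input: one "hot" key carries almost all of the rows
val df = (Seq.fill(1000)(("hot", 1)) ++ Seq(("cold", 1))).toDF("key", "value")

// Stage 1: aggregate on (key, salt) so the hot key is spread over
// several shuffle partitions instead of landing in a single one
val partial = df
  .withColumn("salt", (rand() * 16).cast("int"))
  .groupBy("key", "salt")
  .agg(sum("value").as("partial_sum"))

// Stage 2: collapse the salted partial sums into the final totals
val result = partial.groupBy("key").agg(sum("partial_sum").as("total"))

result.show()
spark.stop()
```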

Re: Filter one dataset based on values from another

2018-05-01 Thread lsn24
I don't think an inner join will solve my problem. *For each row in* paramsDataset, I need to filter myDataset, and then I need to run a bunch of calculations on the filtered myDataset. Say, for example, paramsDataset has three employee age ranges, e.g. 20-30, 30-50, 50-60, and regions USA, Canada.
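For the range case described above, a non-equi (range) join can still express “for each param row, filter myDataset”: every employee row is matched to each param row whose range contains it, and the per-range calculations can then be grouped by the param columns. A sketch with made-up data:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("range-join").getOrCreate()
import spark.implicits._

// Hypothetical stand-ins for paramsDataset and myDataset
val params = Seq((20, 30, "USA"), (30, 50, "USA"), (50, 60, "Canada"))
  .toDF("minAge", "maxAge", "region")

val employees = Seq(("ann", 25, "USA"), ("bob", 55, "Canada"), ("eve", 40, "USA"))
  .toDF("name", "age", "region")

// Each employee row is matched to every param row whose range contains it
val matched = employees.join(
  params,
  employees("region") === params("region") &&
    employees("age") >= params("minAge") &&
    employees("age") < params("maxAge"))

matched.show()
spark.stop()
```

Grouping `matched` by (minAge, maxAge, region) then gives one calculation per param row without a driver-side loop.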

RE: Fast Unit Tests

2018-05-01 Thread Yeikel Santana
Can you share a sample test case? How are you doing the unit tests? Are you creating the session in a beforeAll block or similar? As far as I know, if you use Spark you will end up with light integration tests rather than “real” unit tests (please correct me if I am wrong). From:
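A sketch of the beforeAll pattern referred to above, assuming ScalaTest (the class, test name, and config values are made up). Sharing one SparkSession across the suite avoids paying session startup per test, which is usually the dominant cost:

```scala
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class TransformSpec extends FunSuite with BeforeAndAfterAll {
  @transient private var spark: SparkSession = _

  override def beforeAll(): Unit = {
    spark = SparkSession.builder()
      .master("local[2]")
      .appName("fast-tests")
      // fewer shuffle partitions = much faster tests on tiny data
      .config("spark.sql.shuffle.partitions", "2")
      .getOrCreate()
  }

  override def afterAll(): Unit = spark.stop()

  test("uppercases names") {
    import spark.implicits._
    val out = Seq("ann", "bob").toDF("name")
      .selectExpr("upper(name) as name")
      .as[String].collect().toSeq
    assert(out == Seq("ANN", "BOB"))
  }
}
```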

Re: all calculations finished, but "VCores Used" value remains at its max

2018-05-01 Thread Felix Cheung
Zeppelin keeps the Spark job alive. This is likely a better question for the Zeppelin project. From: Valery Khamenya Sent: Tuesday, May 1, 2018 4:30:24 AM To: user@spark.apache.org Subject: all calculations finished, but "VCores Used" value

Re: [EXT] [Spark 2.x Core] .collect() size limit

2018-05-01 Thread klrmowse
Okay, I may have found an alternative/workaround to using .collect() for what I am trying to achieve... Initially, for the Spark application I am working on, I would call .collect() on two separate RDDs into a couple of ArrayLists (which was the reason I was asking what the size limit on the
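One alternative that is often suggested when a .collect() result may not fit in driver memory is RDD.toLocalIterator, which pulls one partition at a time to the driver instead of materializing everything at once. A tiny sketch (the data is made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("iter-sketch").getOrCreate()
val rdd = spark.sparkContext.parallelize(1 to 100, numSlices = 4)

// toLocalIterator streams partitions to the driver one at a time,
// so peak driver memory is bounded by the largest partition,
// not by the whole dataset as with collect()
val total = rdd.toLocalIterator.sum

println(total)
spark.stop()
```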

Re: Fast Unit Tests

2018-05-01 Thread Geoff Von Allmen
I am pretty new to Spark/Scala myself, but I just recently implemented unit tests for my transformations/aggregations and such. I’m using the mrpowers spark-fast-tests and spark-daria libraries. I

Fast Unit Tests

2018-05-01 Thread marcos rebelo
Hey all, we are using Scala and SQL heavily, but I have a problem with VERY SLOW unit tests. Is there a way to do fast unit tests on Spark? How are you guys getting around it? Best Regards, Marcos Rebelo

Re: Dataframe vs dataset

2018-05-01 Thread Michael Artz
I get your point haha and I also think of it as DataFrame being a specific kind of Dataset. Mike On Tue, May 1, 2018, 7:27 AM Lalwani, Jayesh wrote: > Neither. > > > > All women are humans. Not all humans are women. You wouldn’t say that a > woman is a subset of a

smarter way to "forget" DataFrame definition and stick to its values

2018-05-01 Thread Valery Khamenya
Hi all, a short example before the long story: var accumulatedDataFrame = ... // initialize for (i <- 1 to 100) { val myTinyNewData = ... // my slowly calculated new data portion in tiny amounts accumulatedDataFrame = accumulatedDataFrame.union(myTinyNewData) // how to stick
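One common way to “forget” the query plan that such a union loop accumulates is to checkpoint periodically; localCheckpoint (available since Spark 2.3) materializes the data and truncates the lineage. A sketch of a loop like the one above with that added (the data and the every-20-iterations interval are made up):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().master("local[2]").appName("lineage-sketch").getOrCreate()
import spark.implicits._

var accumulated: DataFrame = Seq(0).toDF("v") // initialize
for (i <- 1 to 100) {
  val tiny = Seq(i).toDF("v") // stand-in for the slowly calculated portion
  accumulated = accumulated.union(tiny)
  if (i % 20 == 0) {
    // Materialize and drop the accumulated plan so each iteration's
    // plan stays small instead of growing with every union
    accumulated = accumulated.localCheckpoint()
  }
}
println(accumulated.count())
spark.stop()
```

spark.sparkContext.setCheckpointDir plus the reliable .checkpoint() variant does the same thing with the data written to stable storage.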

ApacheCon North America 2018 schedule is now live.

2018-05-01 Thread Rich Bowen
Dear Apache Enthusiast, We are pleased to announce our schedule for ApacheCon North America 2018. ApacheCon will be held September 23-27 at the Montreal Marriott Chateau Champlain in Montreal, Canada. Registration is open! The early bird rate of $575 lasts until July 21, at which time it

Re: Dataframe vs dataset

2018-05-01 Thread Lalwani, Jayesh
Neither. All women are humans. Not all humans are women. You wouldn’t say that a woman is a subset of a human. All DataFrames are Datasets. Not all Datasets are DataFrames. The “subset” relationship doesn’t apply here. A DataFrame is a specialized type of Dataset. From: Michael Artz
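In the Spark 2.x Scala API this relationship is literal: DataFrame is defined as a type alias, `type DataFrame = Dataset[Row]`. A small sketch (the data is made up):

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

val spark = SparkSession.builder().master("local[1]").appName("df-vs-ds").getOrCreate()
import spark.implicits._

// A typed Dataset of tuples
val ds: Dataset[(String, Int)] = Seq(("ann", 30), ("bob", 41)).toDS()

// Converting to a DataFrame just changes the element type to Row
val df: DataFrame = ds.toDF("name", "age")

// Compiles because DataFrame and Dataset[Row] are the same type
val sameThing: Dataset[Row] = df

println(sameThing.head.getAs[String]("name"))
spark.stop()
```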

Re: Filter one dataset based on values from another

2018-05-01 Thread Lalwani, Jayesh
What columns do you want to filter myDataSet on? What are the corresponding columns in paramsDataSet? You can easily do what you want using an inner join. For example, if tempview and paramsview both have a column, say employeeID, you can do this with the SQL sparkSession.sql("Select * from
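The SQL in the snippet is cut off; a self-contained sketch of the join-as-filter idea it describes, with hypothetical table contents:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("join-filter").getOrCreate()
import spark.implicits._

Seq((1, "ann"), (2, "bob"), (3, "eve")).toDF("employeeID", "name")
  .createOrReplaceTempView("tempview")
Seq(1, 3).toDF("employeeID").createOrReplaceTempView("paramsview")

// The inner join keeps only the rows of tempview whose employeeID
// also appears in paramsview -- i.e. it acts as a filter
val filtered = spark.sql(
  """SELECT t.*
    |FROM tempview t
    |INNER JOIN paramsview p ON t.employeeID = p.employeeID""".stripMargin)

filtered.show()
spark.stop()
```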

all calculations finished, but "VCores Used" value remains at its max

2018-05-01 Thread Valery Khamenya
Hi all, I am experiencing a strange thing: when Spark 2.3.0 calculations started from Zeppelin 0.7.3 are finished, the "VCores Used" value in the resource manager stays at its maximum, although nothing should be running anymore. How come? If relevant, I have experienced this issue since AWS EMR 5.13.0

org.apache.spark.shuffle.FetchFailedException: Too large frame:

2018-05-01 Thread Pralabh Kumar
Hi, I am getting the above error in Spark SQL. I have increased the number of partitions (to 5000) but am still getting the same error. My data is most probably skewed. org.apache.spark.shuffle.FetchFailedException: Too large frame: 4247124829 at

PySpark.sql.filter not performing as it should

2018-05-01 Thread 880f0464
Hi Everyone, I wonder if someone could be so kind and shed some light on this problem: [PySpark.sql.filter not performing as it should](https://stackoverflow.com/q/49995538) Cheers, A. Sent with [ProtonMail](https://protonmail.com) Secure Email.

spark.python.worker.reuse not working as expected

2018-05-01 Thread 880f0464
Hi Everyone, I wonder if someone could be so kind and shed some light on this problem: [spark.python.worker.reuse not working as expected](https://stackoverflow.com/q/50043684) Cheers, A. Sent with [ProtonMail](https://protonmail.com) Secure Email.

UnresolvedException: Invalid call to dataType on unresolved object

2018-05-01 Thread 880f0464
Hi Everyone, I wonder if someone could be so kind and shed some light on this problem: [UnresolvedException: Invalid call to dataType on unresolved object when using DataSet constructed from Seq.empty (since Spark 2.3.0)](https://stackoverflow.com/q/49757487) Cheers, A. Sent with