Poor performance reading Hive table made of sequence files

2018-05-01 Thread Patrick McCarthy
I recently ran a query with the following form:

select a.*, b.*
from some_small_table a
inner join
(
  select things from someother_table
  lateral view explode(s) ss as sss
  where a_key in (x, y, z)
) b
on a.key = b.key
where someothercriterion

In Hive, this query took about five minutes. In Spark, using either the same
syntax in a spark.sql call or the DataFrame API, it appeared as if it was going
to take on the order of 10 hours. I didn't let it finish.

The data underlying the Hive table are sequence files, ~30 MB each, ~1000 to a
partition, and my query ran over only five partitions. A single partition is
about 25 GB.

How can Spark perform so badly? Do I need to handle sequence files in a
special way?


keep getting empty table while using saveAsTable() to save DataFrame as table

2018-05-01 Thread nicholasl
Hi,
I am using Spark SQL on a cluster and am trying to use the CBO (cost-based
optimizer) supported in Spark SQL. The dataset I am using is TPC-DS. In order
to collect statistics on the data, I first load the data from HDFS to create
DataFrames, then use saveAsTable() to save the DataFrames as tables. The whole
process seems fine and I don't get any exceptions. However, when I check the
created tables, all of them are empty.

Question: Has anyone had the same problem when using the CBO provided in
Spark SQL? Please help me with this issue. Thank you very much.
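
For reference, a minimal sketch of the workflow described above, assuming a
Hive-enabled session, that the TPC-DS data is already in Parquet at a
hypothetical HDFS path, and hypothetical table/column names (store_sales,
ss_item_sk, ss_quantity). It also computes the statistics the CBO needs and
sanity-checks the row count; it is worth confirming that the same
metastore/warehouse directory is used later when the tables are inspected.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("cbo-stats-sketch")
      .config("spark.sql.cbo.enabled", "true") // enable the cost-based optimizer
      .enableHiveSupport()                     // persist tables in a real metastore
      .getOrCreate()

    // Load from HDFS and persist as a managed table.
    val storeSales = spark.read.parquet("hdfs:///tpcds/store_sales")
    storeSales.write.mode("overwrite").saveAsTable("store_sales")

    // Sanity check that rows actually landed in the table.
    println(spark.table("store_sales").count())

    // Collect the table- and column-level statistics the CBO relies on.
    spark.sql("ANALYZE TABLE store_sales COMPUTE STATISTICS")
    spark.sql("ANALYZE TABLE store_sales COMPUTE STATISTICS FOR COLUMNS ss_item_sk, ss_quantity")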






Re: org.apache.spark.shuffle.FetchFailedException: Too large frame:

2018-05-01 Thread Ryan Blue
This is usually caused by skew. Sometimes you can work around it by increasing
the number of partitions, as you tried, but when that doesn’t work you need to
change the partitioning that you’re using.

If you’re aggregating, try adding an intermediate aggregation. For example,
if your query is select sum(x), a from t group by a, then try select
sum(partial), a from (select sum(x) as partial, a, b from t group by a, b)
group by a.
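
A minimal sketch of that two-stage aggregation in Spark SQL, assuming a
registered table t with a skewed grouping column a, a value column x, and a
secondary column b that splits the heavy keys:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("two-stage-agg-sketch").getOrCreate()

    // Direct aggregation: one heavy key in `a` can overload a single reducer.
    val direct = spark.sql("SELECT a, SUM(x) AS total FROM t GROUP BY a")

    // Two-stage aggregation: first shrink the skewed groups by (a, b),
    // then combine the partial sums per `a`.
    val twoStage = spark.sql(
      """SELECT a, SUM(partial) AS total
        |FROM (
        |  SELECT a, b, SUM(x) AS partial
        |  FROM t
        |  GROUP BY a, b
        |) partials
        |GROUP BY a""".stripMargin)

The first shuffle spreads each heavy key of a across its values of b, so no
single reducer has to absorb the whole key before the final, much smaller
group-by.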

rb

On Tue, May 1, 2018 at 4:21 AM, Pralabh Kumar 
wrote:

> Hi
>
> I am getting the above error in Spark SQL. I have increased the number of
> partitions (to 5000) but am still getting the same error.
>
> My data is most probably skewed.
>
>
>
> org.apache.spark.shuffle.FetchFailedException: Too large frame: 4247124829
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:419)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:349)
>
>


-- 
Ryan Blue
Software Engineer
Netflix


Re: Filter one dataset based on values from another

2018-05-01 Thread lsn24
I don't think an inner join will solve my problem.

*For each row in* paramsDataset, I need to filter myDataset, and then I need
to run a bunch of calculations on the filtered myDataset.

Say, for example, paramsDataset has three employee age ranges, e.g. 20-30,
30-50 and 50-60, and the regions USA and Canada.

myDataset has all employees' information for three years, such as the days a
person came to work, took a day off, etc.

I need to calculate the average number of days an employee worked per age
range for different regions, the average days off per age range, etc.
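
For reference, a minimal sketch of the per-row filtering described above,
assuming hypothetical column names (ageFrom, ageTo, region on paramsDataset;
age, region, daysWorked, daysOff on myDataset) and a parameter set small enough
to collect to the driver:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{avg, col}

    val spark = SparkSession.builder().appName("per-range-averages-sketch").getOrCreate()
    val paramsDataset = spark.sql("select * from paramsview")
    val myDataset = spark.sql("select * from tempview")

    // The parameter dataset is tiny (a few ranges and regions), so it can
    // drive one filtered aggregation per parameter row.
    paramsDataset.collect().foreach { p =>
      val ageFrom = p.getAs[Int]("ageFrom")
      val ageTo   = p.getAs[Int]("ageTo")
      val region  = p.getAs[String]("region")

      val filtered = myDataset.filter(
        col("age") >= ageFrom && col("age") < ageTo && col("region") === region)

      // Run the per-range calculations on the filtered slice.
      filtered.agg(
        avg(col("daysWorked")).as("avgDaysWorked"),
        avg(col("daysOff")).as("avgDaysOff")).show()
    }

Each iteration launches its own Spark job; if the parameter set ever grows
large, a single non-equi join on the range bounds followed by one groupBy
would avoid the loop.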






RE: Fast Unit Tests

2018-05-01 Thread Yeikel Santana
Can you share a sample test case? How are you doing the unit tests? Are you
creating the session in a beforeAll block or similar?

As far as I know, if you use Spark you will end up with lightweight integration
tests rather than “real” unit tests (please correct me if I am wrong).
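
In case it helps, a minimal sketch of the shared-session pattern I had in mind,
assuming ScalaTest on the classpath; Transformations.withDoubled is a
hypothetical function under test:

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions.col
    import org.scalatest.{BeforeAndAfterAll, FlatSpec, Matchers}

    object Transformations {
      // Hypothetical transformation under test.
      def withDoubled(df: DataFrame): DataFrame = df.withColumn("doubled", col("n") * 2)
    }

    class TransformationsSpec extends FlatSpec with Matchers with BeforeAndAfterAll {

      // One local session for the whole suite keeps the startup cost out of
      // every individual test.
      lazy val spark: SparkSession = SparkSession.builder()
        .master("local[2]")
        .appName("fast-unit-tests-sketch")
        .config("spark.sql.shuffle.partitions", "2") // avoid 200 tiny shuffle partitions
        .getOrCreate()

      override def afterAll(): Unit = spark.stop()

      "withDoubled" should "double the input column" in {
        import spark.implicits._
        val result = Transformations.withDoubled(Seq(1, 2, 3).toDF("n"))
        result.select("doubled").as[Int].collect() should contain theSameElementsAs Seq(2, 4, 6)
      }
    }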

 

From: marcos rebelo
Sent: Tuesday, May 1, 2018 11:25 AM
To: user
Subject: Fast Unit Tests

Hey all,

We are using Scala and SQL heavily, but I have a problem with VERY SLOW Unit
Tests.

Is there a way to do fast Unit Tests on Spark?

How are you guys going around it?

Best Regards

Marcos Rebelo



Re: all calculations finished, but "VCores Used" value remains at its max

2018-05-01 Thread Felix Cheung
Zeppelin keeps the Spark job alive. This is likely a better question for the 
Zeppelin project.


From: Valery Khamenya 
Sent: Tuesday, May 1, 2018 4:30:24 AM
To: user@spark.apache.org
Subject: all calculations finished, but "VCores Used" value remains at its max

Hi all

I am experiencing a strange thing: when Spark 2.3.0 calculations started from
Zeppelin 0.7.3 are finished, the "VCores Used" value in the resource manager
stays at its maximum, even though nothing should be running anymore. How come?

If relevant: I have been seeing this issue since AWS EMR 5.13.0.

best regards
--
Valery


best regards
--
Valery A.Khamenya


Re: [EXT] [Spark 2.x Core] .collect() size limit

2018-05-01 Thread klrmowse
OK, I may have found an alternative/workaround to using .collect() for what I
am trying to achieve...

Initially, for the Spark application I am working on, I would call .collect()
on two separate RDDs into a couple of ArrayLists (which was the reason I was
asking what the size limit on the driver is).

I need to map the 1st RDD to the 2nd RDD according to a computation/function,
resulting in key-value pairs.

It turns out I don't need to call .collect() if I instead use .zipPartitions(),
to which I can pass the function.

I am currently testing it out...
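
For reference, a minimal sketch of the .zipPartitions() approach described
above, assuming both RDDs have the same number of partitions (a requirement of
zipPartitions) and a hypothetical pairing function:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("zip-partitions-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Two RDDs with the same number of partitions (required by zipPartitions).
    val rdd1 = sc.parallelize(Seq("a", "b", "c", "d"), 2)
    val rdd2 = sc.parallelize(Seq(1, 2, 3, 4), 2)

    // Hypothetical computation that maps an element of the 1st RDD to an
    // element of the 2nd RDD, producing a key-value pair.
    def combine(x: String, y: Int): (String, Int) = (x, y * 10)

    // zipPartitions hands over one iterator per RDD for each partition pair,
    // so nothing has to be collected to the driver.
    val pairs = rdd1.zipPartitions(rdd2) { (it1, it2) =>
      it1.zip(it2).map { case (x, y) => combine(x, y) }
    }

    pairs.take(4).foreach(println)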



Thanks all for your responses






Re: Fast Unit Tests

2018-05-01 Thread Geoff Von Allmen
I am pretty new to Spark/Scala, but I just recently implemented unit tests to
test my transformations/aggregations and such.

I’m using the mrpowers spark-fast-tests and spark-daria libraries.

I am also using a JDBC sink in the foreach writer. I’ve mocked the sink to
place the generated MySQL statements in a global object, and then I compare
the output there to an expected set of MySQL statements.

I’m running this with FlatSpec ScalaTests, where my Spark inputs are manually
generated fixtures for each test case. Everything seems to be running well and
it's nice and quick.
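
For what it's worth, a minimal sketch of that mocked sink, assuming a
hypothetical statement-rendering scheme and a local[*] test run (driver and
executors share one JVM, so the global buffer is visible to the test):

    import scala.collection.mutable.ArrayBuffer
    import org.apache.spark.sql.{ForeachWriter, Row}

    // Global object that collects the statements the mocked sink "executes".
    object CapturedStatements {
      val statements: ArrayBuffer[String] = ArrayBuffer.empty[String]
    }

    // Test double for the JDBC sink: instead of opening a MySQL connection it
    // records each generated statement so the test can compare against an
    // expected set.
    class RecordingMySqlWriter extends ForeachWriter[Row] {
      override def open(partitionId: Long, epochId: Long): Boolean = true

      override def process(row: Row): Unit =
        CapturedStatements.statements.synchronized {
          // Hypothetical rendering of a row as the SQL that would go to MySQL.
          CapturedStatements.statements += s"INSERT INTO results VALUES (${row.toSeq.mkString(", ")})"
        }

      override def close(errorOrNull: Throwable): Unit = ()
    }

The test then starts the streaming query with .foreach(new RecordingMySqlWriter),
feeds it the fixture input, and asserts on CapturedStatements.statements.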
​

On Tue, May 1, 2018 at 8:25 AM, marcos rebelo  wrote:

> Hey all,
>
> We are using Scala and SQL heavily, but I have a problem with VERY SLOW
> Unit Tests.
>
> Is there a way to do fast Unit Tests on Spark?
>
> How are you guys going around it?
>
> Best Regards
> Marcos Rebelo
>


Fast Unit Tests

2018-05-01 Thread marcos rebelo
Hey all,

We are using Scala and SQL heavily, but I have a problem with VERY SLOW
Unit Tests.

Is there a way to do fast Unit Tests on Spark?

How are you guys going around it?

Best Regards
Marcos Rebelo


Re: Dataframe vs dataset

2018-05-01 Thread Michael Artz
I get your point, haha, and I also think of it as a DataFrame being a specific
kind of Dataset.
Mike

On Tue, May 1, 2018, 7:27 AM Lalwani, Jayesh 
wrote:

> Neither.
>
>
>
> All women are humans. Not all humans are women. You wouldn’t say that a
> woman is a subset of a human.
>
>
>
> All DataFrames are Datasets. Not all Datasets are DataFrames. The “subset”
> relationship doesn’t apply here. A DataFrame is a specialized type of
> Dataset.
>
>
>
> *From: *Michael Artz 
> *Date: *Saturday, April 28, 2018 at 9:24 AM
> *To: *"user @spark" 
> *Subject: *Dataframe vs dataset
>
>
>
> Hi,
>
>
>
> I use Spark every day and I have a good grip on the basics of Spark, so
> this question isn't for myself. But this came up and I wanted to see what
> other Spark users would say, and I don't want to influence your answer. And
> SO is weird about polls. The question is
>
>
>
>  "Which one do you feel is accurate... Dataset is a subset of DataFrame,
> or DataFrame a subset of Dataset?"
>


smarter way to "forget" DataFrame definition and stick to its values

2018-05-01 Thread Valery Khamenya
hi all

a short example before the long story:

  var accumulatedDataFrame = ... // initialize

  for (i <- 1 to 100) {
    // my slowly calculated new data portion, in tiny amounts
    val myTinyNewData = ...
    accumulatedDataFrame = accumulatedDataFrame.union(myTinyNewData)
    // how to stick here to the values of accumulatedDataFrame only and forget the definitions?!
  }

This kind of stuff is likely to get slower and slower with each iteration,
even if myTinyNewData is quite compact. Usually I write accumulatedDataFrame
to S3 and then re-load it back to clear the definition history. It makes the
code ugly, though. Is there a smarter way?
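
For concreteness, a minimal sketch of that write-and-reload pattern, with a
hypothetical scratch path; the reloaded DataFrame's lineage starts at the files
instead of at the accumulated chain of unions:

    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Write the current values out once, then read them back so the plan is
    // just "scan these files" rather than the whole union history.
    def materialize(spark: SparkSession, df: DataFrame, path: String): DataFrame = {
      df.write.mode("overwrite").parquet(path)
      spark.read.parquet(path)
    }

    // e.g. every N iterations inside the loop:
    // accumulatedDataFrame = materialize(spark, accumulatedDataFrame, "s3://some-bucket/tmp/acc")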

It happens very often that a DataFrame is created via complex definitions. The
DataFrame is then re-used in several places, and sometimes it gets recalculated,
triggering a heavy cascade of operations.

Of course one could use .persist or .cache, but the result is unfortunately not
transparent, and instead of speeding things up it results in slowdowns or even
lost jobs if storage resources are not sufficient.

Any advice?

best regards
--
Valery


ApacheCon North America 2018 schedule is now live.

2018-05-01 Thread Rich Bowen

Dear Apache Enthusiast,

We are pleased to announce our schedule for ApacheCon North America 
2018. ApacheCon will be held September 23-27 at the Montreal Marriott 
Chateau Champlain in Montreal, Canada.


Registration is open! The early bird rate of $575 lasts until July 21, 
at which time it goes up to $800. And the room block at the Marriott 
($225 CAD per night, including wifi) closes on August 24th.


We will be featuring more than 100 sessions on Apache projects. The 
schedule is now online at https://apachecon.com/acna18/


The schedule includes full tracks of content from Cloudstack[1], 
Tomcat[2], and our GeoSpatial community[3].


We will have 4 keynote speakers, two of whom are Apache members, and two 
from the wider community.


On Tuesday, Apache member and former board member Cliff Schmidt will be 
speaking about how Amplio uses technology to educate and improve the 
quality of life of people living in very difficult parts of the 
world[4]. And Apache Fineract VP Myrle Krantz will speak about how Open 
Source banking is helping the global fight against poverty[5].


Then, on Wednesday, we’ll hear from Bridget Kromhout, Principal Cloud 
Developer Advocate from Microsoft, about the really hard problem in 
software - the people[6]. And Euan McLeod, VP VIPER at Comcast, will
show us the many ways that Apache software delivers your favorite shows 
to your living room[7].


ApacheCon will also feature old favorites like the Lightning Talks, the 
Hackathon (running the duration of the event), PGP key signing, and lots 
of hallway-track time to get to know your project community better.


Follow us on Twitter, @ApacheCon, and join the disc...@apachecon.com 
mailing list (send email to discuss-subscr...@apachecon.com) to stay up 
to date with developments. And if your company wants to sponsor this 
event, get in touch at h...@apachecon.com for opportunities that are 
still available.


See you in Montreal!

Rich Bowen
VP Conferences, The Apache Software Foundation
h...@apachecon.com
@ApacheCon

[1] http://cloudstackcollab.org/
[2] http://tomcat.apache.org/conference.html
[3] http://apachecon.dukecon.org/acna/2018/#/schedule?search=geospatial
[4] 
http://apachecon.dukecon.org/acna/2018/#/scheduledEvent/df977fd305a31b903
[5] 
http://apachecon.dukecon.org/acna/2018/#/scheduledEvent/22c6c30412a3828d6
[6] 
http://apachecon.dukecon.org/acna/2018/#/scheduledEvent/fbbb2384fa91ebc6b
[7] 
http://apachecon.dukecon.org/acna/2018/#/scheduledEvent/88d50c3613852c2de





Re: Dataframe vs dataset

2018-05-01 Thread Lalwani, Jayesh
Neither.

All women are humans. Not all humans are women. You wouldn’t say that a woman 
is a subset of a human.

All DataFrames are Datasets. Not all Datasets are DataFrames. The “subset”
relationship doesn’t apply here. A DataFrame is a specialized type of Dataset.
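
Spark's own source makes the relationship concrete: in the org.apache.spark.sql
package object, DataFrame is simply an alias for a Dataset of Rows.

    // From Spark's org.apache.spark.sql package object:
    type DataFrame = Dataset[Row]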

From: Michael Artz 
Date: Saturday, April 28, 2018 at 9:24 AM
To: "user @spark" 
Subject: Dataframe vs dataset

Hi,

I use Spark every day and I have a good grip on the basics of Spark, so this
question isn't for myself. But this came up and I wanted to see what other
Spark users would say, and I don't want to influence your answer. And SO is
weird about polls. The question is

 "Which one do you feel is accurate... Dataset is a subset of DataFrame, or 
DataFrame a subset of Dataset?"




Re: Filter one dataset based on values from another

2018-05-01 Thread Lalwani, Jayesh
What columns do you want to filter myDataSet on? What are the corresponding 
columns in paramsDataSet?

You can easily do what you want using an inner join. For example, if tempview
and paramsview both have a column, say employeeId, you can do this with the SQL:

sparkSession.sql("Select * from tempview inner join paramsview on
tempview.employeeId = paramsview.employeeId")

On 5/1/18, 12:03 AM, "lsn24"  wrote:

Hi,
  I have one dataset with parameters and another with data that needs to be
filtered based on the first dataset (the parameter dataset).

*Scenario is as follows:*

For each row in the parameter dataset, I need to apply the parameter row to
the second dataset. I will end up having multiple datasets. For each second
dataset I need to run a bunch of calculations.

How can I achieve this in Spark?

*Pseudo code for better understanding:*

Dataset<Row> paramsDataset = sparkSession.sql("select * from paramsview");

Dataset<Row> myDataset = sparkSession.sql("select * from tempview");


Question: For each row in paramsDataset, I need to filter myDataset and run
some calculations on it. Is it possible to do that? If not, what's the best
way to solve it?

Thanks













all calculations finished, but "VCores Used" value remains at its max

2018-05-01 Thread Valery Khamenya
Hi all

I am experiencing a strange thing: when Spark 2.3.0 calculations started from
Zeppelin 0.7.3 are finished, the "VCores Used" value in the resource manager
stays at its maximum, even though nothing should be running anymore. How come?

If relevant: I have been seeing this issue since AWS EMR 5.13.0.

best regards
--
Valery


best regards
--
Valery A.Khamenya


org.apache.spark.shuffle.FetchFailedException: Too large frame:

2018-05-01 Thread Pralabh Kumar
Hi

I am getting the above error in Spark SQL. I have increased the number of
partitions (to 5000) but am still getting the same error.

My data is most probably skewed.



org.apache.spark.shuffle.FetchFailedException: Too large frame: 4247124829
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:419)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:349)


PySpark.sql.filter not performing as it should

2018-05-01 Thread 880f0464
Hi Everyone,

I wonder if someone could be so kind as to shed some light on this problem:

[PySpark.sql.filter not performing as it 
should](https://stackoverflow.com/q/49995538)

Cheers,
A.

Sent with [ProtonMail](https://protonmail.com) Secure Email.

spark.python.worker.reuse not working as expected

2018-05-01 Thread 880f0464
Hi Everyone,

I wonder if someone could be so kind as to shed some light on this problem:

[spark.python.worker.reuse not working as 
expected](https://stackoverflow.com/q/50043684)

Cheers,
A.

Sent with [ProtonMail](https://protonmail.com) Secure Email.

UnresolvedException: Invalid call to dataType on unresolved object

2018-05-01 Thread 880f0464
Hi Everyone,

I wonder if someone could be so kind as to shed some light on this problem:

[UnresolvedException: Invalid call to dataType on unresolved object when using 
DataSet constructed from Seq.empty (since Spark 
2.3.0)](https://stackoverflow.com/q/49757487)

Cheers,
A.

Sent with [ProtonMail](https://protonmail.com) Secure Email.