Schema Evolution for nested Dataset[T]

2017-04-30 Thread Mike Wheeler
Hi Spark Users,

Suppose I have some data (stored in parquet for example) generated as below:

package com.company.entity.old
case class Course(id: Int, students: List[Student])
case class Student(name: String)

Then usually I can access the data by

spark.read.parquet("data.parquet").as[Course]

Now I want to add a new field `address` to Student:

package com.company.entity.`new`  // backticks needed: `new` is a reserved word in Scala
case class Course(id: Int, students: List[Student])
case class Student(name: String, address: String)

Then obviously running `spark.read.parquet("data.parquet").as[Course]`
on data generated by the old entity/schema will fail because `address`
is missing.
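
To make it concrete, the only workaround I know of is a manual mapping,
roughly like this (a sketch; the old classes kept around under their old
package, names as above):

import org.apache.spark.sql.SparkSession
import com.company.entity.old.{Course => OldCourse}
import com.company.entity.`new`.{Course, Student}

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// Read with the old case classes, then map each record into the new
// entity, filling the missing field with a default value.
val upgraded = spark.read.parquet("data.parquet").as[OldCourse]
  .map(c => Course(c.id, c.students.map(s => Student(s.name, address = ""))))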

In this case, what is the best practice for reading data generated with the
old entity/schema into the new entity/schema, with the missing field set to
some default value? I know I can manually write a function like the one
above to do the transformation from the old to the new, but it is kind of
tedious. Are there any automatic methods?

Thanks,

Mike




Re: parquet optimal file structure - flat vs nested

2017-04-30 Thread Jörn Franke
Can you give more details on the schema? Is the 6 TB just the airport
information, as below?

> On 30. Apr 2017, at 23:08, Zeming Yu wrote:
> 
> I thought relational databases with 6 TB of data can be quite expensive?


Re: parquet optimal file structure - flat vs nested

2017-04-30 Thread Jörn Franke
You have to find out how the user filters: by code? By airport name? Then
you can choose the right structure. Although, in the scenario below, ORC
with bloom filters may have some advantages.
It is crucial that you sort the data on the columns your users want to
filter on when inserting it. E.g. if the user filters by code, then the data
needs to be sorted on the code column.
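
For example, a sketch (the source, paths, and a top-level `code` column are
all illustrative here):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
val flights = spark.read.json("flights.json")  // illustrative source

// Sort on the filter column before writing so row-group statistics
// (min/max) can prune effectively on read.
flights.sortWithinPartitions("code").write.parquet("s3://bucket/flights_by_code")

// ORC variant with a bloom filter on the same column; whether this
// option is honoured depends on the ORC writer in use.
flights.sortWithinPartitions("code")
  .write
  .option("orc.bloom.filter.columns", "code")
  .orc("s3://bucket/flights_orc")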


> On 30. Apr 2017, at 23:08, Zeming Yu wrote:
> 
> I thought relational databases with 6 TB of data can be quite expensive?


Re: parquet optimal file structure - flat vs nested

2017-04-30 Thread Zeming Yu
I thought relational databases with 6 TB of data can be quite expensive?

On 1 May 2017 12:56 am, "Muthu Jayakumar" wrote:

> I am not sure if parquet is a good fit for this? This seems more like a
> filter lookup than an aggregate-like query. I am curious to see what others
> have to say.
> Would a relational database with the right index (on the code field in the
> airport example) perform more efficiently than Spark with predicate
> push-down?
> Hope this helps.
>
> Thanks,
> Muthu


examples of dealing with nested parquet/dataframe files

2017-04-30 Thread Zeming Yu
Hi,

I'm still trying to decide whether to store my data as a deeply nested or a
flat parquet file.

The main reason for storing the nested file is that it keeps the data in its
raw format, with no information loss.

I have two questions:

1. Is it always necessary to flatten a nested dataframe for the purpose of
building a machine learning model? (I don't want to use the explode
function as there's only one response per row)

2. Could anyone point me to a few examples of dealing with deeply nested
(say 5 levels deep) dataframes in pyspark?
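
As a starting point, a minimal sketch of working with nested fields without
explode, using dot notation (shown in Scala to match the rest of the thread;
the DataFrame API has the same shape in PySpark; file and field names are
illustrative, echoing the airport example from the other thread):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.getOrCreate()
val flights = spark.read.parquet("flights.parquet")

// Nested struct fields can be selected directly with dot notation;
// each row keeps its single response, nothing is exploded.
val flat = flights.select(
  col("airport.code").as("airport_code"),
  col("airport.name").as("airport_name")
)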


Re: Recommended cluster parameters

2017-04-30 Thread Zeming Yu
I've got a similar question. Would you be able to provide some rough guide
(even a range is fine) on the number of nodes, cores, and total amount of
RAM required?

> Do you want to store 1 TB, 1 PB or far more?

- say 6 TB of data in parquet format on s3

> Do you want to just read that data, retrieve it and do a little work on
> it, or run a complex machine learning pipeline?

- I need to 1) read it and do complex machine learning, and 2) query the
last 3 months of data, visualise it, and come back with answers within
seconds



Re: Recommended cluster parameters

2017-04-30 Thread yohann jardin
It really depends on your needs and your data.


Do you want to store 1 TB, 1 PB or far more? Do you want to just read that
data, retrieve it and do a little work on it, or run a complex machine
learning pipeline? Depending on the workload, the ratio between cores and
storage will vary.


First, start with a subset of your data and run some tests on your own
computer or (better) on a small cluster of 3 nodes. This will help you find
your storage/cores ratio and estimate the memory you will need once you move
from the subset to the whole dataset you have.


Then, using this information and the guidance on the Spark website
(http://spark.apache.org/docs/latest/hardware-provisioning.html), you will
be able to specify the hardware of one node and how many nodes you need (at
least 3).
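
Once you have those numbers, they translate into a handful of settings; for
example (a sketch with illustrative figures; spark.executor.instances
assumes YARN):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .config("spark.executor.instances", "12")  // how many executors
  .config("spark.executor.cores", "5")       // cores per executor
  .config("spark.executor.memory", "16g")    // heap per executor
  .getOrCreate()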


Yohann Jardin




Re: parquet optimal file structure - flat vs nested

2017-04-30 Thread Zeming Yu
Another question: I need to store airport info in a parquet file and
present it when a user makes a query.

For example:

"airport": {
"code": "TPE",
"name": "Taipei (Taoyuan Intl.)",
"longName": "Taipei, Taiwan
(TPE-Taoyuan Intl.)",
"city": "Taipei",
"localName": "Taoyuan Intl.",
"airportCityState": "Taipei, Taiwan"


Is it best practice to store just the code "TPE" and then look up the name
"Taipei (Taoyuan Intl.)" from a relational database? Any alternatives?



Re: parquet optimal file structure - flat vs nested

2017-04-30 Thread Jörn Franke
It depends on your queries, the data structure, etc. Generally, flat is
better, but if your query filter is on the highest level then you may get
better performance with a nested structure. It really depends.




Recommended cluster parameters

2017-04-30 Thread rakesh sharma
Hi

I would like to know the details of implementing a cluster.

What kind of machines would one require, how many nodes, how many cores, etc.?


thanks

rakesh


parquet optimal file structure - flat vs nested

2017-04-30 Thread Zeming Yu
Hi,

We're building a parquet-based data lake. I was under the impression that
flat files are more efficient than deeply nested files (say 3 or 4 levels
down). Is that correct?

Thanks,
Zeming


Spark repartition question...

2017-04-30 Thread Muthu Jayakumar
Hello there,

I am trying to understand the difference between the following
repartition() overloads:
a. def repartition(partitionExprs: Column*): Dataset[T]
b. def repartition(numPartitions: Int, partitionExprs: Column*): Dataset[T]
c. def repartition(numPartitions: Int): Dataset[T]

My understanding is that (c) is a simpler hash-based partitioner where the
records are partitioned equally into numPartitions.
(a) is more like (c), except that the numPartitions depends on the distinct
column values from the expression. Right?
(b) is similar to (a), but what does numPartitions mean here?

On a side note, from the source code it seems like (a) & (b) use
RepartitionByExpression. And my guess is that (a) would default
numPartitions to 200 (which is the default shuffle partition count).

Reason for my question:
say df.repartition(50, col("cat_col"))
and the distinct `cat_col` values for the df number about 20. Would the
effective partition count still be 50? And if it's 50, would the 20 distinct
values most likely each get their own partition, with some values possibly
spilling into the remaining 30 buckets... Is this loosely correct?
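
A quick empirical check against a toy DataFrame (a sketch; `cat_col` stands
in for the real column):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.getOrCreate()

// 20 distinct values of cat_col across 1000 rows.
val df = spark.range(1000).selectExpr("id", "id % 20 as cat_col")

df.repartition(col("cat_col")).rdd.getNumPartitions
// -> spark.sql.shuffle.partitions (200 by default)

df.repartition(50, col("cat_col")).rdd.getNumPartitions
// -> 50; with only 20 distinct hash values most of the 50 partitions
//    stay empty, and two values can also hash into the same partition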

The reason for my question is to attempt to fit a large amount of data in
memory that would not otherwise fit across all the workers in the cluster.
But if I repartition the data in some logical manner, then I would be able
to fit the data in the heap, perform some useful joins, and write the result
back into parquet (or another useful datastore).

Please advise,
Muthu