Schema Evolution for nested Dataset[T]
Hi Spark Users,

Suppose I have some data (stored in parquet, for example) generated as below:

package com.company.entity.old
case class Course(id: Int, students: List[Student])
case class Student(name: String)

Then I can normally access the data with spark.read.parquet("data.parquet").as[Course].

Now I want to add a new field `address` to Student:

package com.company.entity.new
case class Course(id: Int, students: List[Student])
case class Student(name: String, address: String)

Running `spark.read.parquet("data.parquet").as[Course]` on data generated with the old entity/schema will obviously fail, because `address` is missing. In this case, what is the best practice for reading data generated with the old entity/schema into the new entity/schema, with the missing field set to some default value?

I know I can manually write a function to do the transformation from the old to the new, but that is kind of tedious. Are there any automatic methods?

Thanks,
Mike

- To unsubscribe e-mail: user-unsubscr...@spark.apache.org
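[One workaround, sketched below under assumptions: read the file as an untyped DataFrame first, rebuild the nested `students` array with the missing field defaulted, and only then convert to the typed Dataset. The `functions.transform` lambda form requires Spark 3.0+; on older versions the same rewrite can be done with a SQL `transform` in selectExpr or a manual map. The default value "unknown" is just an illustration:]

```scala
import org.apache.spark.sql.functions.{col, lit, struct, transform}
import spark.implicits._

// Read with the old on-disk schema, untyped
val old = spark.read.parquet("data.parquet")

// Rebuild each Student struct, adding `address` with a default value
val migrated = old.withColumn(
  "students",
  transform(col("students"), s =>
    struct(s.getField("name").as("name"), lit("unknown").as("address"))))

// Now the DataFrame matches the new case class
val courses = migrated.as[Course]
```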
Re: parquet optimal file structure - flat vs nested
Can you give more details on the schema? Is it 6 TB of just airport information as below?

> On 30. Apr 2017, at 23:08, Zeming Yu wrote:
>
> I thought relational databases with 6 TB of data can be quite expensive?
Re: parquet optimal file structure - flat vs nested
You have to find out how the user filters: by code? By airport name? Then you can choose the right structure. Although, in the scenario below, ORC with bloom filters may have some advantages. It is crucial that you sort the data when inserting it, on the columns your users want to filter on. E.g. if they filter by code, then it needs to be sorted on code.

> On 30. Apr 2017, at 23:08, Zeming Yu wrote:
>
> I thought relational databases with 6 TB of data can be quite expensive?
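[The sort-before-write advice above can be sketched as follows. Treat the paths as placeholders and the `orc.bloom.filter.columns` option as an assumption to verify against your Spark/ORC versions — it is passed through to the ORC writer:]

```scala
// Sorting on the filter column before writing keeps each file/stripe's
// min/max statistics and bloom filters tight, so reads can skip data.
spark.read.parquet("airports.parquet")
  .sort("code")
  .write
  .option("orc.bloom.filter.columns", "code") // build a bloom filter on `code`
  .orc("airports_orc")
```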
Re: parquet optimal file structure - flat vs nested
I thought relational databases with 6 TB of data can be quite expensive?

On 1 May 2017 12:56 am, "Muthu Jayakumar" wrote:
> I am not sure parquet is a good fit for this? This seems more like a filter lookup than an aggregate-like query. I am curious to see what others have to say.
> Would a relational database with the right index (the code field in the above case) perform more efficiently (with Spark using predicate push-down)?
> Hope this helps.
>
> Thanks,
> Muthu
examples of dealing with nested parquet/ dataframe file
Hi,

I'm still trying to decide whether to store my data as a deeply nested or a flat parquet file. The main reason for storing the nested file is that it keeps the data in its raw format, with no information loss.

I have two questions:

1. Is it always necessary to flatten a nested dataframe for the purpose of building a machine learning model? (I don't want to use the explode function, as there's only one response per row.)
2. Could anyone point me to a few examples of dealing with deeply nested (say 5 levels deep) dataframes in pyspark?
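[For question 2, nested fields can usually be addressed with dot paths, and exploded only where each array element really needs to become a row. A minimal sketch in Scala — the file, the `airport` struct, and the `flights` array here are hypothetical; the same dot-path selects work in pyspark:]

```scala
import org.apache.spark.sql.functions.explode
import spark.implicits._

val df = spark.read.parquet("flights.parquet")

// Select nested leaves by dot path; no flattening required
df.select($"airport.code", $"airport.name").show()

// Drill into an array of structs only when each element is needed as a row
df.select(explode($"flights").as("f"))
  .select($"f.price", $"f.departure")
```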
Re: Recommended cluster parameters
I've got a similar question. Would you be able to provide a rough guide (even a range is fine) on the number of nodes, cores, and total amount of RAM required?

> Do you want to store 1 TB, 1 PB or far more?

Say 6 TB of data in parquet format on S3.

> Do you want to just read that data, retrieve it and do little work on it, or do you have a complex machine learning pipeline?

I need to 1) read it and do complex machine learning, and 2) query the last 3 months of data, visualise it, and come back with answers within seconds of latency.
Re: Recommended cluster parameters
It really depends on your needs and your data. Do you want to store 1 TB, 1 PB or far more? Do you want to just read that data, retrieve it and do little work on it, or do you have a complex machine learning pipeline? Depending on the workload, the ratio between cores and storage will vary.

First, start with a subset of your data and do some tests on your own computer or (better) with a little cluster of 3 nodes. This will help you find your ratio between storage and cores, and the amount of memory you might expect to need when you move from the subset to the whole of your data.

Then, using this information and the indications on the Spark website (http://spark.apache.org/docs/latest/hardware-provisioning.html), you will be able to specify the hardware of one node, and how many nodes you need (at least 3).

Yohann Jardin
Re: parquet optimal file structure - flat vs nested
Another question: I need to store airport info in a parquet file and present it when a user makes a query. For example:

"airport": {
  "code": "TPE",
  "name": "Taipei (Taoyuan Intl.)",
  "longName": "Taipei, Taiwan (TPE-Taoyuan Intl.)",
  "city": "Taipei",
  "localName": "Taoyuan Intl.",
  "airportCityState": "Taipei, Taiwan"
}

Is it best practice to store just the code "TPE" and then look up the name "Taipei (Taoyuan Intl.)" from a relational database? Any alternatives?
Re: parquet optimal file structure - flat vs nested
It depends on your queries, the data structure, etc. Generally flat is better, but if your query filter is on the highest level, then you may have better performance with a nested structure. It really depends.
Recommended cluster parameters
Hi,

I would like to know the details of implementing a cluster. What kind of machines would one require, how many nodes, number of cores, etc.?

thanks
rakesh
parquet optimal file structure - flat vs nested
Hi,

We're building a parquet-based data lake. I was under the impression that flat files are more efficient than deeply nested files (say 3 or 4 levels down). Is that correct?

Thanks,
Zeming
Spark repartition question...
Hello there,

I am trying to understand the difference between the following repartition() variants:

a. def repartition(partitionExprs: Column*): Dataset[T]
b. def repartition(numPartitions: Int, partitionExprs: Column*): Dataset[T]
c. def repartition(numPartitions: Int): Dataset[T]

My understanding is that (c) is a simple hash-based partitioner where the records are distributed roughly equally across numPartitions. (a) is more like (c), except that the partitioning depends on the distinct column values from the expression, right? (b) is similar to (a), but what does numPartitions mean here?

On a side note, from the source code it seems that both (a) and (b) use RepartitionByExpression, and my guess is that (a) defaults numPartitions to 200 (the default shuffle partition count).

The reason for my question: say df.repartition(50, col("cat_col")), and the df has about 20 distinct `cat_col` values. Would the effective number of partitions still be 50? And if so, would the 20 distinct values most likely get their own partitions, while some of the values could also repeat into the remaining 30? Is this loosely correct?

I am trying to fit a large amount of data in memory that would not fit across all the workers in the cluster. If I repartition the data in some logical manner, I should be able to fit it in the heap, perform some useful joins, and write the result back into parquet (or another useful datastore).

Please advise,
Muthu
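[A small sketch of the three variants, under assumptions: the `cat_col` column and the counts below are hypothetical. Note that (a) takes its partition count from spark.sql.shuffle.partitions (200 by default), and that with (b) each distinct key hashes to exactly one of the numPartitions buckets, so a value does not "repeat" into other partitions — at most 20 of the 50 partitions would be non-empty, fewer if keys collide:]

```scala
import org.apache.spark.sql.functions.{col, spark_partition_id}

val df = spark.range(1000).withColumn("cat_col", col("id") % 20) // 20 distinct values

val a = df.repartition(col("cat_col"))     // count = spark.sql.shuffle.partitions (default 200)
val b = df.repartition(50, col("cat_col")) // 50 partitions; each key hashes into exactly one
val c = df.repartition(50)                 // 50 roughly equal partitions, ignoring values

println(b.rdd.getNumPartitions) // 50
// Non-empty partitions under (b): at most 20 (one per distinct key, modulo collisions)
println(b.select(spark_partition_id()).distinct().count())
```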