Re: [discuss] DataFrame function namespacing
After talking with people on this thread and offline, I've decided to go with option 1, i.e. putting everything in a single "functions" object.
Re: [discuss] DataFrame function namespacing
IMHO I would go with choice #1.

Cheers
Re: [discuss] DataFrame function namespacing
We definitely still have the name collision problem in SQL.
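A toy sketch of why a flat SQL namespace forces distinct function names, regardless of how the Scala side is organized (hypothetical code; this is not Spark's actual FunctionRegistry API):

```scala
import scala.collection.mutable

// Toy illustration: SQL resolves functions by a single flat name, so two
// Scala-side namespaces could not both contribute a function named "min"
// without colliding at registration time.
object SqlRegistrySketch {
  private val registry = mutable.Map.empty[String, Seq[Double] => Double]

  def register(name: String, f: Seq[Double] => Double): Unit = {
    require(!registry.contains(name), s"function '$name' already registered")
    registry(name) = f
  }

  def lookup(name: String): Seq[Double] => Double = registry(name)

  def main(args: Array[String]): Unit = {
    register("min", xs => xs.min)     // say, from a hypothetical basicFuncs
    // register("min", xs => xs.min)  // from mathFuncs -- would throw here
    println(lookup("min")(Seq(3.0, 1.0, 2.0)))
  }
}
```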
Re: [discuss] DataFrame function namespacing
Do we still have to keep the names of the functions distinct to avoid collisions in SQL? Or is there a plan to allow "importing" a namespace into SQL somehow?

I ask because if we have to keep worrying about name collisions then I'm not sure what the added complexity of #2 and #3 buys us.

Punya
Re: [discuss] DataFrame function namespacing
Scaladoc isn't much of a problem because scaladocs are grouped. Java/Python is the main problem ...

See
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
Re: [discuss] DataFrame function namespacing
My feeling is that we should have a handful of namespaces (say 4 or 5). It becomes too cumbersome to import / remember more package names, and having everything in one package makes it hard to read scaladoc etc.

Thanks
Shivaram
Re: [discuss] DataFrame function namespacing
To add a little bit more context, some pros/cons I can think of are:

Option 1: Very easy for users to find the function, since they are all in org.apache.spark.sql.functions. However, there will be quite a large number of them.

Option 2: I can't tell why we would want this one over Option 3, since it has all the problems of Option 3 and not as nice a hierarchy.

Option 3: Opposite of Option 1. Each "package" or static class has a small number of functions that are relevant to each other, but for some functions it is unclear where they should go (e.g. should "min" go into basic or math?)
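The trade-off above can be seen in miniature in the user-facing import story (all names below are invented for illustration, not Spark's):

```scala
// Option 1 in miniature: one flat object holding everything.
object flatFunctions {
  def abs(x: Double): Double = math.abs(x)
  def upper(s: String): String = s.toUpperCase
}

// Option 3 in miniature: small themed objects under one parent.
object groupedFunctions {
  object mathFuncs   { def abs(x: Double): Double = math.abs(x) }
  object stringFuncs { def upper(s: String): String = s.toUpperCase }
}

object ImportDemo {
  def main(args: Array[String]): Unit = {
    // Option 1: a single wildcard import puts every function in scope --
    // easy to find, but a very large namespace.
    import flatFunctions._
    println(abs(-1.0))

    // Option 3: users import (or spell out) just the group they want,
    // at the cost of remembering which group a function lives in.
    import groupedFunctions.mathFuncs
    println(mathFuncs.abs(-1.0))
  }
}
```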
[discuss] DataFrame function namespacing
Before we make DataFrame non-alpha, it would be great to decide how we want to namespace all the functions. There are 3 alternatives:

1. Put all in org.apache.spark.sql.functions. This is how SQL does it, since SQL doesn't have namespaces. I estimate eventually we will have ~200 functions.

2. Have explicit namespaces, which is what master branch currently looks like:

- org.apache.spark.sql.functions
- org.apache.spark.sql.mathfunctions
- ...

3. Have explicit namespaces, but restructure them slightly so everything is under functions.

package object functions {
  // all the old functions here -- but deprecated so we keep source compatibility
  def ...
}

package org.apache.spark.sql.functions

object mathFunc {
  ...
}

object basicFuncs {
  ...
}
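One way to read the source-compatibility idea in option 3: the parent namespace re-exposes the old flat names as deprecated forwarders into the grouped objects. A minimal, self-contained sketch with plain objects standing in for the packages (the member names and bodies are invented for illustration):

```scala
object functionsSketch {
  object mathFuncs {
    def sqrt(x: Double): Double = math.sqrt(x)
  }
  object basicFuncs {
    // first defined value, akin to SQL's COALESCE
    def coalesce(xs: Option[Double]*): Option[Double] = xs.find(_.isDefined).flatten
  }

  // Old flat entry point kept so existing call sites still compile,
  // but marked deprecated to steer users toward the grouped location.
  @deprecated("use functionsSketch.mathFuncs.sqrt", "1.4.0")
  def sqrt(x: Double): Double = mathFuncs.sqrt(x)
}
```

Callers of the old flat `functionsSketch.sqrt` keep compiling (with a deprecation warning), while new code reaches for `functionsSketch.mathFuncs.sqrt`.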