After talking with people on this thread and offline, I've decided to go with option 1, i.e. putting everything in a single "functions" object.
On Thu, Apr 30, 2015 at 10:04 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> IMHO I would go with choice #1
>
> Cheers
>
> On Wed, Apr 29, 2015 at 10:03 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> We definitely still have the name collision problem in SQL.
>>
>> On Wed, Apr 29, 2015 at 10:01 PM, Punyashloka Biswal
>> <punya.bis...@gmail.com> wrote:
>>
>>> Do we still have to keep the names of the functions distinct to avoid
>>> collisions in SQL? Or is there a plan to allow "importing" a namespace
>>> into SQL somehow?
>>>
>>> I ask because if we have to keep worrying about name collisions, then I'm
>>> not sure what the added complexity of #2 and #3 buys us.
>>>
>>> Punya
>>>
>>> On Wed, Apr 29, 2015 at 3:52 PM Reynold Xin <r...@databricks.com> wrote:
>>>
>>>> Scaladoc isn't much of a problem because scaladocs are grouped.
>>>> Java/Python is the main problem ...
>>>>
>>>> See
>>>> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
>>>>
>>>> On Wed, Apr 29, 2015 at 3:38 PM, Shivaram Venkataraman
>>>> <shiva...@eecs.berkeley.edu> wrote:
>>>>
>>>>> My feeling is that we should have a handful of namespaces (say 4 or 5).
>>>>> It becomes too cumbersome to import / remember more package names, and
>>>>> having everything in one package makes it hard to read scaladoc etc.
>>>>>
>>>>> Thanks
>>>>> Shivaram
>>>>>
>>>>> On Wed, Apr 29, 2015 at 3:30 PM, Reynold Xin <r...@databricks.com> wrote:
>>>>>
>>>>>> To add a little bit more context, some pros/cons I can think of are:
>>>>>>
>>>>>> Option 1: Very easy for users to find the functions, since they are
>>>>>> all in org.apache.spark.sql.functions. However, there will be quite a
>>>>>> large number of them.
>>>>>>
>>>>>> Option 2: I can't tell why we would want this one over Option 3, since
>>>>>> it has all the problems of Option 3, and not as nice of a hierarchy.
>>>>>>
>>>>>> Option 3: Opposite of Option 1. Each "package" or static class has a
>>>>>> small number of functions that are relevant to each other, but for
>>>>>> some functions it is unclear where they should go (e.g. should "min"
>>>>>> go into basic or math?)
>>>>>>
>>>>>> On Wed, Apr 29, 2015 at 3:21 PM, Reynold Xin <r...@databricks.com> wrote:
>>>>>>
>>>>>>> Before we make DataFrame non-alpha, it would be great to decide how
>>>>>>> we want to namespace all the functions. There are 3 alternatives:
>>>>>>>
>>>>>>> 1. Put all in org.apache.spark.sql.functions. This is how SQL does
>>>>>>> it, since SQL doesn't have namespaces. I estimate eventually we will
>>>>>>> have ~200 functions.
>>>>>>>
>>>>>>> 2. Have explicit namespaces, which is what the master branch
>>>>>>> currently looks like:
>>>>>>>
>>>>>>> - org.apache.spark.sql.functions
>>>>>>> - org.apache.spark.sql.mathfunctions
>>>>>>> - ...
>>>>>>>
>>>>>>> 3. Have explicit namespaces, but restructure them slightly so
>>>>>>> everything is under functions:
>>>>>>>
>>>>>>> package object functions {
>>>>>>>   // all the old functions here -- but deprecated, so we keep
>>>>>>>   // source compatibility
>>>>>>>   def ...
>>>>>>> }
>>>>>>>
>>>>>>> package org.apache.spark.sql.functions
>>>>>>>
>>>>>>> object mathFunc {
>>>>>>>   ...
>>>>>>> }
>>>>>>>
>>>>>>> object basicFuncs {
>>>>>>>   ...
>>>>>>> }
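For readers following along, here is a minimal sketch of what the decided option 1 looks like from user code: one flat wildcard import, with math and "basic" functions coming from the same object. This is illustrative only; it uses the later `SparkSession` entry point (not the `SQLContext` of the 1.x era this thread dates from), and the column names are made up.

```scala
// Option 1 as decided: every built-in function lives in the single
// object org.apache.spark.sql.functions, pulled in with one import.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._ // one flat namespace

object FunctionsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").getOrCreate()
    import spark.implicits._

    // Hypothetical toy data for illustration.
    val df = Seq((1, -2.0), (2, 4.5)).toDF("id", "x")

    // abs and pow resolve from the same object -- no need to remember
    // which sub-namespace holds each one (the concern raised about
    // option 3, e.g. whether "min" is "basic" or "math").
    df.select(abs(col("x")), pow(col("x"), 2)).show()

    spark.stop()
  }
}
```

The trade-off accepted here is the one Reynold lists above: discoverability is maximal, at the cost of a single very large scaladoc/javadoc page.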