Re: [DISCUSS] Apache SystemDS 2.0 Release

Janardhan Mon, 07 Sep 2020 10:56:19 -0700

Hi Arnab,

Our team has contributed the following:


1. Thoroughly documented the builtin functions
2. Starter template for working with Databricks and Colab

Thank you,
Janardhan

On Mon, Sep 7, 2020 at 6:59 PM Shafaq Siddiqi <[email protected]>
wrote:

> Hi Arnab,
>
> The changes I contributed are as follows:
>
> Built-ins:
>     -   dropInvalidLength() and dropInvalidType(): frame built-ins for
> data cleaning using schema and length information.
>     -   glm(): Generalized Linear Model added as a built-in from our
> algorithms.
>     -   imputeFD(): for missing value imputation using robust functional
> dependencies.
>     -   Update in an existing built-in MICE (now works on matrices
> instead of frames).
>     -   map() for supporting lambda expressions.
>     -   smote(): an oversampling technique for class imbalance.
>     -   na_locf(): built-in for forward and backward NA filling.
>     -   gmm(): Gaussian mixture model (experimental feature)
>
> Binary Operations:
>    -   Comparison operations for frame-frame ops.
>
> Feel free to make any changes you deem necessary.
>
> Best Regards,
> Shafaq Siddiqi
>
> On 9/7/2020 9:51 AM, Baunsgaard, Sebastian wrote:
> > Hi Arnab,
> >
> > Here is my list, feel free to remove elements 😊
> >
> > Major:
> >
> > - Refactor Compression package and add functions
> >    - add Quantization for lossy compression
> >    - Generalize column groups to use same base dictionary
> >    - Binary cell operations
> >    - Left Matrix Multiplication
> > - GitHub actions for automated testing
> > - Improved compile times and packaging
> > - Docker containers for systemds, pythonsystemds and testingsystemds
> >
> > Minor:
> >
> > - Python PCA and MultiLogReg algorithms
> > - parallel sort
> > - parallel detect schema
> > - URL handler for federated
> > - Distinct values count / estimation function
> > - Simplified Log4J from being Hadoop based to our own
> > - Handle NaStrings in CSV reading frame and matrix
> > - Re-enable code coverage tools
> >
> > Removed:
> >
> > - GitHub Pages for documentation (moved to master)
> > - Travis testing
> >
> >
> > Best regards
> >
> > Sebastian
> >
> > ________________________________
> > From: arnab phani <[email protected]>
> > Sent: Monday, September 7, 2020 9:26:12 AM
> > To: [email protected]
> > Subject: Re: [DISCUSS] Apache SystemDS 2.0 Release
> >
> > Thanks Kevin.
> >
> > Other committers: once you get a chance, please send me your
> contributions
> > too.
> >
> > Regards,
> > Arnab..
> >
> > On Wed, Sep 2, 2020 at 10:04 PM Kevin Innerebner <
> > [email protected]> wrote:
> >
> >> Hi,
> >>
> >> here are the changes I contributed after March 24:
> >>
> >> - Added SystemDSContext to the Python API (now necessary for operations)
> >>
> >> - Added federated frames
> >>
> >> - Federated transform-encode, -decode and -apply (missing value
> >> imputation is still an ongoing PR, I think it will be merged in before
> >> release)
> >>
> >> - New builtin `colnames()` to get the column names of a frame
> >>
> >> That should be everything from my side.
> >>
> >> Regards,
> >> Kevin
> >>
> >> On 9/1/20 11:36 AM, arnab phani wrote:
> >>> Hi All,
> >>>
> >>> As we are nearing the release, I am starting to focus on the release
> >> notes.
> >>> Notes for SystemDS 2.0 release should consolidate all the things that
> >>> happened since Aug 2018 (last SystemML release).
> >>> While I will aggregate the notes from two SystemDS releases, it will be
> >>> great if you can update me with a few lines summarizing the additions
> to
> >>> your features (including the external contributions), especially after
> >>> March 24, 2020 (last SystemDS release).
> >>>
> >>> Once ready, I will share for everyone to have a look.
> >>>
> >>> Regards,
> >>> Arnab..
> >>>
> >>> On Mon, Aug 31, 2020 at 8:34 PM Matthias Boehm <[email protected]>
> >> wrote:
> >>>> thanks Arnab for looking over the remaining open issues. Together with
> >>>> Shafaq, we just came across two additional bugs related to eval
> function
> >>>> calls. These fixes should go into the RC and I intend to fix them as
> >>>> soon as possible.
> >>>>
> >>>> Regards,
> >>>> Matthias
> >>>>
> >>>> On 8/27/2020 8:41 PM, arnab phani wrote:
> >>>>> Hi All,
> >>>>>
> >>>>> Currently, I see only a few issues are flagged for 2.0 release. Can
> you
> >>>>> please go through your open issues and check if the Fix-Version is
> set?
> >>>>> Also, if a JIRA task doesn't exist for something you are working on
> or
> >>>> want
> >>>>> to have in the coming release, please open a task and flag it for
> 2.0.
> >>>>>
> >>>>> Regards,
> >>>>> Arnab..
> >>>>>
> >>>>> On Thu, Aug 20, 2020 at 8:18 PM Matthias Boehm <[email protected]>
> >>>> wrote:
> >>>>>> as the target release date end of August comes closer, I'd like to
> >> share
> >>>>>> that Arnab Phani kindly volunteered in an offline discussion to act
> as
> >>>>>> the release manager for our 2.0 release.
> >>>>>>
> >>>>>> Please, flag issues and features you think are important for the 2.0
> >>>>>> release as such in JIRA so we can monitor them, discuss them on a
> case
> >>>>>> by case basis, and push the release date if necessary. Thanks.
> >>>>>>
> >>>>>> Regards,
> >>>>>> Matthias
> >>>>>>
> >>>>>> On 8/17/2020 2:51 PM, Janardhan wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> The following is the status of the MLContext test for algorithms.
> >>>>>>>
> >>>>>>> 1. l2svm, msvm, PCA - scripts are running + results are not equal
> to
> >> R
> >>>>>>> 2. Autoencoder, StepwiseReg - Scripts are not running
> >>>>>>> 3. KMeans, GLM (need to fix R) - No R script
> >>>>>>>
> >>>>>>> Thank you,
> >>>>>>> Janardhan
> >>>>>>>
> >>>>>>> On Fri, Jul 10, 2020 at 2:29 AM Matthias Boehm <[email protected]>
> >>>>>> wrote:
> >>>>>>>> thanks for the perspective, I think we should be very pragmatic
> >>>>>>>> regarding languages. Let's stick to DML as our domain-specific
> >>>> language
> >>>>>>>> with R-like syntax, but add language bindings such as the Python
> API
> >>>>>>>> (and others) to seamlessly plug into common data science
> workflows.
> >> A
> >>>>>>>> similar mindset worked very well in the internals too: Java for
> >>>> nicely
> >>>>>>>> integrating with Hadoop/Spark and simplicity, but with C++ and
> CUDA
> >>>>>>>> kernels and native libraries where necessary.
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Matthias
> >>>>>>>>
> >>>>>>>> On 7/9/2020 3:54 PM, Janardhan wrote:
> >>>>>>>>> DML - %*% seems more intuitive compared to @. Let us not change the
> >>>>>>>>> syntax (our selling point: easy porting to R!)
> >>>>>>>>> Python - no solid opinion
> >>>>>>>>>
> >>>>>>>>> - Janardhan
> >>>>>>>>>
> >>>>>>>>> On Thu, 9 Jul, 2020, 19:06 Matthias Boehm, <[email protected]>
> >>>> wrote:
> >>>>>>>>>> for the Python API this is fine, for DML not as we should stick
> as
> >>>>>> close
> >>>>>>>>>> as possible to R syntax. Once we had a pydml syntax too, but
> this
> >>>>>>>>>> created lots of inconsistencies and could not use Python as a
> host
> >>>>>>>>>> language. So, I think restricting such changes to the Python API
> >> is
> >>>> a
> >>>>>>>>>> good path forward. Other opinions?
> >>>>>>>>>>
> >>>>>>>>>> Regards,
> >>>>>>>>>> Matthias
> >>>>>>>>>>
> >>>>>>>>>> On 7/9/2020 3:31 PM, Baunsgaard, Sebastian wrote:
> >>>>>>>>>>> Hi all
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Can I suggest a radical change to matrix multiply:
> >>>>>>>>>>> changing the operator from %*% to @?
> >>>>>>>>>>>
> >>>>>>>>>>> Python has made this commitment!
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> https://www.python.org/dev/peps/pep-0465/
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Or at least change this in the Python API?
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Best regards
> >>>>>>>>>>>
> >>>>>>>>>>> Sebastian
> >>>>>>>>>>>
> >>>>>>>>>>> ________________________________
> >>>>>>>>>>> From: Matthias Boehm <[email protected]>
> >>>>>>>>>>> Sent: Wednesday, July 8, 2020 11:04:12 PM
> >>>>>>>>>>> To: [email protected]
> >>>>>>>>>>> Subject: [DISCUSS] Apache SystemDS 2.0 Release
> >>>>>>>>>>>
> >>>>>>>>>>> Hi all,
> >>>>>>>>>>>
> >>>>>>>>>>> I'd like to propose Aug 31 as a target date for the SystemDS
> 2.0
> >>>>>>>> release
> >>>>>>>>>>> (feature freeze August 21). This should give us enough time to
> >>>>>> figure
> >>>>>>>>>>> out the list of things that still should go into this release
> as
> >>>> it's
> >>>>>>>> an
> >>>>>>>>>>> opportunity of a major release for changes of external behavior.
> However,
> >>>> as
> >>>>>>>>>>> it's the first SystemDS Apache release, I think we should still
> >>>> stick
> >>>>>>>> to
> >>>>>>>>>>> Spark 2.x and Java 8 and consider upgrades of Spark and the JDK
> >> for
> >>>>>>>>>>> subsequent releases. So, what do you think and any major
> features
> >>>>>> you'd
> >>>>>>>>>>> like to see complete for 2.0?
> >>>>>>>>>>>
> >>>>>>>>>>> Regards,
> >>>>>>>>>>> Matthias
> >>>>>>>>>>>
>
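
The `%*%` versus `@` question raised in the thread (keep `%*%` in DML, allow
`@` only in the Python API) can be illustrated with a minimal sketch. The
`Expr` class and its `dml` attribute below are hypothetical names made up for
this illustration; they are not the SystemDS Python API. The sketch only
assumes what PEP 465 specifies, namely that `a @ b` dispatches to
`a.__matmul__(b)`, and shows how a Python binding could accept `@` while
still emitting R-like DML.

```python
class Expr:
    """Hypothetical expression node holding a DML string (not the SystemDS API)."""

    def __init__(self, dml: str):
        self.dml = dml

    def __matmul__(self, other: "Expr") -> "Expr":
        # PEP 465: `a @ b` calls a.__matmul__(b).
        # The generated DML keeps the R-like `%*%` operator.
        return Expr(f"({self.dml} %*% {other.dml})")


X = Expr("X")
W = Expr("W")
Y = X @ W
print(Y.dml)  # (X %*% W)
```

With this kind of binding the DML language itself stays unchanged, which
matches the direction proposed in the thread: restrict the `@` syntax to the
host-language API.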
