Hi Arnab,
The changes I contributed are as follows:
Built-ins:
- dropInvalidLength() and dropInvalidType(): frame built-ins for
data cleaning using schema and length information.
- glm(): Generalized Linear Model added as a built-in from our
algorithms.
- imputeFD(): for missing value imputation using robust functional
dependencies.
- Update to the existing built-in MICE (now works on matrices
instead of frames).
- map() for supporting lambda expressions.
- smote(): an oversampling technique for class imbalance.
- na_locf(): built-in for forward and backward NA filling.
- gmm(): Gaussian mixture model (experimental feature)
Binary Operations:
- Comparison operations for frame-frame ops.
Feel free to make any changes you deem necessary.
Best Regards,
Shafaq Siddiqi
On 9/7/2020 9:51 AM, Baunsgaard, Sebastian wrote:
Hi Arnab,
Here is my list, feel free to remove elements 😊
Major:
- Refactor Compression package and add functions
- Add quantization for lossy compression
- Generalize column groups to use same base dictionary
- Binary cell operations
- Left Matrix Multiplication
- GitHub actions for automated testing
- Improved compile times and packaging
- Docker containers for systemds, pythonsystemds and testingsystemds
Minor:
- python PCA and MultiLogReg algorithms
- parallel sort
- parallel detect schema
- URL handler for federated
- Distinct values count / estimation function
- Simplified Log4J from being Hadoop based to our own
- Handle NaStrings in CSV reading frame and matrix
- Re-enable code coverage tools
Removed:
- GitHub pages, for documentation and moved to master
- Travis testing
Best regards
Sebastian
________________________________
From: arnab phani <phaniar...@gmail.com>
Sent: Monday, September 7, 2020 9:26:12 AM
To: dev@systemds.apache.org
Subject: Re: [DISCUSS] Apache SystemDS 2.0 Release
Thanks Kevin.
Other committers: once you get a chance, please send me your contributions
too.
Regards,
Arnab..
On Wed, Sep 2, 2020 at 10:04 PM Kevin Innerebner <
innereb...@student.tugraz.at> wrote:
Hi,
here are the changes I contributed after March 24:
- Added SystemDSContext to the Python API (now necessary for operations)
- Added federated frames
- Federated transform-encode, -decode and -apply (missing value
imputation is still an ongoing PR, I think it will be merged in before
release)
- New builtin `colnames()` to get the column names of a frame
That should be everything from my side.
Regards,
Kevin
On 9/1/20 11:36 AM, arnab phani wrote:
Hi All,
As we are nearing the release, I am starting to focus on the release
notes.
Notes for SystemDS 2.0 release should consolidate all the things that
happened since Aug 2018 (last SystemML release).
While I will aggregate the notes from two SystemDS releases, it will be
great if you can update me with a few lines summarizing the additions to
your features (including the external contributions), especially after
March 24, 2020 (last SystemDS release).
Once ready, I will share for everyone to have a look.
Regards,
Arnab..
On Mon, Aug 31, 2020 at 8:34 PM Matthias Boehm <mboe...@gmail.com>
wrote:
thanks Arnab for looking over the remaining open issues. Together with
Shafaq, we just came across two additional bugs related to eval function
calls. These fixes should go into the RC and I intend to fix them as
soon as possible.
Regards,
Matthias
On 8/27/2020 8:41 PM, arnab phani wrote:
Hi All,
Currently, I see only a few issues flagged for the 2.0 release. Can you
please go through your open issues and check if the Fix-Version is set?
Also, if a JIRA task doesn't exist for something you are working on or
want to have in the coming release, please open a task and flag it for 2.0.
Regards,
Arnab..
On Thu, Aug 20, 2020 at 8:18 PM Matthias Boehm <mboe...@gmail.com>
wrote:
as the target release date end of August comes closer, I'd like to share
that Arnab Phani kindly volunteered in an offline discussion to act as
the release manager for our 2.0 release.
Please, flag issues and features you think are important for the 2.0
release as such in JIRA so we can monitor them, discuss them on a case
by case basis, and push the release date if necessary. Thanks.
Regards,
Matthias
On 8/17/2020 2:51 PM, Janardhan wrote:
Hi,
The following is the status of the MLContext test for algorithms.
1. l2svm, msvm, PCA - scripts are running, but results are not equal to R
2. Autoencoder, StepwiseReg - Scripts are not running
3. KMeans, GLM (need to fix R) - No R script
Thank you,
Janardhan
On Fri, Jul 10, 2020 at 2:29 AM Matthias Boehm <mboe...@gmail.com>
wrote:
thanks for the perspective, I think we should be very pragmatic
regarding languages. Let's stick to DML as our domain-specific language
with R-like syntax, but add language bindings such as the Python API
(and others) to seamlessly plug into common data science workflows. A
similar mindset worked very well in the internals too: Java for nicely
integrating with Hadoop/Spark and simplicity, but with C++ and CUDA
kernels and native libraries where necessary.
Regards,
Matthias
On 7/9/2020 3:54 PM, Janardhan wrote:
DML - %*% seems more intuitive compared to @. Let us not change the
syntax (our selling point: easy porting to R!)
Python - no solid opinion
- Janardhan
On Thu, 9 Jul, 2020, 19:06 Matthias Boehm, <mboe...@gmail.com>
wrote:
for the Python API this is fine, but not for DML, as we should stick as
close as possible to R syntax. Once we had a pydml syntax too, but this
created lots of inconsistencies and could not use Python as a host
language. So, I think restricting such changes to the Python API is a
good path forward. Other opinions?
Regards,
Matthias
On 7/9/2020 3:31 PM, Baunsgaard, Sebastian wrote:
Hi all
Can I suggest a radical change to matrix multiply:
changing the operator from %*% to @?
Python has made this commitment!
https://www.python.org/dev/peps/pep-0465/
or at least change this in the python API?
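For reference, a minimal sketch of what PEP 465 provides on the Python side: the expression `a @ b` dispatches to `a.__matmul__(b)`. The `Matrix` class below is a hypothetical stand-in written only for illustration; it is not the SystemDS Python API, and how an actual binding would expose matrix multiply is left open here.

```python
# Hypothetical stand-in for a matrix type; NOT the SystemDS Python API.
class Matrix:
    def __init__(self, rows):
        # Store a copy of the data as a list of row lists.
        self.rows = [list(r) for r in rows]

    def __matmul__(self, other):
        # PEP 465: `a @ b` calls a.__matmul__(b).
        if len(self.rows[0]) != len(other.rows):
            raise ValueError("inner dimensions do not match")
        # Transpose the right operand so its columns can be iterated.
        cols = list(zip(*other.rows))
        return Matrix(
            [[sum(x * y for x, y in zip(row, col)) for col in cols]
             for row in self.rows]
        )


a = Matrix([[1, 2], [3, 4]])
b = Matrix([[5, 6], [7, 8]])
c = a @ b  # what DML writes as: a %*% b
```

A binding that defines `__matmul__` this way would let Python users write `@` while DML itself keeps `%*%`.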
Best regards
Sebastian
________________________________
From: Matthias Boehm <mboe...@gmail.com>
Sent: Wednesday, July 8, 2020 11:04:12 PM
To: dev@systemds.apache.org
Subject: [DISCUSS] Apache SystemDS 2.0 Release
Hi all,
I'd like to propose Aug 31 as a target date for the SystemDS 2.0 release
(feature freeze August 21). This should give us enough time to figure
out the list of things that still should go into this release, as a major
release is an opportunity for changes of external behavior. However, as
it's the first SystemDS Apache release, I think we should still stick to
Spark 2.x and Java 8 and consider upgrades of Spark and the JDK for
subsequent releases. So, what do you think, and are there any major
features you'd like to see complete for 2.0?
Regards,
Matthias