In my understanding, parity matters if (1) the frameworks share a similar user
base and use cases (sklearn, pandas, etc.),
or (2) one framework shares its APIs with another (dask and modin with pandas).
Otherwise, forcing parity can be counterproductive. During our work on
feature transformations,
we have seen major differences in the supported transformations, user
APIs, and configurations among ML systems.
For instance, TensorFlow tailors its APIs to the expected use cases
(neural networks) and data
characteristics (text, images), while sklearn targets traditional ML workloads.
Moreover, some API changes are
required to enable certain underlying optimizations.
Having said that, it is definitely important to support popular builtins;
however, I don't think it is necessary to
use the same names, APIs, and flags. I like the idea of writing our
documentation in a way that helps new users draw
similarities with popular libraries. A capability matrix mapping builtins
from other systems to ours could be helpful.
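To make the mapping idea concrete, here is a minimal sketch of what such a
capability matrix could look like in code. Every function name, builtin name,
and backend flag below is an illustrative placeholder, not a statement about
actual SystemDS coverage:

```python
# Hypothetical sketch: map a popular library's function to our builtin's
# name plus a per-backend support checklist. All entries are placeholders.
from dataclasses import dataclass, field


@dataclass
class BuiltinEntry:
    ours: str                                    # our builtin's name ("" if missing)
    supported: set = field(default_factory=set)  # e.g. {"DML", "Python", "CP", "SP"}


CAPABILITY_MATRIX = {
    "numpy": {
        "mean": BuiltinEntry("mean", {"DML", "Python", "CP", "SP"}),
        "argsort": BuiltinEntry("order", {"DML", "CP"}),
    },
    "sklearn": {
        "StandardScaler": BuiltinEntry("scale", {"DML", "Python"}),
    },
}


def missing_builtins(library: str, backend: str) -> list:
    """List functions of `library` not yet flagged as supported on `backend`."""
    return sorted(
        name
        for name, entry in CAPABILITY_MATRIX.get(library, {}).items()
        if backend not in entry.supported
    )


print(missing_builtins("numpy", "SP"))  # -> ['argsort']
```

A table like this doubles as documentation for newcomers ("what is numpy's X
called here?") and as a backlog generator for unimplemented combinations.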

Regards,
Arnab..

On Tue, Aug 2, 2022 at 6:16 AM Janardhan <janard...@apache.org> wrote:

> Hi Badrul,
>
> Adding to this discussion,
> I think we can start with what we already have implemented. We do not
> need to implement every last function; we can take a use-case-based
> approach for best results. I would start with the present status of
> the builtins - they are enough for a lot of use cases! - and then
> implement the rest one by one based on priority. Most of our builtin
> functions other than ML (including the NN library) are inspired by the
> R language.
>
> During implementation and testing, we might find opportunities to
> modify and optimize our system internals.
>
> One possible approach:
> 1. Take an algorithm/product that is already implemented in another
> system/library.
> 2. Find places where SystemDS can perform better. Look for the
> low-hanging fruit: can we use one of our Python builtins, or a
> combination of them, to achieve similar or better results, and can we
> improve it further?
> 3. If so, we have identified a candidate for a builtin.
> 4. Repeat the cycle.
>
>
> Best regards,
> Janardhan
>
>
>
> On Tue, Aug 2, 2022 at 2:09 AM Badrul Chowdhury
> <badrulchowdhur...@gmail.com> wrote:
> >
> > Hi,
> >
> > I wanted to start a discussion on building parity of built-in functions
> > with popular OSS libraries. I am thinking of attaining parity as a 3-step
> > process:
> >
> > *Step 1*
> > As far as I can tell from the existing built-in functions, SystemDS aims
> to
> > offer a hybrid set of APIs for scientific computing and ML (data
> > engineering included) to users. Therefore, the most obvious OSS libraries
> > for comparison would be numpy, sklearn (scipy), and pandas. Apache
> > DataSketches would be another relevant system for specialized use cases
> > (sketches).
> >
> > *Step 2*
> > Once we have established a set of libraries, I would propose that we
> create
> > a capability matrix with sections for each library, like so:
> >
> > Section 1: numpy
> >
> > f_1
> > f_2
> > [..]
> > f_n
> >
> > Section 2: sklearn
> >
> > [..]
> >
> > The columns could be a checklist like this: f_i -> (DML, Python, CP, SP,
> > RowCol, Row, Col, Federated, documentationPublished)
> >
> > *Step 3*
> > Create JIRA tasks, assign them, and start coding.
> >
> >
> > Thoughts?
> >
> >
> > Thanks,
> > Badrul
>
