Thanks for driving this discussion, Badrul. In general, I think it's a
great idea to do an assessment of coverage as a basis for discussions
regarding further development, API consistency, and improved
documentation. At the algorithm level, you will encounter subtle
differences due to different algorithmic choices, implementation
details, and related parameters. Here we should not make ourselves
dependent on existing libraries but make case-by-case decisions,
balancing various constraints with the benefits of API similarity.
By default, we stick to the names of R builtin functions of selected
packages (e.g., Matrix, stats, algorithms) and their indexing semantics
(e.g., copy on write, 1-based indexing), but we should look more broadly
(numpy, pandas) for missing functionality at the DSL and API levels. The
overall vision of SystemDS is to build up a hierarchy of builtin
functions for the entire data science lifecycle (data preparation,
cleaning, training, scoring, debugging) while still being able to
compile hybrid runtime plans for local CPU/GPU, distributed, and
federated backends.
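To make the notion of a builtin concrete, here is a minimal sketch
through the SystemDS Python API (the module paths and the lm signature
are my assumptions and may differ across versions); the point is that a
single builtin call leaves the choice of local, distributed, or
federated execution to the optimizer:

  # Minimal sketch, assuming the SystemDS Python API; module paths and
  # the lm builtin signature may differ across versions.
  import numpy as np
  from systemds.context import SystemDSContext
  from systemds.operator.algorithm import lm

  with SystemDSContext() as sds:
      X = sds.from_numpy(np.random.rand(100, 10))
      y = sds.from_numpy(np.random.rand(100, 1))
      # One builtin call; the optimizer decides how to compile the plan
      # (local CP, Spark, federated) at execution time.
      betas = lm(X, y).compute()
      print(betas.shape)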
Let's do this assessment in the main GitHub repo (e.g., as markdown
files in docs) before we put anything on the main website, as we need to
distinguish the assessment from actual documentation. Thanks.
Regards,
Matthias
On 8/5/2022 8:23 PM, Badrul Chowdhury wrote:
Thank you both for your thoughtful comments. Agreed: we should not force
parity; rather, we should make sure that SystemDS built-in functions
"cover" important use cases. I will start with an audit of SystemDS's
existing capabilities and create a PR on systemds-website
<https://github.com/apache/systemds-website> with my findings. This would
also be a good way to identify gaps in the documentation for existing
builtins so we can update it.
Thanks,
Badrul
On Tue, 2 Aug 2022 at 06:12, arnab phani <phaniar...@gmail.com> wrote:
In my understanding, parity matters if 1) frameworks share a similar
user base and use cases (sklearn, pandas, etc.), or 2) one framework
shares APIs with another (dask, modin, pandas). Otherwise, forcing
parity can be counterproductive. During our work on feature
transformations, we have seen major differences in supported feature
transformations, user APIs, and configurations among ML systems.
For instance, TensorFlow tunes its APIs based on the expected use cases
(neural networks) and data characteristics (text, images), while sklearn
aims for traditional ML jobs. Moreover, some API changes are required to
be able to use certain underlying optimizations.
Having said that, it is definitely important to support popular
builtins; however, I don't think it is necessary to use the same names,
APIs, and flags. I liked the idea of writing our documentation in a way
that helps new users draw similarities with popular libraries. A
capability matrix to map builtins from other systems to ours could be
helpful.
Regards,
Arnab..
On Tue, Aug 2, 2022 at 6:16 AM Janardhan <janard...@apache.org> wrote:
Hi Badrul,
Adding to this discussion,
I think we can start with what we already have implemented. We do not
need to implement every last function; we can choose a use-case-based
approach for best results. I would start with the present status of the
builtins (they already cover a lot of use cases!) and then implement
missing ones one by one based on priority. Most of our builtin
functions, other than the ML ones (including the NN library), are
inspired by the R language.
During implementation and testing, we might also find opportunities to
modify or optimize our system internals.
One possible approach:
1. Take an algorithm/product that is already implemented in another
system/library.
2. Find places where SystemDS can perform better: look for the
low-hanging fruit, e.g., can we use one of our Python builtins (or a
combination of them) to achieve similar or better results, and can we
improve it further? (A rough sketch of this step follows below.)
3. If so, we have identified a candidate builtin.
4. Repeat the cycle.
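As a rough illustration of steps 1 and 2 (the kmeans wrapper, its
parameter names, and the exact return shape are assumptions about the
current Python API, not a confirmed interface):

  # Hedged sketch: run a reference implementation (sklearn) and the
  # corresponding SystemDS builtin on the same data, then compare
  # result quality and runtime to spot gaps or optimization opportunities.
  import numpy as np
  from sklearn.cluster import KMeans
  from systemds.context import SystemDSContext
  from systemds.operator.algorithm import kmeans

  X_np = np.random.rand(10000, 10)

  # Reference result from the existing library.
  ref = KMeans(n_clusters=4, n_init=10).fit(X_np)

  # The same task through a SystemDS builtin.
  with SystemDSContext() as sds:
      X = sds.from_numpy(X_np)
      res = kmeans(X, k=4).compute()  # centroids and cluster assignments

  # Next: compare ref.inertia_ and assignments with res, and measure runtimes.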
Best regards,
Janardhan
On Tue, Aug 2, 2022 at 2:09 AM Badrul Chowdhury
<badrulchowdhur...@gmail.com> wrote:
Hi,
I wanted to start a discussion on building parity of built-in functions
with popular OSS libraries. I am thinking of attaining parity as a
3-step process:
*Step 1*
As far as I can tell from the existing built-in functions, SystemDS aims
to offer users a hybrid set of APIs for scientific computing and ML
(data engineering included). Therefore, the most obvious OSS libraries
for comparison would be numpy, sklearn (scipy), and pandas. Apache
DataSketches would be another relevant system for specialized use cases
(sketches).
*Step 2*
Once we have established a set of libraries, I would propose that we
create
a capability matrix with sections for each library, like so:
Section 1: numpy
f_1
f_2
[..]
f_n
Section 2: sklearn
[..]
The columns could be a checklist like this: f_i -> (DML, Python, CP, SP,
RowCol, Row, Col, Federated, documentationPublished)
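As a purely illustrative sketch (assuming pandas plus its optional
tabulate dependency; the builtin names and support flags below are
hypothetical placeholders, not an actual assessment), the matrix could
be maintained as data and rendered to markdown for the docs folder:

  # Hedged sketch: keep the capability matrix as data and render it to
  # markdown; all entries below are hypothetical placeholders.
  import pandas as pd

  columns = ["builtin", "DML", "Python", "CP", "SP", "RowCol", "Row",
             "Col", "Federated", "documentationPublished"]
  numpy_section = pd.DataFrame(
      [["f_1", True, True, True, False, True, False, False, False, False],
       ["f_2", True, True, True, True, True, True, True, False, True]],
      columns=columns)

  # to_markdown() requires the optional 'tabulate' package.
  print(numpy_section.to_markdown(index=False))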
*Step 3*
Create JIRA tasks, assign them, and start coding.
Thoughts?
Thanks,
Badrul