Thanks for driving this discussion, Badrul. In general, I think it's a
great idea to do an assessment of coverage as a basis for discussions
regarding further development, API consistency, and improved
documentation. At the algorithm level, you will encounter subtle
differences due to different algorithmic choices, implementation
details, and related parameters. Here we should not make ourselves
dependent on existing libraries but make case-by-case decisions,
balancing various constraints with the benefits of API similarity.
By default, we stick to the names of R builtin functions of selected
packages (e.g., Matrix, stats, algorithms) and their indexing semantics
(e.g., copy on write, 1-based indexing), but we should look more broadly
(numpy, pandas) for missing functionality at the DSL and API levels. The
overall vision of SystemDS is to build up a hierarchy of builtin
functions for the entire data science lifecycle (data preparation,
cleaning, training, scoring, debugging) while still being able to
compile hybrid runtime plans for local CPU/GPU, distributed, and
federated backends.
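To make the notion of a builtin concrete, here is a minimal sketch
through the SystemDS Python API (the module paths and the lm signature
are my assumptions and may differ across versions); the point is that a
single builtin call leaves the choice of local, distributed, or
federated execution to the optimizer:

  # Minimal sketch, assuming the SystemDS Python API; module paths and
  # the lm builtin signature may differ across versions.
  import numpy as np
  from systemds.context import SystemDSContext
  from systemds.operator.algorithm import lm

  with SystemDSContext() as sds:
      X = sds.from_numpy(np.random.rand(100, 10))
      y = sds.from_numpy(np.random.rand(100, 1))
      # One builtin call; the optimizer decides how to compile the plan
      # (local CP, Spark, federated) at execution time.
      betas = lm(X, y).compute()
      print(betas.shape)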
Let's do this assessment in the main GitHub repo (e.g., as markdown
files in docs) before we put anything on the main website, as we need to
distinguish the assessment from actual documentation. Thanks.
Regards,
Matthias
On 8/5/2022 8:23 PM, Badrul Chowdhury wrote:
Thank you both for your thoughtful comments. Agreed: we should not force
parity; rather, we should make sure that SystemDS built-in functions
"cover" important use cases. I will start with an audit of SystemDS's
existing capabilities and create a PR on systemds-website
<https://github.com/apache/systemds-website> with my findings. This would
also be a good way to identify gaps in the documentation for existing
builtins so we can update it.
Thanks,
Badrul
On Tue, 2 Aug 2022 at 06:12, arnab phani <phaniar...@gmail.com> wrote:
In my understanding, parity matters if 1) frameworks share a similar
user base and use cases (sklearn, pandas, etc.), or 2) one framework
shares APIs with another (dask, modin, pandas). Otherwise, forcing
parity can be counterproductive. During our work on feature
transformations, we have seen major differences in supported feature
transformations, user APIs, and configurations among ML systems.
For instance, TensorFlow tunes its APIs based on the expected use cases
(neural networks) and data characteristics (text, images), while sklearn
aims for traditional ML jobs. Moreover, some API changes are required to
be able to use certain underlying optimizations.
Having said that, it is definitely important to support popular
builtins; however, I don't think it is necessary to use the same names,
APIs, and flags. I liked the idea of writing our documentation in a way
that helps new users draw similarities with popular libraries. A
capability matrix to map builtins from other systems to ours could be
helpful.
Regards,
Arnab..
On Tue, Aug 2, 2022 at 6:16 AM Janardhan <janard...@apache.org> wrote:
Hi Badrul,
Adding to this discussion,
I think we can start with what we already have implemented. We do not
need to implement every last function; we can choose a use-case-based
approach for best results. I would start with the present status of the
builtins (they already cover a lot of use cases!) and then implement
missing ones one by one based on priority. Most of our builtin
functions, other than the ML ones (including the NN library), are
inspired by the R language.
During implementation and testing, we might also find opportunities to
modify or optimize our system internals.
One possible approach:
1. Take an algorithm/product that is already implemented in another
system/library.
2. Find places where SystemDS can perform better: look for the
low-hanging fruit, e.g., can we use one of our Python builtins (or a
combination of them) to achieve similar or better results, and can we
improve it further? (A rough sketch of this step follows below.)
3. If so, we have identified a candidate builtin.
4. Repeat the cycle.
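As a rough illustration of steps 1 and 2 (the kmeans wrapper, its
parameter names, and the exact return shape are assumptions about the
current Python API, not a confirmed interface):

  # Hedged sketch: run a reference implementation (sklearn) and the
  # corresponding SystemDS builtin on the same data, then compare
  # result quality and runtime to spot gaps or optimization opportunities.
  import numpy as np
  from sklearn.cluster import KMeans
  from systemds.context import SystemDSContext
  from systemds.operator.algorithm import kmeans

  X_np = np.random.rand(10000, 10)

  # Reference result from the existing library.
  ref = KMeans(n_clusters=4, n_init=10).fit(X_np)

  # The same task through a SystemDS builtin.
  with SystemDSContext() as sds:
      X = sds.from_numpy(X_np)
      res = kmeans(X, k=4).compute()  # centroids and cluster assignments

  # Next: compare ref.inertia_ and assignments with res, and measure runtimes.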
Best regards,
Janardhan
On Tue, Aug 2, 2022 at 2:09 AM Badrul Chowdhury
<badrulchowdhur...@gmail.com> wrote:
Hi,
I wanted to start a discussion on building parity of built-in functions
with popular OSS libraries. I am thinking of attaining parity as a
3-step process:
*Step 1*
As far as I can tell from the existing built-in functions, SystemDS aims
to offer users a hybrid set of APIs for scientific computing and ML
(data engineering included). Therefore, the most obvious OSS libraries
for comparison would be numpy, sklearn (scipy), and pandas. Apache
DataSketches would be another relevant system for specialized use cases
(sketches).
*Step 2*
Once we have established a set of libraries, I would propose that we
create
a capability matrix with sections for each library, like so:
Section 1: numpy
f_1
f_2
[..]
f_n
Section 2: sklearn
[..]
The columns could be a checklist like this: f_i -> (DML, Python, CP, SP,
RowCol, Row, Col, Federated, documentationPublished)
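As a purely illustrative sketch (assuming pandas plus its optional
tabulate dependency; the builtin names and support flags below are
hypothetical placeholders, not an actual assessment), the matrix could
be maintained as data and rendered to markdown for the docs folder:

  # Hedged sketch: keep the capability matrix as data and render it to
  # markdown; all entries below are hypothetical placeholders.
  import pandas as pd

  columns = ["builtin", "DML", "Python", "CP", "SP", "RowCol", "Row",
             "Col", "Federated", "documentationPublished"]
  numpy_section = pd.DataFrame(
      [["f_1", True, True, True, False, True, False, False, False, False],
       ["f_2", True, True, True, True, True, True, True, False, True]],
      columns=columns)

  # to_markdown() requires the optional 'tabulate' package.
  print(numpy_section.to_markdown(index=False))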
*Step 3*
Create JIRA tasks, assign them, and start coding.
Thoughts?
Thanks,
Badrul