Thanks for sharing your thoughts, Matthias! Ack, I will create the PR in the main repo.
Thanks,
Badrul

On Fri, 5 Aug 2022 at 11:42, Matthias Boehm <mboe...@gmail.com> wrote:

> thanks for driving this discussion Badrul. In general, I think it's a
> great idea to do an assessment of coverage as a basis for discussions
> regarding further development, API consistency, and improved
> documentation. At the algorithm level, you will encounter subtle
> differences due to different algorithmic choices, implementation
> details, and related parameters. Here we should not make ourselves
> dependent on existing libraries but make case-by-case decisions,
> balancing various constraints with the benefits of API similarity.
>
> By default, we stick to the names of R builtin functions from selected
> packages (e.g., Matrix, stats, algorithms) and their indexing semantics
> (e.g., copy on write, 1-based indexing), but we should look more
> broadly (numpy, pandas) for missing functionality at the DSL and API
> level. The overall vision of SystemDS is to build up a hierarchy of
> builtin functions for the entire data science lifecycle (data
> preparation, cleaning, training, scoring, debugging) while still being
> able to compile hybrid runtime plans for local CPU/GPU, distributed,
> and federated backends.
>
> Let's do this assessment in the main GitHub repo (e.g., as markdown
> files in docs) before we put anything on the main website, as we need
> to distinguish the assessment from actual documentation. Thanks.
>
> Regards,
> Matthias
>
> On 8/5/2022 8:23 PM, Badrul Chowdhury wrote:
> > Thank you both for your thoughtful comments. Agreed: we should not
> > force parity; rather, we should make sure that SystemDS built-in
> > functions "cover" important use cases. I will start with an audit of
> > SystemDS's existing capabilities and create a PR on systemds-website
> > <https://github.com/apache/systemds-website> with my findings. This
> > would also be a good way to identify gaps in the documentation for
> > existing builtins so we can update it.
> >
> > Thanks,
> > Badrul
> >
> > On Tue, 2 Aug 2022 at 06:12, arnab phani <phaniar...@gmail.com> wrote:
> >
> >> In my understanding, parity matters if 1) frameworks share a similar
> >> user base and use cases (sklearn, pandas, etc.), or 2) one framework
> >> shares APIs with another (dask, modin, pandas). Otherwise, forcing
> >> parity can be counterproductive. During our work on feature
> >> transformations, we have seen major differences in supported feature
> >> transformations, user APIs, and configurations among ML systems. For
> >> instance, TensorFlow tunes its APIs based on the expected use cases
> >> (neural networks) and data characteristics (text, images), while
> >> sklearn aims for traditional ML jobs. Moreover, some API changes are
> >> required to be able to use certain underlying optimizations.
> >> Having said that, it is definitely important to support popular
> >> builtins; however, I don't think it is necessary to use the same
> >> names, APIs, and flags. I liked the idea of writing our
> >> documentation in a way that helps new users draw similarities with
> >> popular libraries. A capability matrix that maps builtins from other
> >> systems to ours can be helpful.
> >>
> >> Regards,
> >> Arnab.
> >>
> >> On Tue, Aug 2, 2022 at 6:16 AM Janardhan <janard...@apache.org> wrote:
> >>
> >>> Hi Badrul,
> >>>
> >>> Adding to this discussion, I think we can start with what we
> >>> already have implemented. We do not need to implement every last
> >>> function; we can choose a use-case-based approach for best results.
> >>> I would start with the present status of the builtins (they are
> >>> enough for a lot of use cases!) and then implement new ones one by
> >>> one based on priority. Most of our builtin functions other than ML
> >>> (including the NN library) are inspired by the R language.
> >>>
> >>> During implementation/testing, we might need to modify our system
> >>> internals, and we could find optimization opportunities in them.
> >>>
> >>> One possible approach:
> >>> 1. Take an algorithm/product that is already implemented in
> >>> another system/library.
> >>> 2. Find places where SystemDS can perform better. Look for the
> >>> low-hanging fruit: can we use one of our Python builtins, or a
> >>> combination of them, to achieve similar or better results, and can
> >>> we improve it further?
> >>> 3. We have now identified a candidate for a builtin.
> >>> 4. Repeat the cycle.
> >>>
> >>> Best regards,
> >>> Janardhan
> >>>
> >>> On Tue, Aug 2, 2022 at 2:09 AM Badrul Chowdhury
> >>> <badrulchowdhur...@gmail.com> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> I wanted to start a discussion on building parity of built-in
> >>>> functions with popular OSS libraries. I am thinking of attaining
> >>>> parity as a 3-step process:
> >>>>
> >>>> *Step 1*
> >>>> As far as I can tell from the existing built-in functions,
> >>>> SystemDS aims to offer a hybrid set of APIs for scientific
> >>>> computing and ML (data engineering included) to users. Therefore,
> >>>> the most obvious OSS libraries for comparison would be numpy,
> >>>> sklearn (scipy), and pandas. Apache DataSketches would be another
> >>>> relevant system for specialized use cases (sketches).
> >>>>
> >>>> *Step 2*
> >>>> Once we have established a set of libraries, I would propose that
> >>>> we create a capability matrix with sections for each library,
> >>>> like so:
> >>>>
> >>>> Section 1: numpy
> >>>> f_1
> >>>> f_2
> >>>> [..]
> >>>> f_n
> >>>>
> >>>> Section 2: sklearn
> >>>> [..]
> >>>>
> >>>> The columns could be a checklist like this:
> >>>> f_i -> (DML, Python, CP, SP, RowCol, Row, Col, Federated,
> >>>> documentationPublished)
> >>>>
> >>>> *Step 3*
> >>>> Create JIRA tasks, assign them, and start coding.
> >>>>
> >>>> Thoughts?
> >>>>
> >>>> Thanks,
> >>>> Badrul
> >>>
> >>
> >

--
Cheers,
Badrul