Thanks for sharing your thoughts, Matthias! Ack, I will create the PR in the main repo.
Thanks,
Badrul

On Fri, 5 Aug 2022 at 11:42, Matthias Boehm <mboe...@gmail.com> wrote:

> thanks for driving this discussion Badrul. In general, I think it's a
> great idea to do an assessment of coverage as a basis for discussions
> regarding further development, API consistency, and improved
> documentation. At the algorithm level, you will encounter subtle
> differences due to different algorithmic choices, implementation
> details, and related parameters. Here we should not make ourselves
> dependent on existing libraries but make case-by-case decisions,
> balancing various constraints with the benefits of API similarity.
>
> By default, we stick to the names of R builtin functions from selected
> packages (e.g., Matrix, stats, algorithms) and their indexing semantics
> (e.g., copy on write, 1-based indexing), but we should look more
> broadly (numpy, pandas) for missing functionality at the DSL and API
> level. The overall vision of SystemDS is to build up a hierarchy of
> builtin functions for the entire data science lifecycle (data
> preparation, cleaning, training, scoring, debugging) while still being
> able to compile hybrid runtime plans for local CPU/GPU, distributed,
> and federated backends.
>
> Let's do this assessment in the main GitHub repo (e.g., as markdown
> files in docs) before we put anything on the main website, as we need
> to distinguish the assessment from actual documentation. Thanks.
>
> Regards,
> Matthias
>
> On 8/5/2022 8:23 PM, Badrul Chowdhury wrote:
> > Thank you both for your thoughtful comments. Agreed: we should not
> > force parity; rather, we should make sure that SystemDS built-in
> > functions "cover" important use cases. I will start with an audit of
> > SystemDS's existing capabilities and create a PR on systemds-website
> > <https://github.com/apache/systemds-website> with my findings. This
> > would also be a good way to identify gaps in the documentation for
> > existing builtins so we can update it.
> >
> > Thanks,
> > Badrul
> >
> > On Tue, 2 Aug 2022 at 06:12, arnab phani <phaniar...@gmail.com> wrote:
> >
> >> In my understanding, parity matters if 1) frameworks share a similar
> >> user base and use cases (sklearn, pandas, etc.), or 2) one framework
> >> shares APIs with another (dask, modin, pandas). Otherwise, forcing
> >> parity can be counterproductive. During our work on feature
> >> transformations, we have seen major differences in supported feature
> >> transformations, user APIs, and configurations among ML systems. For
> >> instance, TensorFlow tunes its APIs based on the expected use cases
> >> (neural networks) and data characteristics (text, images), while
> >> sklearn aims for traditional ML jobs. Moreover, some API changes are
> >> required to be able to use certain underlying optimizations.
> >> Having said that, it is definitely important to support popular
> >> builtins; however, I don't think it is necessary to use the same
> >> names, APIs, and flags. I liked the idea of writing our
> >> documentation in a way that helps new users draw similarities with
> >> popular libraries. A capability matrix that maps builtins from other
> >> systems to ours can be helpful.
> >>
> >> Regards,
> >> Arnab.
> >>
> >> On Tue, Aug 2, 2022 at 6:16 AM Janardhan <janard...@apache.org> wrote:
> >>
> >>> Hi Badrul,
> >>>
> >>> Adding to this discussion, I think we can start with what we
> >>> already have implemented. We do not need to implement every last
> >>> function; we can choose a use-case-based approach for best results.
> >>> I would start with the present status of the builtins (they are
> >>> enough for a lot of use cases!) and then implement new ones one by
> >>> one based on priority. Most of our builtin functions other than ML
> >>> (including the NN library) are inspired by the R language.
> >>>
> >>> During implementation/testing, we might need to modify our system
> >>> internals, and we could find optimization opportunities in them.
> >>>
> >>> One possible approach:
> >>> 1. Take an algorithm/product that is already implemented in
> >>> another system/library.
> >>> 2. Find places where SystemDS can perform better. Look for the
> >>> low-hanging fruit: can we use one of our Python builtins, or a
> >>> combination of them, to achieve similar or better results, and can
> >>> we improve it further?
> >>> 3. We have now identified a candidate for a builtin.
> >>> 4. Repeat the cycle.
> >>>
> >>> Best regards,
> >>> Janardhan
> >>>
> >>> On Tue, Aug 2, 2022 at 2:09 AM Badrul Chowdhury
> >>> <badrulchowdhur...@gmail.com> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> I wanted to start a discussion on building parity of built-in
> >>>> functions with popular OSS libraries. I am thinking of attaining
> >>>> parity as a 3-step process:
> >>>>
> >>>> *Step 1*
> >>>> As far as I can tell from the existing built-in functions,
> >>>> SystemDS aims to offer a hybrid set of APIs for scientific
> >>>> computing and ML (data engineering included) to users. Therefore,
> >>>> the most obvious OSS libraries for comparison would be numpy,
> >>>> sklearn (scipy), and pandas. Apache DataSketches would be another
> >>>> relevant system for specialized use cases (sketches).
> >>>>
> >>>> *Step 2*
> >>>> Once we have established a set of libraries, I would propose that
> >>>> we create a capability matrix with sections for each library,
> >>>> like so:
> >>>>
> >>>> Section 1: numpy
> >>>> f_1
> >>>> f_2
> >>>> [..]
> >>>> f_n
> >>>>
> >>>> Section 2: sklearn
> >>>> [..]
> >>>>
> >>>> The columns could be a checklist like this:
> >>>> f_i -> (DML, Python, CP, SP, RowCol, Row, Col, Federated,
> >>>> documentationPublished)
> >>>>
> >>>> *Step 3*
> >>>> Create JIRA tasks, assign them, and start coding.
> >>>>
> >>>> Thoughts?
> >>>>
> >>>> Thanks,
> >>>> Badrul
> >>>
> >>
> >

--
Cheers,
Badrul