Hi All,

Following up on this thread: I have created a PR with the basic template
for the comparison here: https://github.com/apache/systemds/pull/1735
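For anyone who has not opened the PR yet, the template is roughly of the
following shape. The rows below are illustrative placeholders only, and
"docs" abbreviates the documentationPublished flag from the earlier
proposal; the actual entries live in the PR:

    Section 1: numpy

    builtin | DML | Python | CP | SP | RowCol | Row | Col | Federated | docs
    --------+-----+--------+----+----+--------+-----+-----+-----------+-----
    f_1     |  ?  |   ?    | ?  | ?  |   ?    |  ?  |  ?  |     ?     |  ?
    [..]    |     |        |    |    |        |     |     |           |
    f_n     |  ?  |   ?    | ?  | ?  |   ?    |  ?  |  ?  |     ?     |  ?

    Section 2: sklearn
    [..]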
Please feel free to comment on the outline for the survey or suggest
ideas. I can start filling in the details of the actual comparison once
we agree on the template.

Thanks,
Badrul

On Fri, 5 Aug 2022 at 23:55, Badrul Chowdhury <[email protected]> wrote:

> Thanks for sharing your thoughts, Matthias! Ack, I will create the PR
> in the main repo.
>
> Thanks,
> Badrul
>
> On Fri, 5 Aug 2022 at 11:42, Matthias Boehm <[email protected]> wrote:
>
>> Thanks for driving this discussion, Badrul. In general, I think it's
>> a great idea to do an assessment of coverage as a basis for
>> discussions regarding further development, API consistency, and
>> improved documentation. At the algorithm level you will encounter
>> subtle differences due to different algorithmic choices,
>> implementation details, and related parameters. Here we should not
>> make ourselves dependent on existing libraries but make case-by-case
>> decisions, balancing various constraints against the benefits of API
>> similarity.
>>
>> By default, we stick to the names of R builtin functions from
>> selected packages (e.g., Matrix, stats, algorithms) and to R's
>> indexing semantics (e.g., copy-on-write, 1-based indexing), but we
>> should look more broadly (numpy, pandas) for missing functionality
>> at the DSL and API level. The overall vision of SystemDS is to build
>> up a hierarchy of builtin functions for the entire data science
>> lifecycle (data preparation, cleaning, training, scoring, debugging)
>> while still being able to compile hybrid runtime plans for local
>> CPU/GPU, distributed, and federated backends.
>>
>> Let's do this assessment in the main GitHub repo (e.g., as markdown
>> files in docs) before we put anything on the main website, as we
>> need to distinguish the assessment from actual documentation.
>> Thanks.
>>
>> Regards,
>> Matthias
>>
>> On 8/5/2022 8:23 PM, Badrul Chowdhury wrote:
>> > Thank you both for your thoughtful comments. Agreed: we should not
>> > force parity; rather, we should make sure that SystemDS built-in
>> > functions "cover" important use cases. I will start with an audit
>> > of SystemDS's existing capabilities and create a PR on
>> > systemds-website <https://github.com/apache/systemds-website> with
>> > my findings. This would also be a good way to identify gaps in the
>> > documentation for existing builtins so we can update it.
>> >
>> > Thanks,
>> > Badrul
>> >
>> > On Tue, 2 Aug 2022 at 06:12, arnab phani <[email protected]> wrote:
>> >
>> >> In my understanding, parity matters if 1) frameworks share a
>> >> similar user base and use cases (sklearn, pandas, etc.), or 2)
>> >> one framework shares APIs with another (dask, modin, pandas).
>> >> Otherwise, forcing parity can be counterproductive. During our
>> >> work on feature transformations, we have seen major differences
>> >> in supported feature transformations, user APIs, and
>> >> configurations among ML systems. For instance, TensorFlow tunes
>> >> its APIs based on the expected use cases (neural networks) and
>> >> data characteristics (text, images), while sklearn targets
>> >> traditional ML jobs. Moreover, some API changes are required to
>> >> be able to use certain underlying optimizations. Having said
>> >> that, it is definitely important to support popular builtins;
>> >> however, I don't think it is necessary to use the same names,
>> >> APIs, and flags. I liked the idea of writing our documentation in
>> >> a way that helps new users draw similarities with popular
>> >> libraries. A capability matrix that maps builtins from other
>> >> systems to ours can be helpful.
>> >>
>> >> Regards,
>> >> Arnab..
>> >>
>> >> On Tue, Aug 2, 2022 at 6:16 AM Janardhan <[email protected]> wrote:
>> >>
>> >>> Hi Badrul,
>> >>>
>> >>> Adding to this discussion, I think we can start with what we
>> >>> already have implemented. We do not need to implement every last
>> >>> function; we can choose a use-case-based approach for best
>> >>> results. I would start with the present status of the builtins
>> >>> (they are enough for a lot of use cases!) and then implement the
>> >>> rest one by one based on priority. Most of our builtin functions
>> >>> other than ML (including the NN library) are inspired by the R
>> >>> language.
>> >>>
>> >>> During implementation/testing, we might need to modify our
>> >>> system internals or find optimization opportunities in them.
>> >>>
>> >>> One possible approach:
>> >>> 1. Take an algorithm/product that is already implemented in
>> >>> another system/library.
>> >>> 2. Find places where SystemDS can perform better. Look for the
>> >>> low-hanging fruit: can we use one of our Python builtins, or a
>> >>> combination of them, to achieve similar or better results, and
>> >>> can we improve further?
>> >>> 3. This gives us a candidate for a new builtin.
>> >>> 4. Repeat the cycle.
>> >>>
>> >>> Best regards,
>> >>> Janardhan
>> >>>
>> >>> On Tue, Aug 2, 2022 at 2:09 AM Badrul Chowdhury
>> >>> <[email protected]> wrote:
>> >>>>
>> >>>> Hi,
>> >>>>
>> >>>> I wanted to start a discussion on building parity of built-in
>> >>>> functions with popular OSS libraries. I am thinking of
>> >>>> attaining parity as a 3-step process:
>> >>>>
>> >>>> *Step 1*
>> >>>> As far as I can tell from the existing built-in functions,
>> >>>> SystemDS aims to offer users a hybrid set of APIs for
>> >>>> scientific computing and ML (data engineering included).
>> >>>> Therefore, the most obvious OSS libraries for comparison are
>> >>>> numpy, sklearn (scipy), and pandas. Apache DataSketches would
>> >>>> be another relevant system for specialized use cases
>> >>>> (sketches).
>> >>>>
>> >>>> *Step 2*
>> >>>> Once we have established a set of libraries, I propose that we
>> >>>> create a capability matrix with sections for each library, like
>> >>>> so:
>> >>>>
>> >>>> Section 1: numpy
>> >>>>   f_1
>> >>>>   f_2
>> >>>>   [..]
>> >>>>   f_n
>> >>>>
>> >>>> Section 2: sklearn
>> >>>>   [..]
>> >>>>
>> >>>> The columns could be a checklist like this:
>> >>>> f_i -> (DML, Python, CP, SP, RowCol, Row, Col, Federated,
>> >>>> documentationPublished)
>> >>>>
>> >>>> *Step 3*
>> >>>> Create JIRA tasks, assign them, and start coding.
>> >>>>
>> >>>> Thoughts?
>> >>>>
>> >>>> Thanks,
>> >>>> Badrul
>
> --
> Cheers,
> Badrul

--
Cheers,
Badrul
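P.S. To make the per-builtin coverage check concrete, below is a minimal
sketch of what such a comparison test could look like, using the
SystemDS Python API (assuming the systemds pip package; sum() is just an
example op, and the tolerance is a placeholder):

    import numpy as np
    from systemds.context import SystemDSContext

    X = np.random.rand(100, 10)

    # Baseline result from the reference library (numpy).
    expected = X.sum()

    # Equivalent SystemDS operation via the Python API; operations
    # build a lazy DAG that is compiled and executed on compute().
    with SystemDSContext() as sds:
        actual = sds.from_numpy(X).sum().compute()

    # Coverage check: same semantics up to numerical tolerance.
    assert abs(expected - actual) < 1e-9

A cell in the capability matrix could then simply record whether such a
test exists and passes for the given builtin and backend.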
