Hi, I would like to suggest some documentation/usability/code tasks for the 2016 SystemML roadmap. The primary focus of these goals is to lower the barrier to entry to SystemML for these groups: (1) Users without a data science/ML background who want to try SystemML, (2) Data scientists who want to run, modify, and create DML/PyDML scripts, (3) Developers who want to contribute code to the project, and (4) Spark community who want to use the MLContext API or Spark Batch Mode.
Tasks:

* A non-mathematical, practical description of the purpose of each algorithm,
with real-world examples of problems that each algorithm solves.
* Examples showing the conversion of real-world data sets (a Wikipedia
database, images, log files, Twitter messages, etc.) to matrix-based
representations for use in SystemML.
* Working one-line examples of invoking each algorithm on an existing small
data set (the user can copy/paste the single line and it runs). This means
creating working example data files so that the user doesn't need to. These
data files can live in the SystemML project or in another project, or they
can be deployed to a web server so that SystemML can read the data sets
from URLs.
* A DML cookbook to give script writers the DML building blocks they need.
* Bring the DML Language Reference completely up to date.
* Convert the PyDML Language Reference to markdown, make it a clean mirror
of the DML Language Reference, and bring it up to date.
* Document DML algorithm best practices in a programming guide (especially
how to write algorithms that scale efficiently).
* Structure the documentation to more clearly indicate the ways to invoke
SystemML.
* Identify heavily used classes/methods (run the test suite with a profiler)
and ensure these classes/methods have Javadocs and are efficient.
* Create a printMatrix() function so that a user doing prototyping can see a
matrix, or a subset of a matrix, in the console rather than having to write
it to a file and open the file to see the result.
* If a DML function doesn't return a value, don't require an lvalue when
calling the function.
* Clearly document Spark Batch Mode.
* Very thoroughly Javadoc the MLContext API (MLContext and related
classes/methods), since it is a programmatic interface with enormous
potential for the Spark community.
* Address differences in data representations between Spark (RDD/DataFrame)
and SystemML (binary block).
Determine a solution that gives the best performance when working on a large
distributed data set while exploiting the capabilities of both Spark and
SystemML. Is DataFrame-to-binary-block conversion needed, or is it possible
to use a single format and avoid the data conversion cost?
* Enhanced Spark integration, for instance ML Pipeline integration via Java
or Scala algorithm wrappers.
* Ensure the documentation allows a user to download SystemML and run a
'Hello World' DML example and an actual algorithm in 5 minutes or less.
* IDE tools, such as a DML editor that supports code completion.
* Promote SystemML in the user community: (1) activity on mailing lists,
(2) talks at conferences, (3) academic papers, (4) blog posts, (5) posting
information to forums such as Stack Overflow.

Deron

On Mon, Dec 21, 2015 at 3:09 AM, Matthias Boehm <[email protected]> wrote:

> From my perspective, our roadmap for 2016 should cover the following
> SystemML engine extensions with regard to runtime (R), optimizer (O), as
> well as language and tools (L). Each sub-bullet in the following list will
> be further broken down into multiple JIRAs.
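To make a couple of the usability tasks above concrete, here is a rough DML sketch covering the 'Hello World' goal and the proposed printMatrix() function. The printMatrix() name, its argument list, and the use of a function with no return value are all hypothetical, and exact builtin semantics may differ across SystemML versions:

```dml
# 'Hello World': print a message, then exercise a matrix
# operation to confirm the engine runs end to end.
print("Hello World")
X = rand(rows = 3, cols = 3, min = 0, max = 1)
print("sum of X: " + sum(X))

# Hypothetical printMatrix() sketch: print up to maxRows x maxCols
# of matrix M to the console instead of writing it to a file.
printMatrix = function (matrix[double] M, int maxRows, int maxCols) {
    for (i in 1:min(nrow(M), maxRows)) {
        rowStr = ""
        for (j in 1:min(ncol(M), maxCols)) {
            rowStr = rowStr + as.scalar(M[i, j]) + " "
        }
        print(rowStr)
    }
}

# Note: today a call like this still requires an lvalue, which is
# one of the usability issues listed above.
tmp = printMatrix(X, 2, 2)
```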
> R1) Extended Scale-Up Backend
> * Support for large dense matrix blocks >16GB
> * Extended multi-threaded operations (e.g., ctable, aggregate)
> * NUMA-awareness (partitioning and multi-threaded operations)
> * Extended update-in-place support
>
> R2) Generalized Matrix Block Library
> * Investigation of interface design (abstraction)
> * Boolean matrices and operations
> * Different types of sparse matrix blocks
> * Additional physical runtime operators
>
> R3) HW Accelerators / Low-Level Optimizations
> * Exploit GPU BLAS libraries (integration)
> * Custom GPU kernels for complex operator patterns
> * Low-level optimizations (source code gen, compression)
>
> O1) Global Program Optimization
> * Global data flow optimization (rewrites, holistic)
> * Code motion (for cse, program block merge)
> * Advanced loop vectorization (common patterns)
> * Advanced function inlining (inlining multi-block functions)
> * Extended inter-procedural analysis (independent constant propagation)
>
> O2) Cost Model
> * Update memory budgets wrt Spark 1.6 dynamic memory management
> * Extended runtime cost model for Spark (incl lazy evaluation)
> * Extended execution type selection based on FLOPs
>
> O3) Dynamic Rewrites
> * Extended matrix mult chain opt (sparsity, rewrites, ops)
> * Rewrites exploiting additional statistics (e.g., min/max)
>
> O4) Optimizer Support for R2/R3
> * Extended memory estimates for R2/R3
> * Type inference for matrix operations
> * Extended cost model and operator selection
>
> L1) Extended Spark Interfaces
> * Hardening MLContext (config, lazy eval, cleanup)
> * Extended Spark ML wrappers for all algorithms
> * Investigation of embedded DSL with sufficient optimization scope
>
> L2) New/Extended Builtin Functions
> * Second-order functions (apply), incl optimizer/runtime support
> * Generalization of existing functions from vectors to matrices
> * Additional builtin functions (e.g., var, sd, rev, rep, sign, etc.)
>
> L3) Extended Dev Tools
> * Extended statistics output (e.g., wrt Spark lazy evaluation)
> * Extended benchmarking (data generators, test suites, etc.)
>
> Once we create the individual JIRAs, we should also include a list of new
> algorithms as well as additional documentation guides.
>
> Regards,
> Matthias
>
> From: Luciano Resende <[email protected]>
> To: [email protected]
> Date: 11/20/2015 01:50 AM
> Subject: [DISCUSS] Project Roadmap
>
> Now that we are done with our 0.8.0 (non-apache) Release, and have most of
> our infrastructure in place at Apache, I would like to start some
> discussion around what are some high-level items you see we could be
> working on in the short/medium term, and start building a roadmap, so new
> contributors can easily find areas of their interest to start contributing.
>
> Let's have items listed on this thread, and once we have our JIRA
> available, we can start updating it there.
>
> Thanks
>
> --
> Luciano Resende
> http://people.apache.org/~lresende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/
