Hey! As many of you will be aware by now, I started to write a port of [ggplot2](https://ggplot2.tidyverse.org/) some time mid last year:
[https://github.com/Vindaar/ggplotnim](https://github.com/Vindaar/ggplotnim)

After many sometimes frantic sessions working on this, I'm finally approaching a first personal milestone: essentially all features I consider essential for a plotting library (for my personal use cases!) are (or are about to be) implemented. This will mark the release of version `v0.3.0`.

The remaining features I will implement in the next few days are:

* `geom_density`: to create smooth density estimates of continuous variables using kernel density estimation (KDE). I've implemented a naive KDE with complexity `O(m x n)` for testing and it works very well (but it's obviously very slow). I want to improve that before merging it. If anyone has a good resource for a simple to implement but reasonably performant KDE implementation / algorithm, feel free to post it!
* `geom_ridgeline`: ridgeline plots (or joyplots) are fun and pretty! Should be straightforward to implement.
* re-activate `facet_wrap`: `facet_wrap` has been dormant for a few months now, because an internal rewrite broke it at some point. The implementation is there, but I need to fix the layouting, which is even more broken now than before. But that should also be fairly easy.

Now, the main reason I opened this topic is to ask all of you what I should focus on once the above is done.

# Possible things to work on

There are several ideas I have in mind, but definitely not the time to tackle them all at the same time. They are:

## properly implement the Vega-Lite backend

One of the main goals I had in mind when starting this whole project was to provide two different plotting backends. One is a native target to produce plots locally, fast and statically. The other, originally inspired by @mratsim's [monocle](https://github.com/numforge/monocle), is a [Vega-Lite](https://vega.github.io/vega-lite/) backend to scratch that interactive / web-based itch, which allows for easy sharing of plots **including data**!
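
To make the "including data" part concrete, here is a minimal sketch of the kind of JSON such a backend could emit. It is not the actual `ggplotnim` implementation: `scatterSpec` is a hypothetical helper, and the schema version in the URL is just an assumption. The point is only that a Vega-Lite spec can carry its data inline under `data.values`, so the plot and its data travel together in one file.

```nim
import std/json

# Hypothetical helper, NOT part of ggplotnim: builds a Vega-Lite scatter
# plot spec with the data embedded inline under `data.values`.
proc scatterSpec(xs, ys: seq[float]): JsonNode =
  doAssert xs.len == ys.len, "x and y must have the same length"
  # One JSON object per data point, e.g. {"x": 1.0, "y": 2.0}.
  var rows = newJArray()
  for i in 0 ..< xs.len:
    let row = %* {"x": xs[i], "y": ys[i]}
    rows.add(row)
  result = %* {
    # Assumed schema version; use whichever Vega-Lite version you target.
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "data": {"values": rows},
    "mark": "point",
    "encoding": {
      "x": {"field": "x", "type": "quantitative"},
      "y": {"field": "y", "type": "quantitative"}
    }
  }

when isMainModule:
  # Prints a spec that can be pasted into the Vega-Lite online editor.
  echo scatterSpec(@[1.0, 2.0, 3.0], @[2.0, 4.0, 8.0]).pretty
```

Because the result is a plain `JsonNode`, sharing a plot amounts to sharing the serialized string.
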
I wrote a [proof of concept](https://github.com/Vindaar/ggplotnim#experimental-vega-lite-backend) and by now I have a pretty good idea (barring a lack of Vega experience) of how to implement this. Essentially the whole processing of the plot as it is done now remains the same. This makes it possible to use the whole functionality of `ggplotnim` without a lot of duplication. The drawing code will be replaced by a mapping to JSON instead. The major work would be defining said mapping. If I'm lucky I can even write it as a [ginger backend](https://github.com/Vindaar/ginger/blob/master/src/ginger/backendCairo.nim) with an API that is pretty obscure for Vega (`drawPoint`, `drawLine`, etc., essentially just adding data to a `JsonNode`). More likely it'll involve replacing the [drawing portion](https://github.com/Vindaar/ggplotnim/blob/master/src/ggplotnim/ggplot_drawing.nim#L342-L364) of `ggplotnim` with Vega related drawing equivalents.

## improve `DataFrame` performance

The included data frame in `ggplotnim` is, for many operations anyway, abysmally slow. While performance is nice, I mainly wanted something to work with "right now" instead of spending a lot of time writing a performant data frame. The reasons for the poor performance are three-fold, as far as I can tell:

* for some operations the algorithms used are inefficient
* the underlying data type is a `Value`, similar to a `JsonNode`. Conversion to and from normal types is slow and operations on `Value` are also slow, since there are always case statements involved and at least one indirection to access the actual value.
* each column is a `PersistentVector[Value]`. For most operations this is a major performance boost over a `seq[Value]`, since we avoid a large amount of copying. However, iterating over long vectors or building long vectors is slow.

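
The second point can be illustrated with a small sketch. The `Value` type below is a simplified, hypothetical stand-in (the real `Value` in `ggplotnim` has different fields and more cases), and `asFloat`, `sumBoxed` and `sumTyped` are made-up names. It only shows why every element access on a variant column goes through a `case` branch, while a plain `seq[float]` does not.

```nim
# Simplified, hypothetical stand-in for a `Value`-like variant type.
# The real type in ggplotnim differs; this is for illustration only.
type
  ValueKind = enum
    vkInt, vkFloat, vkString
  Value = object
    case kind: ValueKind
    of vkInt: iv: int
    of vkFloat: fv: float
    of vkString: sv: string

proc asFloat(v: Value): float =
  # Every access has to branch on the kind before reading the payload.
  case v.kind
  of vkInt: result = float(v.iv)
  of vkFloat: result = v.fv
  of vkString: raise newException(ValueError, "not a number: " & v.sv)

proc sumBoxed(col: seq[Value]): float =
  # Variant column: one kind check (and conversion) per element.
  for v in col:
    result += asFloat(v)

proc sumTyped(col: seq[float]): float =
  # Homogeneous column: a tight loop over raw floats, no branching.
  for x in col:
    result += x

when isMainModule:
  let boxed = @[Value(kind: vkInt, iv: 1), Value(kind: vkFloat, fv: 2.5)]
  echo sumBoxed(boxed)
  echo sumTyped(@[1.0, 2.5])
```

Both procs compute the same sum; the difference is only in how much work is done per element.
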

One way to improve performance would be to distinguish between a pure column of one data type and `Value` columns (which are somewhat similar to `object` types in numpy / pandas, if my superficial understanding of those is correct). While I'm not certain, I believe that distinction alone would make the code a lot more complex and would definitely require heavy use of generics. Generics are something I specifically wanted to avoid in the context of a data frame, because each time I played around with toy data frames this became a headache. The only idea I have to avoid generics would be to extend `Value` to also have a case for vector-like data, similar to `JsonNode`'s `JArray`. That would double the number of fields though.

In any case, if I were to seriously attempt to improve the performance of the data frames, I would stop messing around myself and first do some research into how data frames are handled elsewhere. Again, if anyone is familiar with resources, feel free to share them!

## improve documentation

This is pretty self-explanatory. The main documentation is definitely lacking as it is right now. I hope that the [recipes](https://github.com/Vindaar/ggplotnim/blob/master/recipes.org) provide those of you who have tried to play around with the library with a reasonable alternative for the time being!

## implement more statistics related ggplot2 functionality

There's a lot of functionality in `ggplot2` that makes it a proper R package, namely a lot of stats related functionality. Simple things like box and violin plots, smoothing and error bands, and probably a lot more I'm not even aware of, since I don't really use that stuff. If that's something people want, I could definitely work on that. At least box and violin plots and simple loess smoothing are things I'll implement at some point anyway. If there's something else you consider essential, let me know.

## write a shared library for use in C / C++

This is a fun idea I had a while back.
As far as I'm aware our poor fellows who are stuck working with C and C++ don't **really** have a great plotting library to work with. Some of the existing ones are very powerful but hard to use, some produce not very nice looking plots and some others (looking at you, ROOT) bring along an oil tanker of dependencies. Maybe I'm missing something, but as far as I can tell there are **many** people who do their calculations in C/C++, dump the data and use Python for plotting. I'm not sure if people would be interested in such a thing, but Nim being awesome would allow a shared library to make use of essentially all of `ggplotnim`'s functionality from C. I wrote a small scatter plot function and it worked perfectly. Maybe this wouldn't work out as well as I think right now, but it feels like a great opportunity for Nim to shine.

## add `DateTime` support

Currently dates and times are not supported in `ggplotnim`. I'm being made aware of this every day at the moment, thanks to the many COVID-19 plots I see daily, which have dates on their x axis… This is definitely a must-have, but I haven't really thought about how I want to implement it.

## attempt to allow typesetting text using LaTeX

Once the time comes for me to write my thesis, I will probably want more power over the typesetting on the plots, especially to choose arbitrary fonts, set equations etc. One main feature `matplotlib` still has over `ggplotnim` for me is the ability to easily put LaTeX onto plots. This is something I will attempt to implement at some point this year. I don't yet know how best to do it, but I have a few ideas. I could go crazy and write a `tikz` backend for `ginger`, I suppose. I'm not familiar enough with `tikz` to know how flexible it is, but it seems doable. Or I could split the non-text part of the plot and the text-based stuff into two outputs, dump the text into a LaTeX template, compile it and merge the two files. I'll figure something out.

## something else entirely?
If you have any other ideas or think I'm missing something important, feel free to let me know. Either post it here or open an issue on the repository.

# To sum it all up

Sorry for rambling so long. ;)

In general I want to encourage anyone who tries out `ggplotnim` to open issues on the repository freely. Please don't think it's strictly for bug reports. If you struggle using the library, chances are I'm at fault: either the documentation sucks, or you're using it in ways I didn't foresee, which may thus be cumbersome, etc. I'll try to help out as best I can!

Thanks for reading! I'll use this thread to update you on releases and post changelogs in the future.
