Hey!

As many of you will be aware by now, I started writing a port of 
[ggplot2](https://ggplot2.tidyverse.org/) sometime in the middle of last year:

[https://github.com/Vindaar/ggplotnim](https://github.com/Vindaar/ggplotnim)

After many, sometimes frantic, sessions working on this, I'm finally approaching 
a first personal milestone: nearly all features I consider essential for a 
plotting library (for my personal use cases!) are (or are about to be) 
implemented. This will mark the release of version `v0.3.0`.

The remaining features I will implement in the next few days are:

  * `geom_density`: to create smooth density estimates of continuous variables 
using kernel density estimation (KDE). I've implemented a naive KDE with 
complexity `O(m x n)` for testing and it works very well (but it's obviously 
very slow). I want to improve that before merging it. If anyone has a good 
resource for a simple-to-implement but reasonably performant KDE algorithm, 
feel free to post it!
  * `geom_ridgeline`: ridgeline plots (or joyplots) are fun and pretty! Should 
be straightforward to implement.
  * re-activate `facet_wrap`: `facet_wrap` has been dormant for a few months 
now, because an internal rewrite broke it at some point. The implementation 
is there, but I need to fix the layout, which is even more broken now than 
before. But that should also be fairly easy.
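
For anyone wondering what "naive" means in the `geom_density` item above, here is a minimal sketch. It is written in Python purely for illustration and is not the `ggplotnim` code. The Gaussian kernel, the fixed bandwidth and the name `naive_kde` are all assumptions; I'm also assuming `m` is the number of points the density is evaluated at and `n` the number of samples.

```python
import math

def naive_kde(samples, grid, bandwidth):
    """Evaluate a Gaussian KDE at every grid point (hypothetical helper).

    Every one of the m grid points visits every one of the n samples,
    hence the O(m x n) cost.
    """
    n = len(samples)
    # Normalization of a Gaussian kernel with standard deviation `bandwidth`.
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    density = []
    for x in grid:              # m evaluation points
        total = 0.0
        for s in samples:       # n samples
            u = (x - s) / bandwidth
            total += math.exp(-0.5 * u * u)
        density.append(norm * total)
    return density

samples = [1.0, 1.5, 2.0, 4.0]
grid = [i * 0.1 for i in range(-30, 81)]   # evaluation points from -3.0 to 8.0
density = naive_kde(samples, grid, bandwidth=0.5)
```

The usual ways to speed this up are to bin the samples first or to cut the kernel off beyond a few bandwidths, which is exactly the kind of thing I'd like pointers on.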



Now, the main reason I'm opening this topic is to ask all of you what I should 
focus on once the above is done.

# Possible things to work on

I have several ideas in mind, but definitely not the time to tackle them all 
at once. They are:

## properly implement the Vega-Lite backend

One of the main goals I had in mind when starting this whole project was to 
provide two different plotting backends. The first is a native target to 
produce static plots locally and quickly.

The second, originally inspired by @mratsim's 
[monocle](https://github.com/numforge/monocle), is a 
[Vega-Lite](https://vega.github.io/vega-lite/) backend to scratch that 
interactive / web based itch, which allows for easy sharing of plots 
**including data**!

I wrote a [proof of 
concept](https://github.com/Vindaar/ggplotnim#experimental-vega-lite-backend) 
and by now I have a pretty good idea (barring a lack of Vega experience) of how 
to implement this.

Essentially the whole processing of the plot as it is done now remains the 
same. This makes it possible to use the whole functionality of `ggplotnim` 
without a lot of duplication. The drawing code will be replaced by a mapping 
to JSON instead.

The major work would be defining said mapping. If I'm lucky I can even 
write it as a 
[ginger backend](https://github.com/Vindaar/ginger/blob/master/src/ginger/backendCairo.nim) 
with a - for Vega pretty obscure - API (`drawPoint`, `drawLine`, etc. 
essentially just adding data to a `JsonNode`). More likely it'll involve 
replacing the 
[drawing portion](https://github.com/Vindaar/ggplotnim/blob/master/src/ggplotnim/ggplot_drawing.nim#L342-L364) 
of `ggplotnim` with Vega-related drawing equivalents.
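
To make the "draw calls just add data to a JSON node" idea a bit more concrete, here is a rough sketch, again in Python for illustration only. The class `VegaLiteBackend` and its methods are hypothetical and not part of `ginger` or `ggplotnim`; the only real things are the Vega-Lite spec fields (`data.values`, `mark`, `encoding`) and the schema URL.

```python
import json

class VegaLiteBackend:
    """Hypothetical backend: draw calls collect data instead of rendering."""

    def __init__(self):
        self.points = []

    def draw_point(self, x, y):
        # Nothing is drawn here; the point is only recorded as a data row.
        self.points.append({"x": x, "y": y})

    def to_spec(self):
        # Assemble a minimal Vega-Lite scatter plot spec from the recorded rows.
        return {
            "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
            "data": {"values": self.points},
            "mark": "point",
            "encoding": {
                "x": {"field": "x", "type": "quantitative"},
                "y": {"field": "y", "type": "quantitative"},
            },
        }

backend = VegaLiteBackend()
for x, y in [(1, 2.0), (2, 3.5), (3, 1.2)]:
    backend.draw_point(x, y)
spec_json = json.dumps(backend.to_spec(), indent=2)
```

The resulting JSON contains the data itself, which is what makes sharing a plot together with its data so easy.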

## improve `DataFrame` performance

The included data frame in `ggplotnim` is - for many operations anyways - 
abysmally slow.

While performance is nice, I mainly wanted something to work with "right now" 
instead of spending a lot of time writing a performant data frame.

The reasons for the poor performance are threefold, as far as I can tell:

  * for some operations the algorithms used are inefficient
  * the underlying data type is a `Value` similar to a `JsonNode`. Conversion 
to and from normal types is slow and operations on `Value` are also slow, since 
there are always case statements involved and at least one indirection to 
access the actual value.
  * each column is a `PersistentVector[Value]`. For most operations this is a 
major performance boost over a `seq[Value]`, since we avoid a large amount of 
copying. However, iterating over long vectors or building long vectors is slow.



One way to improve performance would be to distinguish between a pure column 
of one data type and `Value` columns (which are somewhat similar to `object` 
types in NumPy / pandas if my superficial understanding of those is correct).
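
As a loose analogy for that distinction (in Python, not Nim, and not how `ggplotnim` stores anything), a column could hold either a packed array of one type or a list of boxed values. `Column` and `make_column` are hypothetical names, and the standard library `array` module stands in for a natively typed column.

```python
from array import array
from dataclasses import dataclass
from typing import Union

@dataclass
class Column:
    # array('d') plays the role of a pure float column,
    # a plain list the role of a mixed, `Value`-like column.
    data: Union[array, list]

def make_column(values):
    """Pick the packed representation only if every entry is a float."""
    if all(isinstance(v, float) for v in values):
        return Column(array("d", values))
    return Column(list(values))

pure = make_column([1.0, 2.5, 4.0])    # stored as packed doubles
mixed = make_column([1.0, "a", 3])     # falls back to boxed values
```

Every operation on a column then has to handle both cases, which is where the extra complexity comes from.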

While I'm not certain, I believe that distinction alone would make the code a 
lot more complex and would definitely require heavy use of generics. 
Generics are something I specifically wanted to avoid in the context of a data 
frame, because each time I played around with toy data frames they became a 
headache.

The only idea I have to avoid generics would be to extend `Value` to also have 
a case for vector-like data, similar to `JsonNode`'s `JArray`. That would 
double the number of fields though.

In any case, if I were to seriously attempt to improve performance of the data 
frames, I would stop messing around myself and first do some research into how 
data frames are handled elsewhere.

Again, if anyone is familiar with resources, feel free to share them!

## improve documentation

This is pretty self-explanatory. The main documentation is definitely lacking 
as it is right now.

I hope that the 
[recipes](https://github.com/Vindaar/ggplotnim/blob/master/recipes.org) provide 
those of you who have tried to play around with the library with a reasonable 
alternative for the time being!

## implement more statistics related ggplot2 functionality

There's a lot of functionality in `ggplot2` that makes it a proper R package, 
namely a lot of stats related functionality: simple things like box and violin 
plots, smoothing and error bands and probably a lot more I'm not even aware of, 
since I don't really use that stuff.

If that's something people want, I could definitely work on that.

At least box and violin plots and simple loess smoothing are something I'll 
implement at some point anyways. If there's something else you consider 
essential, let me know.

## write a shared library for use in C / C++

This is a fun idea I had a while back. As far as I'm aware our poor fellows who 
are stuck working with C and C++ don't **really** have a great plotting library 
to work with.

Some of the existing ones are very powerful but hard to use, some produce 
plots that don't look very nice and some others (looking at you, ROOT) bring 
along an oil tanker of dependencies.

Maybe I'm missing something, but as far as I can tell there are **many** people 
who do their calculations in C/C++, dump the data and use Python for plotting.

I'm not sure if people would be interested in such a thing, but Nim being 
awesome would allow for a shared library to essentially make use of all of 
`ggplotnim`'s functionality from C. I wrote a small scatter plot function and 
it worked perfectly.

Maybe this wouldn't work out as well as I think right now, but it feels like 
it might be a great opportunity for Nim to shine.

## add `DateTime` support

Currently dates and times are not supported in `ggplotnim`. I'm being made 
aware of this every day at the moment, thanks to many COVID-19 plots I see 
daily, which have dates on their x axis…

This is definitely a must-have, but I haven't really thought about how I want 
to implement this.

## attempt to allow typesetting text using LaTeX

Once the time comes for me to write my thesis, I will probably want more power 
over the typesetting in the plots, especially to choose arbitrary fonts and 
set equations etc.

One main feature `matplotlib` still has over `ggplotnim` for me is the ability 
to easily put LaTeX onto plots.

This is something I will attempt to implement at some point this year. I don't 
yet know how to best do it, but I have a few ideas.

I could go crazy and write a `tikz` backend for `ginger` I suppose. I'm not 
familiar enough with `tikz` to know how flexible it is, but it seems doable.

Or I could split the non-text parts of the plot and the text into two outputs, 
dump the text into a LaTeX template, compile it and merge the two files.

I'll figure something out.

## something else entirely?

If you have any other ideas or maybe I'm missing something important, feel free 
to let me know. Either post it here, or open an issue on the repository.

# To sum it all up

Sorry for rambling so long. ;)

In general I want to encourage anyone who tries out `ggplotnim` to freely open 
issues on the repository. Please don't think it's strictly for bug reports. If 
you struggle using the library, chances are I'm at fault: either the 
documentation sucks, or you're using it in ways I didn't foresee, which may 
thus be cumbersome, etc. I'll try to help out as best as I can!

Thanks for reading!

I'll use this thread to update you on releases and post changelogs in the 
future. 
