Vaex is a lazy, out-of-core DataFrame library for Python that is used to 
visualize and explore big tabular data at roughly a billion rows per second 
(on a decent computer/laptop). The visualization part of vaex is similar to 
datashader (see https://github.com/apache/incubator-superset/issues/4492), but 
vaex is more general.

Vaex focuses strongly on binned statistics on N-d grids. Instead of a 
groupby, it uses binby, which can be used, for instance, to create a 1d 
histogram (here `ds` is an open vaex DataFrame):
```python
x_counts = ds.count(binby=ds.x, limits=[-10, 10], shape=64)
```
Or a 2d array with means of a column:
```python
z_mean_map = ds.mean(ds.z, binby=[ds.x, ds.y],
                     limits=[[-10, 10], [-20, 20]], shape=(64, 128))
```
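
To make concrete what a binned statistic on a regular grid means, here is a minimal pure-Python sketch of a binned count and a binned 2d mean. It is not vaex code and says nothing about how vaex implements this internally: the helper names `bin_index`, `binned_count` and `binned_mean_2d` are hypothetical, and the half-open bin convention (values outside the limits are dropped, empty bins give NaN) is an assumption made for illustration.

```python
import math


def bin_index(value, lo, hi, n_bins):
    """Map value to a bin index on a regular grid over [lo, hi).

    Returns None for values outside the limits (assumed convention).
    """
    if not (lo <= value < hi):
        return None
    i = int((value - lo) / (hi - lo) * n_bins)
    # Guard against floating-point rounding pushing the index to n_bins.
    return min(i, n_bins - 1)


def binned_count(xs, limits, shape):
    """1d histogram: count how many values fall into each of `shape` bins."""
    lo, hi = limits
    counts = [0] * shape
    for x in xs:
        i = bin_index(x, lo, hi, shape)
        if i is not None:
            counts[i] += 1
    return counts


def binned_mean_2d(zs, xs, ys, limits, shape):
    """Mean of z per cell of an (nx, ny) grid spanned by x and y."""
    (x_lo, x_hi), (y_lo, y_hi) = limits
    nx, ny = shape
    sums = [[0.0] * ny for _ in range(nx)]
    counts = [[0] * ny for _ in range(nx)]
    for z, x, y in zip(zs, xs, ys):
        i = bin_index(x, x_lo, x_hi, nx)
        j = bin_index(y, y_lo, y_hi, ny)
        if i is None or j is None:
            continue  # row falls outside the grid
        sums[i][j] += z
        counts[i][j] += 1
    # Empty cells have no defined mean; mark them as NaN.
    return [
        [sums[i][j] / counts[i][j] if counts[i][j] else math.nan
         for j in range(ny)]
        for i in range(nx)
    ]


# Four bins of width 5 over [-10, 10); the value 42.0 is out of range.
counts = binned_count([-9.5, -9.4, 0.1, 9.9, 42.0], limits=[-10, 10], shape=4)
print(counts)  # [2, 0, 1, 1]
```

The point of the sketch is that each row contributes to exactly one grid cell in a single pass over the data, so the result size depends on the grid shape rather than on the number of rows.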

I thought it would be interesting to see if I could integrate this into 
Superset, hence this PR, which is only a proof of concept.

I managed to get some visualizations up using the New York Taxi dataset 
(https://docs.vaex.io/en/latest/datasets.html), which contains over 1 billion 
rows (although for this test I only used the 2015 data, which contains ~150 
million rows).

I got the table view working:
<img width="1202" alt="screen shot 2018-10-03 at 20 51 05" 
src="https://user-images.githubusercontent.com/1765949/46520723-50fc9d80-c87d-11e8-943b-29ee5ca54a65.png">

Pie charts:
<img width="1854" alt="screen shot 2018-10-03 at 21 04 43" 
src="https://user-images.githubusercontent.com/1765949/46520800-a20c9180-c87d-11e8-8fd2-28650505baa1.png">

And time series:
<img width="1845" alt="screen shot 2018-10-04 at 20 47 10" 
src="https://user-images.githubusercontent.com/1765949/46520822-b18bda80-c87d-11e8-8bdd-1625b2d8a4dc.png">

And I think the most beautiful one is the heatmap:
<img width="1000" alt="screen shot 2018-10-03 at 21 38 41" 
src="https://user-images.githubusercontent.com/1765949/46520859-d41df380-c87d-11e8-8307-437dae3af7a9.png">

Note that producing the data for these visualizations takes just a fraction 
of a second for these ~150 million rows; ~1 billion rows per second is about 
the expected performance (per computer).

I am just putting this out here to judge interest in it, and as an additional 
example for https://github.com/apache/incubator-superset/pull/3492

[ Full content available at: 
https://github.com/apache/incubator-superset/pull/6041 ]