So I did a thing today… (which is why I haven't answered yet).
This morning I took another look at a rewrite of the `DataFrame` using an
arraymancer backend. It turns out that by rethinking a bunch of things, especially
the current implementation of the `FormulaNode`, I managed to come up with a
seemingly working solution.
This is super WIP and I've only implemented `mutate`, `transmute` and `select`
so far, but the first results are promising.
Essentially the `FormulaNode` from before is now compiled into a closure, which
returns a full column.
So the following formula:
```nim
f{"xSquared" ~ "x" * "x"}
```
will assume that each string is a column of a data frame and create the
following closure:
```nim
proc(df: DataFrame): Column =
  var
    colx_47075074 = toTensor(df["x"], float)
    colx_47075075 = toTensor(df["x"], float)
    res_47075076 = newTensor[float](df.len)
  for idx in 0 ..< df.len:
    `[]=`(res_47075076, idx, colx_47075075[idx] * colx_47075074[idx])
  result = toColumn res_47075076
```
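For illustration, here is a dependency-free toy of how such a compiled closure could plug into `mutate`. This is an assumption sketch, not the actual WIP code: the real `FormulaNode` fields differ, and `seq[float]` stands in for an arraymancer `Tensor`:

```nim
import tables

type
  Column = seq[float]   # stand-in for the real variant Column
  DataFrame = object
    len: int
    data: Table[string, Column]
  FormulaNode = object
    name: string                       # name of the resulting column
    fn: proc(df: DataFrame): Column    # the compiled closure

proc mutate(df: DataFrame, f: FormulaNode): DataFrame =
  ## evaluate the closure once and store the full resulting column
  result = df
  result.data[f.name] = f.fn(df)

# hand-written equivalent of what `f{"xSquared" ~ "x" * "x"}` would generate
let fml = FormulaNode(
  name: "xSquared",
  fn: proc(df: DataFrame): Column =
    let x = df.data["x"]
    result = newSeq[float](df.len)
    for idx in 0 ..< df.len:
      result[idx] = x[idx] * x[idx]
)

var df = DataFrame(len: 3, data: {"x": @[1.0, 2.0, 3.0]}.toTable)
df = df.mutate(fml)
echo df.data["xSquared"]   # @[1.0, 4.0, 9.0]
```

The point is that the closure is built once at compile time and then just evaluated per call, instead of interpreting the formula tree for every row.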
The data types of the columns and of the result are currently determined by
heuristics, based on what appears in the formula: e.g. if math operators
appear it's float, if boolean operators appear it's bool, etc.
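A toy version of that heuristic, purely illustrative (the real WIP code presumably walks the formula's AST rather than a flat operator list, and the operator sets here are my assumption):

```nim
import sequtils

type ColKind = enum colFloat, colBool, colObject

proc guessKind(ops: seq[string]): ColKind =
  ## boolean/comparison operators win, since comparing floats still
  ## yields a bool column; math operators imply float; otherwise
  ## fall back to the object column
  if ops.anyIt(it in ["==", "<", "<=", ">", ">=", "and", "or", "not"]):
    result = colBool
  elif ops.anyIt(it in ["+", "-", "*", "/"]):
    result = colFloat
  else:
    result = colObject

echo guessKind(@["*", "*"])   # colFloat, as for the xSquared formula
echo guessKind(@["*", "<"])   # colBool
```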
The data frame now looks like:
```nim
DataFrame* = object
  len*: int
  data*: Table[string, Column]
  case kind: DataFrameKind
  of dfGrouped:
    # a grouped data frame stores the keys of the groups and maps them to
    # a set of the categories
    groupMap: OrderedTable[string, HashSet[Value]]
  else: discard
```
where a `Column` is:
```nim
Column* = object
  case kind*: ColKind
  of colFloat: fCol*: Tensor[float]
  of colInt: iCol*: Tensor[int]
  of colBool: bCol*: Tensor[bool]
  of colString: sCol*: Tensor[string]
  of colObject: oCol*: Tensor[Value]
```
`colObject` is the fallback for columns that contain more than one data type.
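As a toy illustration of what dispatching on such a variant object looks like (again with `seq` standing in for arraymancer's `Tensor`, and `string` standing in for the real `Value` type):

```nim
type
  ColKind = enum colFloat, colInt, colObject
  Column = object
    case kind: ColKind
    of colFloat: fCol: seq[float]
    of colInt: iCol: seq[int]
    of colObject: oCol: seq[string]   # stand-in for Tensor[Value]

proc len(c: Column): int =
  ## each operation dispatches on the kind once, then works on the
  ## homogeneous data directly
  case c.kind
  of colFloat: result = c.fCol.len
  of colInt: result = c.iCol.len
  of colObject: result = c.oCol.len

let c = Column(kind: colFloat, fCol: @[1.0, 2.0, 3.0])
echo c.len   # 3
```

The upside over storing everything as `Value` is that a homogeneous column pays the dispatch cost once per operation instead of once per element.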
To get a rough idea of how fast/slow this might be, I wrote a super simple
benchmark loop:
```nim
import arraymancer_backend
import seqmath, sequtils, times
#import ggplotnim # for comparison with current implementation

proc main(df: DataFrame, num: int) =
  var df = df # shadow the argument, since we reassign it below
  let t0 = cpuTime()
  for i in 0 ..< num:
    df = df.mutate(f{"xSquared" ~ "x" * "x"})
  let t1 = cpuTime()
  echo "Took ", t1 - t0, " for ", num, " iter"

proc rawTensor(df: DataFrame, num: int) =
  var t = newTensor[float](df.len)
  let xT = df["x"].toTensor(float)
  let t0 = cpuTime()
  for i in 0 ..< num:
    for j in 0 ..< df.len:
      t[j] = xT[j] * xT[j]
  let t1 = cpuTime()
  echo "Took ", t1 - t0, " for ", num, " iter"

when isMainModule:
  const num = 1_000_000
  let x = linspace(0.0, 2.0, 1000)
  let y = x.mapIt(0.12 + it * it * 0.3 + 2.2 * it * it * it)
  let df = seqsToDf(x, y)
  main(df, num)
  rawTensor(df, num)
```
This gives, for the new DF:

* `Took 9.570060132 for 1000000 iter`

and for the raw arraymancer tensor:

* `Took 1.034196647 for 1000000 iter` (so still some crazy overhead!)
The old DF, on the other hand, took 23.3 seconds for only 100,000 iterations!
That makes it about a factor of 23 slower than the new code.
Probably a really bad comparison with pandas:
```python
import time
import numpy as np
import pandas as pd

x = np.linspace(0.0, 2.0, 1000)
y = (0.12 + x * x * 0.3 + 2.2 * x * x * x)
df = pd.DataFrame({"x" : x, "y" : y})

def call():
    t0 = time.time()
    num = 100000
    for i in range(num):
        df.assign(xSquared = df["x"] * df["x"])
    t1 = time.time()
    print("Took ", (t1 - t0), " for ", num, " iterations")

call()
```
`Took 60.24467134475708 for 100,000 iterations`. I suppose using `assign` and
accessing the columns like this is probably super inefficient in pandas?
And an (also not very good) comparison with `NimData`:
```nim
import nimdata
import seqmath, sequtils, times, sugar

proc main =
  let x = linspace(0.0, 2.0, 1000)
  let y = x.mapIt(0.12 + it * it * 0.3 + 2.2 * it * it * it)
  var df = DF.fromSeq(zip(x, y))
  df.take(5).show()
  echo df.count()
  const num = 1_000_000
  let t0 = cpuTime()
  for i in 0 ..< num:
    df = df.map(x => (x[0], x[0] * x[0])).cache()
  let t1 = cpuTime()
  echo "Took ", t1 - t0, " for ", num, " iter"

when isMainModule:
  main()
```
`Took 16.322826325 for 1,000,000 iter`
I'm definitely not saying the new code is faster than NimData or pandas, but
it's definitely promising!
I'll see where this takes me. I think I've managed to implement the main
things I was worried about, though; the rest should just be tedious work.
Will keep you all posted.