So I did a thing today… (which is why I haven't answered yet).

This morning I took another look at a rewrite of the `DataFrame` using an 
arraymancer backend. Turns out by rethinking a bunch of things and especially 
the current implementation of the `FormulaNode`, I managed to come up with a 
seemingly working solution.

This is super WIP and I've only implemented `mutate`, `transmute` and `select` 
so far, but first results are promising.

Essentially the `FormulaNode` from before is now compiled into a closure, which 
returns a full column.

So the following formula:
    
    
    f{"xSquared" ~ "x" * "x"}
    

will assume that each string is a column of a data frame and create the 
following closure:
    
    
    proc(df: DataFrame): Column =
      var
        colx_47075074 = toTensor(df["x"], float)
        colx_47075075 = toTensor(df["x"], float)
        res_47075076 = newTensor[float](df.len)
      for idx in 0 ..< df.len:
        []=(res_47075076, idx, colx_47075075[idx] * colx_47075074[idx])
      result = toColumn res_47075076
    

The data types of the input columns and of the result are currently chosen by 
heuristics based on what appears in the formula. E.g. if math operators 
appear the type is `float`, if boolean operators appear it is `bool`, etc.
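
As a rough illustration of that heuristic idea (not the actual implementation: the operator lists and the `guessKind` name are made up for this sketch, and letting boolean operators win over math operators is purely an assumption):

```nim
type
  ColKind = enum
    colFloat, colInt, colBool, colString, colObject

# Hypothetical operator lists; the real code may recognise different ones.
const
  mathOps = ["+", "-", "*", "/"]
  boolOps = ["and", "or", "not", "==", "<", ">", "<=", ">="]

proc guessKind(ops: openArray[string]): ColKind =
  ## Picks a result kind from the operators seen in a formula.
  var sawMath, sawBool = false
  for op in ops:
    if op in mathOps: sawMath = true
    elif op in boolOps: sawBool = true
  # Assumption: any boolean/comparison operator makes the result bool,
  # otherwise math operators make it float; anything else falls back
  # to the generic object column.
  if sawBool: colBool
  elif sawMath: colFloat
  else: colObject

echo guessKind(["*"])       # operators of "x" * "x"
echo guessKind(["*", ">"])  # operators of "x" * "x" > 2.0
```

The real version works on the formula's AST at compile time rather than on a list of strings, so take this only as a picture of the decision logic.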

The data frame now looks like:
    
    
    DataFrame* = object
      len*: int
      data*: Table[string, Column]
      case kind: DataFrameKind
      of dfGrouped:
        # a grouped data frame stores the keys of the groups and maps them to
        # a set of the categories
        groupMap: OrderedTable[string, HashSet[Value]]
      else: discard
    

where a `Column` is:
    
    
    Column* = object
      case kind*: ColKind
      of colFloat: fCol*: Tensor[float]
      of colInt: iCol*: Tensor[int]
      of colBool: bCol*: Tensor[bool]
      of colString: sCol*: Tensor[string]
      of colObject: oCol*: Tensor[Value]
    

`colObject` is the fallback for columns that contain more than one data type.
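
To show how these pieces might fit together, here is a minimal stdlib-only sketch. It uses `seq` instead of `Tensor` so it runs without arraymancer, and the `Formula` type, `squareOf`, `toFloatSeq` and this `mutate` are simplified stand-ins invented for the example, not the real API:

```nim
import std/tables

type
  ColKind = enum
    colFloat, colBool
  Column = object
    case kind: ColKind
    of colFloat: fCol: seq[float]
    of colBool: bCol: seq[bool]
  DataFrame = object
    len: int
    data: Table[string, Column]
  # Hypothetical: target column name plus the closure producing the column.
  Formula = object
    name: string
    fn: proc (df: DataFrame): Column {.closure.}

proc toColumn(s: seq[float]): Column =
  Column(kind: colFloat, fCol: s)

proc toFloatSeq(c: Column): seq[float] =
  ## Only float columns are handled in this sketch.
  doAssert c.kind == colFloat
  c.fCol

proc squareOf(name, col: string): Formula =
  ## Hand-written stand-in for what the formula macro would generate.
  let fn = proc (df: DataFrame): Column =
    let x = df.data[col].toFloatSeq
    var res = newSeq[float](df.len)
    for idx in 0 ..< df.len:
      res[idx] = x[idx] * x[idx]
    toColumn(res)
  Formula(name: name, fn: fn)

proc mutate(df: DataFrame, f: Formula): DataFrame =
  ## Returns a copy of `df` with the computed column added or replaced.
  result = df
  result.data[f.name] = f.fn(df)

var df = DataFrame(len: 3, data: initTable[string, Column]())
df.data["x"] = toColumn(@[1.0, 2.0, 3.0])
df = df.mutate(squareOf("xSquared", "x"))
echo df.data["xSquared"].fCol  # @[1.0, 4.0, 9.0]
```

The point is just that `mutate` no longer has to interpret anything per element: it calls the closure once and stores the whole column it gets back.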

So far I only wrote a super simple for loop to get a rough idea of how 
fast/slow this might be:
    
    
    import arraymancer_backend
    import seqmath, sequtils, times
    #import ggplotnim # for comparison with current implementation
    
    proc main(df: DataFrame, num: int) =
      var df = df # mutable local copy, since proc parameters are immutable
      let t0 = cpuTime()
      for i in 0 ..< num:
        df = df.mutate(f{"xSquared" ~ "x" * "x"})
      let t1 = cpuTime()
      echo "Took ", t1 - t0, " for ", num, " iter"
    
    proc rawTensor(df: DataFrame, num: int) =
      var t = newTensor[float](df.len)
      let xT = df["x"].toTensor(float)
      let t0 = cpuTime()
      for i in 0 ..< num:
        for j in 0 ..< df.len:
          t[j] = xT[j] * xT[j]
      let t1 = cpuTime()
      echo "Took ", t1 - t0, " for ", num, " iter"
    
    when isMainModule:
      const num = 1_000_000
      let x = linspace(0.0, 2.0, 1000)
      let y = x.mapIt(0.12 + it * it * 0.3 + 2.2 * it * it * it)
      var df = seqsToDf(x, y)
      main(df, num)
      rawTensor(df, num)
    

This gives us, for the new DF:

  * `Took 9.570060132 for 1000000 iter`

and for the raw arraymancer tensor:

  * `Took 1.034196647 for 1000000 iter` (so still some crazy overhead!)

The old DF, by contrast, took 23.3 seconds for only 100,000 iterations! The 
new code needs roughly 0.96 seconds for the same 100,000 iterations, so the 
old code is about a factor of 24 slower.

A probably really bad comparison with pandas:
    
    
    import time
    import numpy as np
    import pandas as pd
    x = np.linspace(0.0, 2.0, 1000)
    y = (0.12 + x * x * 0.3 + 2.2 * x * x * x)
    
    df = pd.DataFrame({"x" : x, "y" : y})
    def call():
        t0 = time.time()
        num = 100000
        for i in range(num):
            df.assign(xSquared = df["x"] * df["x"])
        t1 = time.time()
        print(f"Took {t1 - t0} for {num:,} iterations")
    call()
    

`Took 60.24467134475708 for 100,000 iterations`

I suppose using `assign` and accessing the columns like this is probably super 
inefficient in pandas?

And an (also not very good) comparison with `NimData`:
    
    
    import nimdata
    
    import seqmath, sequtils, times, sugar
    
    proc main =
      let x = linspace(0.0, 2.0, 1000)
      let y = x.mapIt(0.12 + it * it * 0.3 + 2.2 * it * it * it)
      var df = DF.fromSeq(zip(x, y))
      df.take(5).show()
      echo df.count()
      
      const num = 1_000_000
      let t0 = cpuTime()
      for i in 0 ..< num:
        df = df.map(x => (x[0], x[0] * x[0])).cache()
      let t1 = cpuTime()
      echo "Took ", t1 - t0, " for ", num, " iter"
    
    when isMainModule:
      main()
    

`Took 16.322826325 for 1,000,000 iter`

I'm definitely not saying the new code is faster than NimData or pandas, but 
it's definitely promising!

I'll see where this takes me. I think though I managed to implement the main 
things I was worried about. The rest should just be tedious work.

Will keep you all posted. 
