@chemist69: Thanks! And by the way, I forgot to mention the most important
feature from a practical perspective: There already is a simple browser viewer
available via `df.openInBrowser()`. No plotting yet, but at least handy for
quick data inspection.
@jlp765: Handling different types is already possible, as illustrated in the
readme example as well. The main difference compared to dynamically typed APIs
like Pandas is that you have to tell the compiler about your schema once -- but
then you can fully benefit from type safety in the processing pipeline. Let's
say your CSV has columns _name_ (string), _age_ (int), _height_ (float),
_birthday_ (date); then your code would look like this:
```nim
const schema = [
  col(StrCol, "name"),
  col(IntCol, "age"),
  col(FloatCol, "height"),
  col(DateCol, "birthday")  # DateCol not yet implemented, but coming soon
]
let df = DF.fromText("data.csv.gz").map(schemaParser(schema, ','))
```
What happens here is that the `schemaParser` macro builds a parser proc which
takes a string as input and returns a named tuple of type `tuple[name: string,
age: int64, height: float, birthday: SomeDateTimeTypeTBD]` (note that this
allows generating highly customized machine code, which is why the parser can
be much faster than generic parsers). So yes, the data frame only holds a
single type, but that type is heterogeneous, and you can extract the individual
"columns" again via e.g. `df.map(x => x.name)`, giving you a `DataFrame[string]`
instead of the full tuple.
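To illustrate how this composes, here is a small hypothetical pipeline built on the schema above. It is only a sketch: it assumes a `filter` proc exists in the API with the obvious predicate semantics, and reuses the `df` from the snippet above.

```nim
# Sketch only: assumes the `df` from above and a `filter(predicate)` in the API.
let adultNames: DataFrame[string] = df
  .filter(x => x.age >= 18)   # x is the named tuple; x.age is a plain int64
  .map(x => x.name)           # project to one column => DataFrame[string]
```

If you misspelled `x.age`, or compared it against a string, the pipeline would fail to compile instead of failing at runtime on the full data set.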
Having to specify the schema might look tedious from a Pandas perspective. But
the big benefit is that you can never get the column names or types wrong. In
Pandas you see a lot of code that just says `def preprocess_data(df)`, and it
is neither clear what `df` really contains nor what assumptions
`preprocess_data` makes about the data. This can be mitigated by extensive
documentation & testing, but it is still difficult to maintain in big projects.
With a type-safe schema the assumptions about the data become explicit in the
code, and the compiler can ensure that they are satisfied.
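Concretely, the implicit assumptions of a Pandas-style `preprocess_data(df)` turn into an explicit signature. A sketch (the `Person` alias and `filter` are hypothetical names, not part of the current API):

```nim
# Hypothetical alias matching the schema above (birthday omitted for brevity).
type Person = tuple[name: string, age: int64, height: float]

# The signature documents and enforces exactly what the proc expects,
# so a caller can never pass a frame with the wrong columns or types.
proc preprocessData(df: DataFrame[Person]): DataFrame[Person] =
  df.filter(x => x.age >= 0)
```

Any refactoring that changes the schema immediately surfaces as compile errors at every call site, rather than as silent `KeyError`s in production.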
Global aggregation is already available. For instance, you could do `df.map(x =>
x.age).mean()` to get the average age. There are also `reduce`/`fold`, which
allow implementing custom aggregation functions. What's still missing are
`groupBy` and `join`, but they are a high priority for me as well, so I hope I
can add them soon.
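As a sketch of a custom aggregation via `fold` (assuming a `fold(init, op)` signature, which may differ from the actual API):

```nim
# Sketch only: total age via fold, assuming fold(init, op) semantics.
let totalAge = df.map(x => x.age).fold(0'i64, (acc, x) => acc + x)
```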