@chemist69: Thanks! And by the way, I forgot to mention the most important
feature from a practical perspective: There already is a simple browser viewer
available via `df.openInBrowser()`. No plotting yet, but at least handy for
quick data inspection.
@jlp765: Handling different types is already possible, as illustrated in the
readme example as well. The main difference compared to dynamically typed APIs
like Pandas is that you have to tell the compiler about your schema once -- but
then you can fully benefit from type safety in the processing pipeline. Let's
say your CSV has columns _name_ (string), _age_ (int), _height_ (float),
_birthday_ (date); then your code would look like this:
```nim
const schema = [
  col(StrCol, "name"),
  col(IntCol, "age"),
  col(FloatCol, "height"),
  col(DateCol, "birthday")  # DateCol not yet implemented, but coming soon
]
let df = DF.fromText("data.csv.gz").map(schemaParser(schema, ','))
```
What happens here is that the `schemaParser` macro builds a parser proc which
takes a string as input and returns a named tuple of type `tuple[name: string,
age: int64, height: float, birthday: SomeDateTimeTypeTBD]` (note that this
allows generating highly customized machine code, which is why the parser can
be much faster than generic parsers). So yes, the data frame only holds a
single type, but that type is heterogeneous, and you can extract the individual
"columns" again via e.g. `df.map(x => x.name)`, giving you a `DataFrame[string]`
instead of the full tuple.
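To illustrate how this composes, here is a small hypothetical pipeline built on the schema above. It is only a sketch: it assumes a `filter` proc exists in the API with the obvious predicate semantics, and reuses the `df` from the snippet above.

```nim
# Sketch only: assumes the `df` from above and a `filter(predicate)` in the API.
let adultNames: DataFrame[string] = df
  .filter(x => x.age >= 18)   # x is the named tuple; x.age is a plain int64
  .map(x => x.name)           # project to one column => DataFrame[string]
```

If you misspelled `x.age`, or compared it against a string, the pipeline would fail to compile instead of failing at runtime on the full data set.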
Having to specify the schema might look tedious from a Pandas perspective. But
the big benefit is that you can never get the column names or types wrong. In
Pandas you see a lot of code that just says `def preprocess_data(df)`, and it
is neither clear what `df` really contains nor what assumptions
`preprocess_data` makes about the data. This can be mitigated by extensive
documentation & testing, but it is still difficult to maintain in big projects.
With a type-safe schema the assumptions about the data become explicit in the
code, and the compiler can ensure that they are satisfied.
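Concretely, the implicit assumptions of a Pandas-style `preprocess_data(df)` turn into an explicit signature. A sketch (the `Person` alias and `filter` are hypothetical names, not part of the current API):

```nim
# Hypothetical alias matching the schema above (birthday omitted for brevity).
type Person = tuple[name: string, age: int64, height: float]

# The signature documents and enforces exactly what the proc expects,
# so a caller can never pass a frame with the wrong columns or types.
proc preprocessData(df: DataFrame[Person]): DataFrame[Person] =
  df.filter(x => x.age >= 0)
```

Any refactoring that changes the schema immediately surfaces as compile errors at every call site, rather than as silent `KeyError`s in production.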
Global aggregation is already available. For instance, you could do `df.map(x =>
x.age).mean()` to get the average age. There are also `reduce`/`fold`, which
allow implementing custom aggregation functions. What's still missing are
`groupBy` and `join`, but they are a high priority for me as well, so I hope I
can add them soon.
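As a sketch of a custom aggregation via `fold` (assuming a `fold(init, op)` signature, which may differ from the actual API):

```nim
# Sketch only: total age via fold, assuming fold(init, op) semantics.
let totalAge = df.map(x => x.age).fold(0'i64, (acc, x) => acc + x)
```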