@andrea: You are right, that is probably not yet very clear from the docs. They 
currently rely too much on being familiar with the concept of caching in 
frameworks like Spark (in fact, the API is modelled after Spark's 
[RDDs](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD),
 but for now without the distributed aspect).

The idea is basically to offer both lazy operations and eager operations. If 
your code looks like this
    
    
    let df = DF.fromFile("data.csv").map(schemaParser, ',')
    let averageAge = df.map(x => x.age).mean()
    let maxHeight = df.map(x => x.height).max()
    let longestName = df.map(x => x.name.len).max()
    

the operations are completely lazy, i.e., you would read the file three times 
from scratch, apply the parser each time, and continue with the respective 
operation. You would use this approach when your data is huge and can't fit 
into memory entirely. For small data that would fit into memory, this is 
obviously inefficient.
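To make the lazy behavior concrete, here is a minimal sketch in Python (not the library's actual API): a pipeline object that stores only the recipe, so every terminal operation re-runs the whole chain from the source.

```python
# Hypothetical sketch of a fully lazy pipeline: each terminal
# operation (mean, max, ...) re-reads the source from scratch.
class Lazy:
    def __init__(self, source):
        self._source = source  # zero-argument function producing an iterator

    def map(self, f):
        src = self._source
        # Stores only the recipe; nothing is computed yet.
        return Lazy(lambda: (f(x) for x in src()))

    def mean(self):
        total = count = 0
        for x in self._source():  # full pass over the source
            total += x
            count += 1
        return total / count

    def max(self):
        return max(self._source())  # another full pass

reads = []
def read_file():
    reads.append(1)          # count how often the "file" is scanned
    return iter([3, 1, 2])

df = Lazy(read_file).map(lambda x: x * 10)
avg = df.mean()    # first scan of the source
top = df.max()     # second scan of the source
print(len(reads))  # → 2: one scan per terminal operation
```

This is why, without caching, the three operations in the example above would each read and parse the file again.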

If you write the exact same code, but replace the first line by 
    
    
    let df = DF.fromFile("data.csv").map(schemaParser, ',').cache()
    

the result of the mapping operation would be persisted in memory (using a 
`seq[T]` internally). Thus, all following operations will use the cache, and 
you would have to read and parse the file only once. Caching gives 
fine-grained control over trading memory usage against recomputation: you can 
cache the computation pipeline at any point, as long as you have the required 
memory. 
Typically you would use `cache` after having done some (potentially complex) 
computations which are required for multiple consecutive computations like 
repeated iteration in machine learning algorithms. In some use cases the data 
can also be preprocessed to make it fit into memory, e.g., by filtering to the 
interesting bits of the data, downsampling, or simply projecting the input data 
down to a single column.
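A minimal way to picture `cache()` is the following Python sketch (again not the actual implementation, which persists into a `seq[T]` in Nim): materialize the upstream pipeline's result once, then serve all downstream operations from that buffer.

```python
# Hypothetical sketch of cache(): materialize the pipeline once,
# so downstream operations never touch the original source again.
class Cached:
    def __init__(self, items):
        self._items = list(items)   # analogous to persisting into a seq[T]

    def map(self, f):
        return Cached(f(x) for x in self._items)

    def mean(self):
        return sum(self._items) / len(self._items)

    def max(self):
        return max(self._items)

scans = []
def read_and_parse():
    scans.append(1)                 # count source scans
    return [4, 8, 6]

df = Cached(read_and_parse())       # one read + parse, then in memory
avg = df.mean()
top = df.max()
print(len(scans))  # → 1: both operations hit the cached data
```

The trade-off is explicit: you pay the memory for `self._items` once and avoid recomputing everything upstream of the cache point.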

Another difference from purely lazy libraries is that data frames need 
operations like sort or groupby, which require some form of internal 
persistence. For now I'm using an implementation that falls back on storing 
the entire data in memory, but I hope to add spill-to-disk features later. 
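The reason such operations can't stay lazy is that no group (or sorted prefix) can be emitted before all input has been seen. A small illustration in Python (a hypothetical helper, not the library's API):

```python
# Sketch of why groupby forces persistence: every row must be
# buffered under its key before any complete group can be emitted.
from collections import defaultdict

def group_by(rows, key):
    groups = defaultdict(list)      # the entire input ends up in memory
    for row in rows:
        groups[key(row)].append(row)
    return dict(groups)             # only now are the groups complete

result = group_by([("a", 1), ("b", 2), ("a", 3)], key=lambda r: r[0])
print(result)  # → {'a': [('a', 1), ('a', 3)], 'b': [('b', 2)]}
```

A spill-to-disk variant would replace the in-memory `groups` buffer with partitioned on-disk storage, which is what systems like Spark do when a shuffle exceeds available memory.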
