Hey All,

There are actually a number of issues that come up in building an effective 
dataframe-like library for Haskell, and in data vis as well (both of which I 
have some strong personal opinions on, and which I'm exploring / experimenting 
with this spring). While folks have touched on a bunch of these, I thought I'd 
add my own opinions to the mix.

First of all: any good data manipulation (i.e. data-frame-like) library needs 
support for efficiently querying subsets of the data in various ways. Not just 
that, it really should provide a coherent way of dealing with out-of-core data! 
From there you might want to ask: "do I want to iterate through chunks of the 
data?" or "do I want to allow more general patterns of data access, and perhaps 
even ways to parallelize?". The basic point (as others have remarked since this 
draft email got underway) is that you essentially want to support some SQL-like 
selection operations, have them be efficient, and have them play nicely with 
columns of differing types.
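To make the SQL-like selection point concrete, here's a minimal sketch (all 
names hypothetical, columns stored tuple-of-arrays style) of a "select ... 
where" that filters on one column while keeping every column row-aligned:

```haskell
-- Toy two-column frame stored column-wise.
data Frame = Frame
  { city :: [String]
  , pop  :: [Int]
  } deriving Show

-- WHERE pop satisfies p: build a Bool mask from the predicate column
-- and apply the same mask to every column so rows stay aligned.
selectWhere :: (Int -> Bool) -> Frame -> Frame
selectWhere p (Frame cs ps) =
  let mask    = map p ps
      keep xs = [x | (x, True) <- zip xs mask]
  in  Frame (keep cs) (keep ps)
```

For example, `selectWhere (> 1) (Frame ["a","b","c"] [1,2,3])` keeps only the 
rows "b" and "c". A real library would of course abstract over the set of 
columns rather than hard-coding two fields, which is exactly where the fancy 
types come in.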

What sort of abstractions you provide is somewhat crucial, because that in 
turn affects how you can write algorithms! If you look closely, this is 
tantamount to saying that any sufficiently well-designed (industrial-grade) 
data frame lib for Haskell might wind up leading to a model for supporting 
MapReduce- or GraphLab-style (http://graphlab.org/) algorithms in the 
multicore / non-distributed regime, though a first version would pragmatically 
just provide an interface to sequentially chunked data and use pipes-core or 
one of the other enumerator libraries. There's also some need for the 
aforementioned fancy types for managing data, but that's not even the real 
challenge (in my opinion). Probably the best lib to take ideas from is 
Python's pandas library, or at least that's my personal opinion.
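The sequentially-chunked interface I mean is basically a strict left fold over 
chunks: the algorithm author writes an ordinary (state, chunk) step and never 
sees where the chunks come from. A sketch (pure lists stand in for a real 
pipes/enumerator source feeding data from disk):

```haskell
import Data.List (foldl')

-- Stand-in chunk source; a real library would stream these from disk.
chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf n xs = take n xs : chunksOf n (drop n xs)

-- The out-of-core driver: just a strict left fold over chunks,
-- so only one chunk ever needs to be resident.
foldChunks :: (s -> [a] -> s) -> s -> [[a]] -> s
foldChunks = foldl'

-- Example client algorithm: running sum-and-count for a streaming mean.
meanStep :: (Double, Int) -> [Double] -> (Double, Int)
meanStep (s, n) chunk = (s + sum chunk, n + length chunk)
```

Running `foldChunks meanStep (0, 0) (chunksOf 2 [1,2,3,4])` yields `(10, 4)`, 
i.e. a mean of 2.5, and the client code would be unchanged if the chunks were 
streamed from a file instead of a list.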

Now in the space of data vis, probably the best example of a good library in 
terms of ease of getting informative (and pretty) output is ggplot2 (also in 
R). If you look there, you'll see that it's VERY much integrated with the 
model-fitting and data-analysis functionality of R, and has a very 
compositional approach which could be ported pretty directly over to 
Haskell.
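By "compositional" I mean ggplot2's layered grammar, where a plot is built by 
adding layers with `+`. In Haskell that maps naturally onto a Monoid; here's a 
deliberately tiny sketch where strings stand in for real geometry layers:

```haskell
-- Hypothetical sketch: a plot is just an ordered list of layers,
-- combined with (<>), mirroring ggplot2's `+`.
newtype Plot = Plot { layers :: [String] } deriving Show

instance Semigroup Plot where
  Plot a <> Plot b = Plot (a ++ b)

instance Monoid Plot where
  mempty = Plot []

-- Stand-ins for real geoms like geom_point / geom_smooth.
points, smooth :: Plot
points = Plot ["geom_point"]
smooth = Plot ["geom_smooth"]
```

So `points <> smooth` plays the role of `geom_point() + geom_smooth()` in R; a 
real library would carry aesthetics, scales, and data in each layer rather 
than bare names.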
However, as with a good data-frame-like library, certain obstacles come up. If 
we insist on a type-safe way to do things while staying at least as high-level 
as R or Python, the absence of row types for frame column names makes 
specifying linear models that are statically well-formed (as in: only 
referencing column names that are actually in the underlying data frame) a bit 
tricky. While there are approaches that work some of the time, there's not 
really a good general-purpose way (as far as I can tell) to solve that small 
problem of resolving names as early as possible. Or at the very least, I 
don't see a simple approach that I'm happy with.
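To show the flavor of the approaches that do work some of the time: with 
type-level strings you can tag each column with its name and make column 
lookup a class constraint, so referencing a missing column is a type error. A 
sketch (all names hypothetical, and this is exactly the kind of machinery that 
gets awkward at scale):

```haskell
{-# LANGUAGE DataKinds, GADTs, TypeOperators, KindSignatures,
             MultiParamTypeClasses, FlexibleInstances #-}
import Data.Proxy (Proxy (..))
import GHC.TypeLits (Symbol)

-- A column carries its name at the type level.
data Col (name :: Symbol) a = Col [a]

-- A frame is a heterogeneous list of columns, indexed by its schema.
data Frame (cols :: [*]) where
  Nil  :: Frame '[]
  (:&) :: Col name a -> Frame cols -> Frame (Col name a ': cols)
infixr 5 :&

-- "Has name a cols" holds only if the schema contains that column,
-- so a model can reference "x" only when the frame actually has it.
class Has (name :: Symbol) a (cols :: [*]) where
  col :: Proxy name -> Frame cols -> [a]

instance {-# OVERLAPPING #-} Has name a (Col name a ': cols) where
  col _ (Col xs :& _) = xs

instance Has name a cols => Has name a (c ': cols) where
  col p (_ :& rest) = col p rest

-- Example frame with columns "x" and "y".
frame :: Frame '[Col "x" Double, Col "y" Double]
frame = Col [1, 2, 3] :& Col [4, 5, 6] :& Nil
```

Here `col (Proxy :: Proxy "x") frame` typechecks, while asking for a column 
"z" is rejected at compile time. The pain point is that everything downstream 
(model formulas, joins, projections) now has to thread these schema indices 
around, which is where I don't yet see an approach I'm happy with.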

These can be summarized, I think, as follows:
1. Any "practical" data frame lib needs to interact well with out-of-core 
data, and ideally also simplify the task of writing algorithms on top in a way 
that gives out-of-core goodness for free. There are a lot of ways this could 
be done under the covers, perhaps using one of the libraries like reducers, 
enumerator, or pipes-core, but it really should be invisible to the client 
algorithm author, or at least invisible by default. Moreover, I think any 
attack in that direction is essentially a precursor to sorting out MapReduce- 
and GraphLab-like tools for Haskell.
2. Any really nice high-level data vis tool needs some data analysis / machine 
learning style library to work with, and this is probably best understood by 
looking at things already out there, such as ggplot2 in R.

That said, I'm all ears for other folks' takes on this, especially since I'm 
spending some time this spring experimenting in both these directions.

cheers
-Carter

On Sun, Mar 25, 2012 at 9:54 AM, Aleksey Khudyakov <[email protected]> wrote:
> On 25.03.2012 14:52, Tom Doris wrote:
> > Hi Heinrich,
> > 
> > If we compare the GHCi experience with R or IPython, leaving aside any
> > GUIs, the help system they have at the repl level is just a lot more
> > intuitive and easy to use, and you get access to the full manual
> > entries. For example, compare what you see if you type :info sort into
> > GHCi versus ?sort in R. R gives you a view of the full docs for the
> > function, whereas in GHCi you just get the type signature.
> > 
> Integrating haddock documentation into GHCi would be really helpful but it's 
> a GSoC project on its own.
> 
> For me the most important difference between R's repl and GHCi is that :reload 
> wipes all local bindings. Effectively it forces you to write everything in a 
> file and to avoid doing anything that couldn't fit into a one-liner. It may 
> not be bad but it's definitely a different style.
> 
> And of course data visualization. The only library I know of is Chart[1] but I 
> don't like its API much.
> 
> I think talking about data frames is a bit pointless unless we specify what 
> a data frame is. Basically there are two representations of a tabular data 
> structure: an array of tuples or a tuple of arrays. If you want the first, go 
> for Data.Vector.Vector YourData. If you want the second you'll probably end 
> up with some HList-like data structure to hold the arrays.
> 
> 
> 
> [1] http://hackage.haskell.org/package/Chart
> 
> 
> _______________________________________________
> Haskell-Cafe mailing list
> [email protected]
> http://www.haskell.org/mailman/listinfo/haskell-cafe

