Hi everyone,

The following are some suggestions. I'd love to help work on them. Sorry for
the length.

- Zaki Mughal

---

I announced adding data frames for PDL several months back
<http://comments.gmane.org/gmane.comp.lang.perl.pdl.general/8335> and my
intention to embed R in Perl. Embedding R in Perl is actually complete now and
just about ready for CPAN 
<https://github.com/zmughal/embedding-r-in-perl-experiment>
thanks to the help of the gang on #inline 
<http://inline.ouistreet.com/node/zfp7.html>.

In order to build the data frames and match R types, I created several
subclasses of PDL that handle a subset of PDL functions, but I haven't figured
out a way to wrap all of PDL's functionality systematically. I have several
thoughts on this.

## Levels of measurement

  One of the nice things R does is warn or give an error when you try to
  perform an operation that would be invalid on a certain type of data. One
  such type is categorical data, which R calls factors and for which I made a
  subclass of PDL called PDL::Factor. Some of this behaviour is inspired by
  the statistical methodology of levels of measurement
  <https://en.wikipedia.org/wiki/Level_of_measurement>. I believe SAS even
  explicitly allows assigning levels of measurement to variables.

  For example, if I try to apply the `mean()` function to all the columns of
  the Iris data set, I get this warning:

  ```r
  lapply( iris, mean )
  #> $Sepal.Length
  #> [1] 5.843333
  #>
  #> $Sepal.Width
  #> [1] 3.057333
  #>
  #> $Petal.Length
  #> [1] 3.758
  #>
  #> $Petal.Width
  #> [1] 1.199333
  #>
  #> $Species
  #> [1] NA
  #>
  #> Warning message:
  #> In mean.default(X[[5L]], ...) :
  #>   argument is not numeric or logical: returning NA
  ```

  `NA` is R's equivalent of `BAD` values. For `mean()` this makes sense for
  categorical data. For logical vectors, it does something else:

  ```r
  which_setosa <- iris$Species == 'setosa' # this is a logical
  mean( which_setosa )
  #> [1] 0.3333333
  ```

  This means 1/3 of the logical values were true, which may be a useful
  result for `mean()` to return in that case.
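
  Here is a minimal sketch of how that might look on the Perl side
  (hypothetical code, not the actual PDL::Factor implementation):

  ```perl
  # a categorical type that refuses numeric aggregation, mirroring R
  package My::Factor;
  use strict;
  use warnings;

  sub new {
      my ($class, @labels) = @_;
      my %seen;
      my @levels = grep { !$seen{$_}++ } @labels;   # unique levels, like R
      return bless { levels => \@levels, labels => \@labels }, $class;
  }

  sub mean {
      warn "argument is not numeric or logical: returning BAD\n";
      return;   # stand-in for a BAD (NA) result
  }

  package main;
  my $species = My::Factor->new(
      ('setosa') x 50, ('versicolor') x 50, ('virginica') x 50 );
  my $m = $species->mean;   # warns, just as R's mean() does on a factor
  ```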

  Thinking in terms of levels of measurement can help with another experiment
  I'm doing, which is based around tracking the units of measure attached to
  numerical things in Perl. Code is here
  <https://github.com/zmughal/units-experiment/blob/master/overload_override.pl>.

  What I do there is use Moo roles to add a unit attribute to numerical types
  (Perl scalars, Number::Fraction, PDL, etc.), so that whenever they go
  through an operation, via either operator overloading or a function call
  such as `sum()`, the unit is carried along and manipulated appropriately
  (e.g. ratios of temperatures are meaningful in Kelvin, but not in degrees
  Celsius). I know that units of measure are messy to implement, but being
  able to support auxiliary operations like this would go a long way toward
  making PDL flexible.
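
  A condensed, stand-alone sketch of that approach (simplified relative to
  the linked experiment):

  ```perl
  package WithUnit;
  use Moo::Role;
  has unit => ( is => 'ro', default => '' );

  package Quantity;
  use Moo;
  use overload
      '+'  => \&_add,
      '""' => sub { $_[0]->value . ' ' . $_[0]->unit };
  with 'WithUnit';
  has value => ( is => 'ro', required => 1 );

  sub _add {
      my ($self, $other) = @_;
      die "unit mismatch: " . $self->unit . " vs " . $other->unit
          unless $self->unit eq $other->unit;  # addition needs matching units
      return Quantity->new(
          value => $self->value + $other->value,
          unit  => $self->unit,
      );
  }

  package main;
  my $sum = Quantity->new( value => 273, unit => 'K' )
          + Quantity->new( value => 42,  unit => 'K' );
  print "$sum\n";   # prints "315 K"
  ```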

  [Has anyone used udunits2? I made an Alien package for it. It's on CPAN.]

## DataShape and Blaze

  I think it would be beneficial to look at the work being done by the Blaze
  project <http://blaze.pydata.org/> with its DataShape specification
  <http://datashape.pydata.org/>. The idea behind it is to be able to use the
  various array-like APIs without having to worry about what is going on in
  the backend, be it CPU-based, GPU-based, SciDB, or even a SQL server.
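
  For a flavour of the notation, a DataShape type describes dimensions and
  element type together. As far as I understand the spec, a fixed 5x3 array
  of doubles and a variable-length table of records look something like:

  ```
  5 * 3 * float64
  var * { name : string, amount : int32 }
  ```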

## Julia

  Julia has been doing some amazing things with how they've grown their
  language. I was looking to see if they have anything similar to the
  dataflow in PDL and I came across ArrayViews
  <https://github.com/JuliaLang/ArrayViews.jl>. It may be enlightening to see
  how they compose this feature onto already existing n-d arrays, as opposed
  to how PDL does it.

  I do not know what tradeoffs that brings, but it is a starting point to think
  about. I think similar approaches can be made to support sparse arrays.
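
  For reference, this is the sort of dataflow PDL already has, where a slice
  acts as a live view onto its parent's data:

  ```perl
  use PDL;

  my $parent = sequence(5);            # [0 1 2 3 4]
  my $view   = $parent->slice('1:3');

  $view .= 0;                          # write through the view...

  print $parent, "\n";                 # [0 0 0 0 4] -- the parent sees it
  ```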

  In fact, one of Julia's strengths is how they use multimethods to handle new
  types with ease. See "The Design Impact of Multiple Dispatch" 
<http://nbviewer.ipython.org/gist/StefanKarpinski/b8fe9dbb36c1427b9f22>
  for examples. [Perl 6 has built-in multimethods.]
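
  For instance, a rough sketch in Perl 6, where dispatch considers the
  runtime types of all arguments:

  ```perl6
  # multi subs dispatch on the runtime types of all arguments
  multi sub combine(Numeric $a, Numeric $b) { $a + $b }
  multi sub combine(Str     $a, Str     $b) { $a ~ $b }

  say combine(1, 2);       # 3
  say combine('a', 'b');   # ab
  ```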

## MATLAB subclassing

  I use MATLAB daily. I came across this area of the documentation that talks
  about how to subclass. 
<http://www.mathworks.com/help/matlab/matlab_oop/subclassing-matlab-built-in-classes.html>

  Some of the information in there is good for knowing how *not* to implement
  things, but there is also some discussion on what is necessary for the
  storage types that might be worth looking at.

  [By the way, I have downloaded all of the code on MATLAB Central's File
  Exchange, and I could do some analysis on the functions used there if that
  would be helpful.]

## GPU and threading

  I think it would be best to offload GPU support to other libraries, so it
  would be good to extract what is common between libraries like

  - MAGMA <http://icl.cs.utk.edu/magma/>,
  - ViennaCL <http://viennacl.sourceforge.net/>,
  - Blaze-lib <https://code.google.com/p/blaze-lib/>,
  - VXL <http://vxl.sourceforge.net/>,
  - Spark <http://spark.apache.org/>,
  - Torch <http://torch.ch/>,
  - Theano <http://www.deeplearning.net/software/theano/>,
  - Eigen <http://eigen.tuxfamily.org/>, and
  - Armadillo <http://arma.sourceforge.net/>.

  Eigen is interesting in particular because it supports storing data in both
  row-major and column-major order
  <http://eigen.tuxfamily.org/dox-devel/group__TopicStorageOrders.html>.
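
  That is, the same 2x3 matrix can be laid out in memory in two ways:

  ```
  matrix:                [ 1 2 3 ]
                         [ 4 5 6 ]

  row-major memory:      1 2 3 4 5 6
  column-major memory:   1 4 2 5 3 6
  ```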

  Another source of inspiration would be the VSIPL spec 
<http://www.omgwiki.org/hpec/vsipl>.
  It's a standard made for signal processing routines in the embedded DSP world
  and comes with "Core" and "Core Lite" profiles which might help decide what
  should be included in a smaller subset of PDL.

  Also in my wishlist is interoperability with libraries like ITK 
<http://www.itk.org/>,
  VTK <http://www.vtk.org/>, and yt <http://yt-project.org/>. They have
  interesting architectures especially for computation. Unfortunately, the
  first two are C++ based and I don't have experience with combining C++ and XS.

## Better testing

  PDL should make more guarantees about how types flow through the system. This
  might be accomplished by adding assertions in the style of Design-by-Contract
  which can act as both a testable spec and documentation. I'm working on the
  test suite right now on a branch and I hope to create a proof-of-concept of
  this soon.

  I hope that this can help make PDL more consistent and easily testable.
  There are still small inconsistencies that shouldn't be there, which can be
  weeded out with testing. For example, what type is expected from this code?

  ```perl
  use PDL;
  print stretcher( sequence(float, 3) )->type;
  ```

  I would expect 'float', but it is actually 'double' under PDL v2.007_04.
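
  As a rough sketch of the kind of contract I have in mind (a hypothetical
  helper, not an existing PDL or testing API), the inconsistency above would
  be caught by an assertion like this:

  ```perl
  use PDL;
  use Test::More;

  # contract: the operation must preserve the input's PDL type
  sub preserves_type (&$) {
      my ($op, $input) = @_;
      my $output = $op->($input);
      is( "" . $output->type, "" . $input->type,
          "operation preserves type " . $input->type );
      return $output;
  }

  # fails under PDL v2.007_04: stretcher() promotes float to double
  preserves_type { stretcher( $_[0] ) } sequence(float, 3);

  done_testing;
  ```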

## Incremental computation

  I find that the way I grow my code is to slowly add modules that work
  together in a pipeline. Running and rerunning this code through all the
  modules is slow. To avoid that, I create multiple small programs that read
  and write intermediate files to pass data from one script to the next. I
  was looking for a solution and came across IncPy
  <http://www.pgbovine.net/incpy.html>, which modifies the Python interpreter
  to support automatic persistent memoization. I don't think the idea has
  caught on, but I think it should, and perhaps Perl and PDL are flexible
  enough to herald it as a CPAN module.
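
  Persistent memoization is easy enough to do explicitly in Perl (IncPy's
  contribution is making it automatic at the interpreter level); a minimal
  hand-rolled sketch:

  ```perl
  use strict;
  use warnings;
  use Storable qw(store retrieve);

  # wrap a coderef so its results are cached in a file across runs
  sub persistent_memoize {
      my ($cache_file, $code) = @_;
      my $cache = -e $cache_file ? retrieve($cache_file) : {};
      return sub {
          my $key = join "\0", @_;
          unless ( exists $cache->{$key} ) {
              $cache->{$key} = $code->(@_);
              store( $cache, $cache_file );   # persist for the next run
          }
          return $cache->{$key};
      };
  }

  my $slow_step = persistent_memoize( 'step1.cache', sub {
      my ($input) = @_;
      # ... expensive pipeline stage here ...
      return "processed($input)";
  } );

  print $slow_step->('data.csv'), "\n";  # computed once, cached afterwards
  ```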

