Re: dataframe implementations

2015-12-03 Thread Jay Norwood via Digitalmars-d-learn

On Saturday, 21 November 2015 at 14:16:26 UTC, Laeeth Isharc wrote:


Not sure it is a great idea to use a variant as the basic 
option when very often you will know that every cell in a 
particular column will be of the same type.



I'm reading today about an n-dim extension to pandas named xray.
Maybe I should try to understand how that fits.  They support I/O
from netCDF, and are adding extensions for blocked input using
dask, so they can process data larger than memory.


http://xray.readthedocs.org/en/stable/data-structures.html
https://www.continuum.io/content/xray-dask-out-core-labeled-arrays-python


In general, pandas and xray are designed around the requirement
of pulling in data from storage with initially unknown column and
index names and data types.  Julia adds JIT compilation and
specialized operations for different data types.


It seems to me that D's strength would be in a quick compile, 
which would then allow you to replace the dictionary tag 
implementations and variants with something that used compile 
time symbol names and data types. Seems like that would provide 
more efficient processing, as well as better tab completion 
support when creating expressions.




Re: dataframe implementations

2015-11-21 Thread Laeeth Isharc via Digitalmars-d-learn

On Thursday, 19 November 2015 at 22:14:01 UTC, ZombineDev wrote:
On Thursday, 19 November 2015 at 06:33:06 UTC, Jay Norwood 
wrote:

On Wednesday, 18 November 2015 at 22:46:01 UTC, jmh530 wrote:
My sense is that any data frame implementation should try to 
build on the work that's being done with n-dimensional slices.


I've been watching that development, but I don't have a feel 
for where it could be applied in this case, since it appears 
to be focused on multi-dimensional slices of the same data 
type, slicing up a single range.


The dataframes often consist of different data types by column.

How did you see the nd slices being used?

Maybe the nd slices could be applied if you considered each
row to be the same structure, and slice by rows rather than
operating on columns.  Pandas supports a multi-dimension panel.
Maybe this would be the application for nd slices by row.


How about using an nd slice of Variant(s) [1], or a more
specialized Algebraic type?


[1]: http://dlang.org/phobos/std_variant


Not sure it is a great idea to use a variant as the basic option 
when very often you will know that every cell in a particular 
column will be of the same type.
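
The point about column-homogeneous types can be illustrated with a minimal sketch. It is written in Python with the standard `array` module purely as a stand-in (it is not D and not the d_dataframes code); the column names and values are made up for illustration. Each column stores one fixed element type, so only the name-to-column mapping is heterogeneous and no per-cell variant is needed.

```python
from array import array

# Each column is a homogeneous, compactly stored vector; only the frame
# itself (column name -> column) mixes types. Names and values are made up.
frame = {
    "close": array("d", [101.5, 102.0, 99.75]),  # 64-bit floats
    "volume": array("q", [1200, 950, 1430]),     # 64-bit signed integers
}


def column_mean(col):
    """Mean of a numeric column; no per-cell type checks are needed."""
    return sum(col) / len(col)


def try_append(col, value):
    """Append value if it fits the column's element type; report success."""
    try:
        col.append(value)
        return True
    except TypeError:
        return False
```

A cell of the wrong type is rejected when it is appended, rather than being discovered later during a computation, which is the runtime analogue of what a statically typed column would give at compile time.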




Re: dataframe implementations

2015-11-20 Thread jmh530 via Digitalmars-d-learn

On Thursday, 19 November 2015 at 06:33:06 UTC, Jay Norwood wrote:


Maybe the nd slices could be applied if you considered each row
to be the same structure, and slice by rows rather than
operating on columns.  Pandas supports a multi-dimension panel.
Maybe this would be the application for nd slices by row.


I meant in the sense that Pandas is built upon Numpy.


Re: dataframe implementations

2015-11-19 Thread John Colvin via Digitalmars-d-learn

On Thursday, 19 November 2015 at 06:33:06 UTC, Jay Norwood wrote:

On Wednesday, 18 November 2015 at 22:46:01 UTC, jmh530 wrote:
My sense is that any data frame implementation should try to 
build on the work that's being done with n-dimensional slices.


I've been watching that development, but I don't have a feel 
for where it could be applied in this case, since it appears to 
be focused on multi-dimensional slices of the same data type, 
slicing up a single range.


The dataframes often consist of different data types by column.

How did you see the nd slices being used?

Maybe the nd slices could be applied if you considered each row
to be the same structure, and slice by rows rather than
operating on columns.  Pandas supports a multi-dimension panel.
Maybe this would be the application for nd slices by row.


You might not build on the nd slice type itself, but implementing 
the same API (where possible/appropriate) would be good.


Re: dataframe implementations

2015-11-19 Thread ZombineDev via Digitalmars-d-learn

On Thursday, 19 November 2015 at 06:33:06 UTC, Jay Norwood wrote:

On Wednesday, 18 November 2015 at 22:46:01 UTC, jmh530 wrote:
My sense is that any data frame implementation should try to 
build on the work that's being done with n-dimensional slices.


I've been watching that development, but I don't have a feel 
for where it could be applied in this case, since it appears to 
be focused on multi-dimensional slices of the same data type, 
slicing up a single range.


The dataframes often consist of different data types by column.

How did you see the nd slices being used?

Maybe the nd slices could be applied if you considered each row
to be the same structure, and slice by rows rather than
operating on columns.  Pandas supports a multi-dimension panel.
Maybe this would be the application for nd slices by row.


How about using an nd slice of Variant(s) [1], or a more
specialized Algebraic type?


[1]: http://dlang.org/phobos/std_variant


Re: dataframe implementations

2015-11-18 Thread Laeeth Isharc via Digitalmars-d-learn

On Monday, 2 November 2015 at 13:54:09 UTC, Jay Norwood wrote:
I was reading about the Julia dataframe implementation 
yesterday, trying to understand their decisions and how D might 
implement.


From my notes,
1. they are currently using a dictionary of column vectors.
2. for NA (not available) they are currently using an array of 
bytes, effectively as a Boolean flag, rather than a bitVector, 
for performance reasons.
3. they are not currently implementing hierarchical headers.
4. they are transforming non-valid symbol header strings (read 
from csv, for example) to valid symbols by replacing '.' with 
underscore and prefixing numbers with 'x', as examples.  This 
allows use in expressions.
5. Along with 4., they currently have @with for DataVector, to 
allow expressions to use, for example, :symbol_name instead of 
dv[:symbol_name].
6. They have operator symbols for per-element operations on 
two vectors; for example, a ./ b applies the division element 
by element.
7. They currently only have row indexes, no row names or 
symbols.


I saw someone posting that they were working on DataFrame 
implementation here, but haven't been able to locate any code 
in github, and was wondering what implementation decisions are 
being made here.  Thanks.
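
Notes 1, 6 and 7 in the list above can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the Julia implementation: `make_frame` and `elementwise_div` are hypothetical helper names, and the sample data is made up.

```python
def make_frame(**columns):
    """Build a frame as a mapping of column name -> column vector (note 1)."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    return {name: list(values) for name, values in columns.items()}


def elementwise_div(a, b):
    """Per-element division of two equal-length columns, like a ./ b (note 6)."""
    if len(a) != len(b):
        raise ValueError("length mismatch")
    return [x / y for x, y in zip(a, b)]


df = make_frame(price=[10.0, 20.0, 30.0], shares=[2.0, 4.0, 5.0])
df["per_share"] = elementwise_div(df["price"], df["shares"])

# Rows are addressed only by integer position (note 7).
second_row = {name: column[1] for name, column in df.items()}
```

With this layout, adding a derived column is a single dictionary assignment, while pulling out one row means visiting every column.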


What do you think about the use of NaN for missing floats?  In 
theory I could imagine wanting to distinguish between an NaN in 
the source file and a missing value, but in my world I never felt 
the need for this.  For integers and bools, that is different of 
course.
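
To make the trade-off concrete, here is a small Python sketch of the NaN-as-missing convention; `parse_float_cell` is a hypothetical helper and the cell values are made up. An empty cell and a literal "NaN" in the source both end up as NaN, so the two cases cannot be told apart afterwards.

```python
import math


def parse_float_cell(text):
    """Parse one CSV cell; an empty cell becomes NaN (the missing sentinel)."""
    text = text.strip()
    return math.nan if text == "" else float(text)


cells = ["1.5", "", "NaN", "2.0"]
column = [parse_float_cell(cell) for cell in cells]

# Both the empty cell and the literal "NaN" are now flagged the same way.
missing = [math.isnan(value) for value in column]
```

For integer and bool columns there is no spare value like NaN, which is why those need a reserved value or a separate flag.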




Re: dataframe implementations

2015-11-18 Thread Laeeth Isharc via Digitalmars-d-learn

On Tuesday, 17 November 2015 at 13:56:14 UTC, Jay Norwood wrote:

I looked through the dataframe code and a couple of comments...

I had thought perhaps an app could read in the header info and 
type info from hdf5, and generate D struct definitions with 
column headers as symbol names.  That would enable faster 
processing than with the associative arrays, as well as support 
the auto-completion that would be helpful in writing 
expressions.


Yes - I think that one will want to have a choice between this 
kind of approach and using associative arrays.  Because for some 
purposes it's not convenient to have to compile code every time 
you open a strange file, and on the other hand the hit with an AA 
sometimes will matter.


The situation at the moment for me is that I have very little 
time to work on a correct general solution for this problem 
myself (yet it's important for D that we do get to one).  I also 
lack the experience with D to do it very well very quickly.  I do 
have a couple of seasoned people from the community helping me 
with things, but dataframes won't be the first thing they look 
at, and it could be a while before we get to that.  If we 
implement for our own needs, then I will open source it as it is 
commercially sensible as well as the right thing to do.  But that 
could be a year away.


Vlad Levenfeld was also looking at this a bit.


The csv type info for columns could be inferred, or else stated 
in the reader call, as done as an option in julia.


In both cases the column names would have to be valid symbol 
names for this to work.  I believe Julia also expects this, or 
else does some conversion on your column names to make them 
valid symbols. I think the D csv processing would also need to 
check if the


The jupyter interactive environment supports python pandas and 
Julia dataframe column names in the autocompletion, and so I 
think the D debugging environment would need to provide similar 
capability if it is to be considered as a fast-recompile 
substitute for interactive dataframe exploration.


Well we don't need to get there in a single bound - already just 
being able to do this at all is a big improvement, and I am 
already using D with jupyter to do things.


It seems to me that your particular examples of stock data 
would eventually need to handle missing data, as supported in 
Julia dataframes and python pandas.  They both provide ways to 
drop or fill missing values.  Did you want to support that?


Yes - we should do so eventually, and there's much more that 
could be done.  But maybe a sensible basic implementation is a 
start and we can refine after that.


I wrote the dataframe in a couple of evenings, so I am sure it 
can be improved, and even rearchitected.  Pull requests welcomed, 
and maybe we should set up a Trello to organise ideas?  Let me 
know if you are in.




Re: dataframe implementations

2015-11-18 Thread Jay Norwood via Digitalmars-d-learn

On Wednesday, 18 November 2015 at 17:15:38 UTC, Laeeth Isharc wrote:
What do you think about the use of NaN for missing floats?  In 
theory I could imagine wanting to distinguish between an NaN in 
the source file and a missing value, but in my world I never 
felt the need for this.  For integers and bools, that is 
different of course.


The Julia discussions mention another dataframe implementation, I 
believe it was R's, where NaN was used.  There was some mention 
of the virtues of their own choice and the problems with NaN.  I 
think the NA there was one particular encoding of NaN.  Other 
implementations they mentioned reserve some value in each 
of the numeric data types to represent NA.  In the Julia case, I 
believe they use a separate byte vector for each column 
that holds the NA status.  They discussed some other possible 
enhancements, but I don't know what they implemented.  For 
example, if the single byte holds the NA flag, the cell value can 
hold additional info, maybe the reason for the NA.  There was 
also some discussion of having the associated cell hold repeat 
counts for the NA status, which I suppose meant repeating it for 
the following cells in the column vector.  I'll try to find the 
discussions and post the link.
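
The separate byte-vector approach described above might look something like the sketch below. It is Python with a hypothetical `MaskedColumn` class, not the Julia code; the extra ideas mentioned (reason codes, repeat counts) are not modelled. One byte per cell records the NA status, so a genuine NaN in the data stays distinct from a missing value.

```python
from array import array


class MaskedColumn:
    """Float column with a parallel one-byte-per-cell NA mask (1 = NA)."""

    def __init__(self, values):
        # None marks a missing cell; its slot in `data` holds a placeholder.
        self.data = array("d", (0.0 if v is None else v for v in values))
        self.na = bytearray(1 if v is None else 0 for v in values)

    def dropna(self):
        """Values of all cells that are not flagged NA."""
        return [v for v, flag in zip(self.data, self.na) if not flag]

    def fillna(self, fill):
        """All cells, with NA cells replaced by `fill`."""
        return [fill if flag else v for v, flag in zip(self.data, self.na)]


col = MaskedColumn([1.5, None, float("nan"), 2.0])
```

Here `col.dropna()` keeps the NaN cell because it was present in the data, and only the `None` cell is dropped or filled.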





Re: dataframe implementations

2015-11-18 Thread Jay Norwood via Digitalmars-d-learn

On Wednesday, 18 November 2015 at 18:04:30 UTC, Jay Norwood wrote:

vector.  I'll try to find the discussions and post the link.


Here are the two discussions I recall on the julia NA 
implementation.


http://wizardmac.tumblr.com/post/104019606584/whats-wrong-with-statistics-in-julia-a-reply
https://github.com/JuliaLang/julia/pull/9363




Re: dataframe implementations

2015-11-18 Thread Jay Norwood via Digitalmars-d-learn

On Wednesday, 18 November 2015 at 22:46:01 UTC, jmh530 wrote:
My sense is that any data frame implementation should try to 
build on the work that's being done with n-dimensional slices.


I've been watching that development, but I don't have a feel for 
where it could be applied in this case, since it appears to be 
focused on multi-dimensional slices of the same data type, 
slicing up a single range.


The dataframes often consist of different data types by column.

How did you see the nd slices being used?

Maybe the nd slices could be applied if you considered each row 
to be the same structure, and slice by rows rather than operating 
on columns.  Pandas supports a multi-dimension panel.  Maybe this 
would be the application for nd slices by row.





Re: dataframe implementations

2015-11-18 Thread jmh530 via Digitalmars-d-learn

On Monday, 2 November 2015 at 13:54:09 UTC, Jay Norwood wrote:


I saw someone posting that they were working on DataFrame 
implementation here, but haven't been able to locate any code 
in github, and was wondering what implementation decisions are 
being made here.  Thanks.


My sense is that any data frame implementation should try to 
build on the work that's being done with n-dimensional slices.





Re: dataframe implementations

2015-11-18 Thread Jay Norwood via Digitalmars-d-learn

One more discussion link on the NA subject. This one is on the R 
implementation of NA using a single encoding of NaN, as well as 
their treatment of a selected integer value as NA.


http://rsnippets.blogspot.com/2013/12/gnu-r-vs-julia-is-it-only-matter-of.html



Re: dataframe implementations

2015-11-17 Thread Jay Norwood via Digitalmars-d-learn

I looked through the dataframe code and a couple of comments...

I had thought perhaps an app could read in the header info and 
type info from hdf5, and generate D struct definitions with 
column headers as symbol names.  That would enable faster 
processing than with the associative arrays, as well as support 
the auto-completion that would be helpful in writing expressions.
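
As a rough sketch of the generation step described above, a small tool could emit D struct source text from the column names and types it finds. The Python below is only an illustration under assumptions: `d_struct_source` and the `D_TYPES` mapping are hypothetical, the column list is made up, and reading the header from hdf5 is left out entirely.

```python
# Hypothetical mapping from inferred column kinds to D type names.
D_TYPES = {"float": "double", "int": "long", "str": "string", "bool": "bool"}


def d_struct_source(struct_name, columns):
    """Return D source text for a struct with one field per column.

    `columns` is a list of (name, kind) pairs; the names are assumed to be
    valid D identifiers already.
    """
    lines = ["struct %s" % struct_name, "{"]
    for name, kind in columns:
        lines.append("    %s %s;" % (D_TYPES[kind], name))
    lines.append("}")
    return "\n".join(lines)


source = d_struct_source(
    "Row", [("date", "str"), ("close", "float"), ("volume", "int")]
)
```

The generated text would then be compiled together with the analysis code, so field names such as `close` become ordinary symbols that an editor can complete.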


The csv type info for columns could be inferred, or else stated 
in the reader call, as done as an option in julia.


In both cases the column names would have to be valid symbol 
names for this to work.  I believe Julia also expects this, or 
else does some conversion on your column names to make them valid 
symbols. I think the D csv processing would also need to check if 
the


The jupyter interactive environment supports python pandas and 
Julia dataframe column names in the autocompletion, and so I 
think the D debugging environment would need to provide similar 
capability if it is to be considered as a fast-recompile 
substitute for interactive dataframe exploration.


It seems to me that your particular examples of stock data would 
eventually need to handle missing data, as supported in Julia 
dataframes and python pandas.  They both provide ways to drop or 
fill missing values.  Did you want to support that?
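
The column-name conversion mentioned above could be sketched like this. It is a Python illustration with a hypothetical `to_symbol` helper; it follows the two rules quoted in this thread ('.' to underscore, 'x' prefix for leading digits) and does not claim to reproduce Julia's exact behaviour. Clashes with reserved words or with other columns are not handled.

```python
import re


def to_symbol(header):
    """Turn a raw column header into an identifier-like name."""
    # Replace '.' and any other non-word character with an underscore.
    name = re.sub(r"\W", "_", header.strip())
    # Identifiers cannot be empty or start with a digit, so prefix 'x'.
    if name == "" or name[0].isdigit():
        name = "x" + name
    return name


headers = ["adj.close", "2015", "52 wk high", "volume"]
symbols = [to_symbol(h) for h in headers]
```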




Re: dataframe implementations

2015-11-02 Thread Laeeth Isharc via Digitalmars-d-learn

On Monday, 2 November 2015 at 13:54:09 UTC, Jay Norwood wrote:
I was reading about the Julia dataframe implementation 
yesterday, trying to understand their decisions and how D might 
implement.


From my notes,
1. they are currently using a dictionary of column vectors.
2. for NA (not available) they are currently using an array of 
bytes, effectively as a Boolean flag, rather than a bitVector, 
for performance reasons.
3. they are not currently implementing hierarchical headers.
4. they are transforming non-valid symbol header strings (read 
from csv, for example) to valid symbols by replacing '.' with 
underscore and prefixing numbers with 'x', as examples.  This 
allows use in expressions.
5. Along with 4., they currently have @with for DataVector, to 
allow expressions to use, for example, :symbol_name instead of 
dv[:symbol_name].
6. They have operator symbols for per-element operations on 
two vectors; for example, a ./ b applies the division element 
by element.
7. They currently only have row indexes, no row names or 
symbols.


I saw someone posting that they were working on DataFrame 
implementation here, but haven't been able to locate any code 
in github, and was wondering what implementation decisions are 
being made here.  Thanks.


Hi Jay.

That may have been me.  I have implemented something very basic, 
but you can read and write my proto dataframe to/from CSV and 
HDF5.  The code is up here:


https://github.com/Laeeth/d_dataframes

You should think of it as a crude prototype that nonetheless has 
been useful for me, but it's done more in the old school hacker 
spirit of getting something working first rather than being 
designed properly.  The reason for that is I have a lot on my 
plate at the moment, and technology is only one of many of these, 
although an important one.  In time I may get someone else to 
work on dataframes and opensource the results, but that may be 
some months away.


So I'd welcome any assistance, or even taking it over.  I haven't 
really done a good job of having idiomatic access, but it's 
something and a start.



Laeeth.




Re: dataframe implementations

2015-11-02 Thread Jay Norwood via Digitalmars-d-learn

On Monday, 2 November 2015 at 15:33:34 UTC, Laeeth Isharc wrote:

Hi Jay.

That may have been me.  I have implemented something very 
basic, but you can read and write my proto dataframe to/from 
CSV and HDF5.  The code is up here:


https://github.com/Laeeth/d_dataframes



Yes, thanks.  I believe I did see your comments previously.
That's great that you've already got support for hdf5. I'll take 
a look.