Re: dataframe implementations
On Saturday, 21 November 2015 at 14:16:26 UTC, Laeeth Isharc wrote: Not sure it is a great idea to use a variant as the basic option when very often you will know that every cell in a particular column will be of the same type. I'm reading today about an n-dim extension to pandas named xray. Maybe I should try to understand how that fits. It supports I/O from netCDF, and extensions are being made to support blocked input using dask, so it can process data larger than in-memory limits. http://xray.readthedocs.org/en/stable/data-structures.html https://www.continuum.io/content/xray-dask-out-core-labeled-arrays-python In general, pandas and xray are built around the requirement of pulling in data from storage with initially unknown column and index names and data types. Julia adds support for JIT compilation and operations specialized for different data types. It seems to me that D's strength would be its quick compiles, which would allow you to replace the dictionary-tag implementations and variants with something that used compile-time symbol names and data types. That would provide more efficient processing, as well as better tab-completion support when creating expressions.
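The dask-style blocked processing mentioned above can be illustrated with a minimal sketch (the function names here are hypothetical, not dask's API): a column is consumed in fixed-size chunks so only one chunk is ever resident in memory, and per-chunk partial results are combined at the end.

```python
# Illustrative sketch of blocked ("out-of-core") reduction: process a
# column in fixed-size chunks and combine per-chunk partial results.

def chunks(values, size):
    """Yield successive blocks of at most `size` items."""
    for i in range(0, len(values), size):
        yield values[i:i + size]

def blocked_mean(values, size=4):
    """Mean computed from per-chunk (sum, count) partials."""
    total, count = 0.0, 0
    for block in chunks(values, size):
        total += sum(block)   # partial reduction over one block
        count += len(block)
    return total / count

print(blocked_mean([1.0, 2.0, 3.0, 4.0, 5.0, 6.0]))  # 3.5
```

In a real out-of-core setting the chunks would come from disk (e.g. netCDF blocks) rather than a slice of an in-memory list, but the combine step is the same.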
Re: dataframe implementations
On Thursday, 19 November 2015 at 22:14:01 UTC, ZombineDev wrote: On Thursday, 19 November 2015 at 06:33:06 UTC, Jay Norwood wrote: On Wednesday, 18 November 2015 at 22:46:01 UTC, jmh530 wrote: My sense is that any data frame implementation should try to build on the work that's being done with n-dimensional slices. I've been watching that development, but I don't have a feel for where it could be applied in this case, since it appears to be focused on multi-dimensional slices of the same data type, slicing up a single range. The dataframes often consist of different data types by column. How did you see the nd slices being used? Maybe the nd slices could be applied if you considered each row to be the same structure, and slice by rows rather than operating on columns. Pandas supports a multi-dimension panel. Maybe this would be the application for nd slices by row. How about using an nd slice of Variant(s), or a more specialized Algebraic type? [1]: http://dlang.org/phobos/std_variant Not sure it is a great idea to use a variant as the basic option when very often you will know that every cell in a particular column will be of the same type.
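The objection to variant cells can be sketched in a few lines (Python stands in for D here; a list of arbitrary objects plays the role of a Variant column, and `array.array` the role of a homogeneous typed column):

```python
# Contrast between a per-cell "variant" column (a Python list can hold
# anything) and a homogeneous typed column (array.array enforces one
# element type), mirroring the point that most dataframe columns are
# uniformly typed.
from array import array

variant_col = [1, "two", 3.0]             # every cell can differ in type
typed_col = array('d', [1.0, 2.0, 3.0])   # fixed float64 storage

# A typed column rejects values of the wrong type up front:
try:
    typed_col.append("four")
except TypeError:
    print("typed column rejects non-float values")
```

The typed column also stores values unboxed and contiguously, which is where the per-element performance win over a variant comes from.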
Re: dataframe implementations
On Thursday, 19 November 2015 at 06:33:06 UTC, Jay Norwood wrote: Maybe the nd slices could be applied if you considered each row to be the same structure, and slice by rows rather than operating on columns. Pandas supports a multi-dimension panel. Maybe this would be the application for nd slices by row. I meant in the sense that Pandas is built upon Numpy.
Re: dataframe implementations
On Thursday, 19 November 2015 at 06:33:06 UTC, Jay Norwood wrote: On Wednesday, 18 November 2015 at 22:46:01 UTC, jmh530 wrote: My sense is that any data frame implementation should try to build on the work that's being done with n-dimensional slices. I've been watching that development, but I don't have a feel for where it could be applied in this case, since it appears to be focused on multi-dimensional slices of the same data type, slicing up a single range. The dataframes often consist of different data types by column. How did you see the nd slices being used? Maybe the nd slices could be applied if you considered each row to be the same structure, and slice by rows rather than operating on columns. Pandas supports a multi-dimension panel. Maybe this would be the application for nd slices by row. You might not build on the nd slice type itself, but implementing the same API (where possible/appropriate) would be good.
Re: dataframe implementations
On Thursday, 19 November 2015 at 06:33:06 UTC, Jay Norwood wrote: On Wednesday, 18 November 2015 at 22:46:01 UTC, jmh530 wrote: My sense is that any data frame implementation should try to build on the work that's being done with n-dimensional slices. I've been watching that development, but I don't have a feel for where it could be applied in this case, since it appears to be focused on multi-dimensional slices of the same data type, slicing up a single range. The dataframes often consist of different data types by column. How did you see the nd slices being used? Maybe the nd slices could be applied if you considered each row to be the same structure, and slice by rows rather than operating on columns. Pandas supports a multi-dimension panel. Maybe this would be the application for nd slices by row. How about using an nd slice of Variant(s), or a more specialized Algebraic type? [1]: http://dlang.org/phobos/std_variant
Re: dataframe implementations
On Monday, 2 November 2015 at 13:54:09 UTC, Jay Norwood wrote: I was reading about the Julia dataframe implementation yesterday, trying to understand their decisions and how D might implement. From my notes:
1. They are currently using a dictionary of column vectors.
2. For NA (not available) they are currently using an array of bytes, effectively as a Boolean flag, rather than a bit vector, for performance reasons.
3. They are not currently implementing hierarchical headers.
4. They are transforming non-valid symbol header strings (read from csv, for example) to valid symbols by replacing '.' with underscore and prefixing numbers with 'x', as examples. This allows use in expressions.
5. Along with 4, they currently have @with for DataVector, to allow expressions to use, for example, :symbol_name instead of dv[:symbol_name].
6. They have operation symbols for per-element operations on two vectors; for example, a ./ b applies the operation element-wise.
7. They currently only have row indexes, no row names or symbols.
I saw someone posting that they were working on a DataFrame implementation here, but I haven't been able to locate any code on github, and was wondering what implementation decisions are being made here. Thanks.
What do you think about the use of NaN for missing floats? In theory I could imagine wanting to distinguish between a NaN in the source file and a missing value, but in my world I never felt the need for this. For integers and bools, that is different of course.
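Points 1 and 2 above can be sketched together (the class and method names here are hypothetical, and Python stands in for the Julia/D implementations): a frame is a dictionary of column vectors, each paired with a parallel byte vector marking NA cells, bytes rather than packed bits for speed.

```python
# Minimal sketch of the Julia-style layout described above: a dictionary
# of column vectors, each with a companion bytearray where 1 = missing.
from array import array

class Frame:
    def __init__(self):
        self.columns = {}   # column name -> array of values
        self.na = {}        # column name -> bytearray of NA flags

    def add_column(self, name, values, na_flags=None):
        self.columns[name] = array('d', values)
        self.na[name] = bytearray(na_flags or [0] * len(values))

    def dropna(self, name):
        """Values of one column with NA cells removed."""
        return [v for v, flag in zip(self.columns[name], self.na[name])
                if not flag]

f = Frame()
f.add_column("price", [10.0, 0.0, 12.5], na_flags=[0, 1, 0])
print(f.dropna("price"))  # [10.0, 12.5]
```

The byte mask costs 8x the memory of a bit vector but avoids the shift-and-mask work on every cell access, which is the performance trade-off point 2 refers to.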
Re: dataframe implementations
On Tuesday, 17 November 2015 at 13:56:14 UTC, Jay Norwood wrote: I looked through the dataframe code and a couple of comments... I had thought perhaps an app could read in the header info and type info from hdf5, and generate D struct definitions with column headers as symbol names. That would enable faster processing than with the associative arrays, as well as support the auto-completion that would be helpful in writing expressions. Yes - I think that one will want to have a choice between this kind of approach and using associative arrays, because for some purposes it's not convenient to have to compile code every time you open a strange file, and on the other hand the hit with an AA sometimes will matter. The situation at the moment for me is that I have very little time to work on a correct general solution for this problem myself (yet it's important for D that we do get to one). I also lack the experience with D to do it very well very quickly. I do have a couple of seasoned people from the community helping me with things, but dataframes won't be the first thing they look at, and it could be a while before we get to that. If we implement for our own needs, then I will open-source it, as that is commercially sensible as well as the right thing to do. But that could be a year away. Vlad Levenfeld was also looking at this a bit. The csv type info for columns could be inferred, or else stated in the reader call, as is done as an option in Julia. In both cases the column names would have to be valid symbol names for this to work. I believe Julia also expects this, or else does some conversion on your column names to make them valid symbols.
I think the D csv processing would also need to check if the column names are valid symbols. The jupyter interactive environment supports python pandas and Julia dataframe column names in the autocompletion, and so I think the D debugging environment would need to provide similar capability if it is to be considered as a fast-recompile substitute for interactive dataframe exploration. Well, we don't need to get there in a single bound - already just being able to do this at all is a big improvement, and I am already using D with jupyter to do things. It seems to me that your particular examples of stock data would eventually need to handle missing data, as supported in Julia dataframes and python pandas. They both provide ways to drop or fill missing values. Did you want to support that? Yes - we should do so eventually, and there's much more that could be done. But maybe a sensible basic implementation is a start and we can refine after that. I wrote the dataframe in a couple of evenings, so I am sure it can be improved, and even rearchitected. Pull requests welcomed, and maybe we should set up a Trello to organise ideas? Let me know if you are in.
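The "read the header, infer the types, generate a record type" idea above can be sketched in Python (a namedtuple stands in for the generated D struct; the sample data and `infer` helper are hypothetical):

```python
# Sketch of code generation from header info: read a CSV header plus a
# sample row, infer a per-column type, and build a record type whose
# field names are the column headers.
import csv, io
from collections import namedtuple

sample = "date,open,close\n2015-11-02,25.1,25.9\n2015-11-03,25.9,26.4\n"

reader = csv.reader(io.StringIO(sample))
header = next(reader)
rows = list(reader)

def infer(cell):
    """Crude per-column type inference from one sample cell."""
    try:
        float(cell)
        return float
    except ValueError:
        return str

types = [infer(c) for c in rows[0]]
Row = namedtuple("Row", header)          # generated record type
records = [Row(*(t(c) for t, c in zip(types, row))) for row in rows]

print(records[0].close)  # field access by column name: 25.9
```

In the D version the same step would happen ahead of time, emitting struct source that the fast compile then turns into statically typed, tab-completable field accesses.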
Re: dataframe implementations
On Wednesday, 18 November 2015 at 17:15:38 UTC, Laeeth Isharc wrote: What do you think about the use of NaN for missing floats? In theory I could imagine wanting to distinguish between a NaN in the source file and a missing value, but in my world I never felt the need for this. For integers and bools, that is different of course. The Julia discussions mention another dataframe implementation, I believe for R, where NaN was used. There was some mention of the virtues of their own choice and the problems with NaN. I think the use of NaN there was a particular encoding of NaN. Other implementations they mentioned used some reserved value in each of the numeric data types to represent NA. In the Julia case, I believe what they use is a separate byte vector for each column that holds the NA status. They discussed some other possible enhancements, but I don't know what they implemented. For example, if the single byte holds the NA flag, the cell value can hold additional info ... maybe the reason for the NA. There was also some discussion of having the associated cell hold repeat counts for the NA status, which I suppose was meant to repeat it for following cells in the column vector. I'll try to find the discussions and post the link.
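The "particular encoding of NaN" approach can be demonstrated directly: IEEE 754 doubles have many distinct NaN bit patterns, so an implementation can reserve one payload to mean "NA" while every other NaN means a failed computation. R is reported to use the payload 1954 in the low word for its NA, but treat that specific constant (and the helper names) as assumptions in this sketch.

```python
# Illustration of reserving one NaN bit pattern as "NA" while other NaNs
# mean an ordinary failed computation.
import math, struct

def make_nan(payload):
    """Build a quiet NaN carrying `payload` in its mantissa bits."""
    bits = 0x7FF8000000000000 | payload
    return struct.unpack('<d', struct.pack('<Q', bits))[0]

na = make_nan(1954)                      # the reserved "missing" NaN
computed = float('inf') - float('inf')   # ordinary computational NaN

def is_na(x):
    """True only for the reserved NA bit pattern, not for any NaN."""
    if not math.isnan(x):
        return False
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]
    return (bits & 0xFFFFFFFF) == 1954   # check the low-word payload

print(is_na(na), is_na(computed))  # True False
```

The downside the Julia discussions point at is visible here too: the distinction survives only as long as nothing does arithmetic on the value, since floating-point operations are free to canonicalize NaN payloads.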
Re: dataframe implementations
On Wednesday, 18 November 2015 at 18:04:30 UTC, Jay Norwood wrote: I'll try to find the discussions and post the link. Here are the two discussions I recall on the Julia NA implementation. http://wizardmac.tumblr.com/post/104019606584/whats-wrong-with-statistics-in-julia-a-reply https://github.com/JuliaLang/julia/pull/9363
Re: dataframe implementations
On Wednesday, 18 November 2015 at 22:46:01 UTC, jmh530 wrote: My sense is that any data frame implementation should try to build on the work that's being done with n-dimensional slices. I've been watching that development, but I don't have a feel for where it could be applied in this case, since it appears to be focused on multi-dimensional slices of the same data type, slicing up a single range. The dataframes often consist of different data types by column. How did you see the nd slices being used? Maybe the nd slices could be applied if you considered each row to be the same structure, and slice by rows rather than operating on columns. Pandas supports a multi-dimension panel. Maybe this would be the application for nd slices by row.
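The row-wise reading suggested above can be sketched simply (Python tuples stand in for a D row struct): if every row is the same record type, an nd-slice-style window over rows is just a slice of a homogeneous sequence.

```python
# Sketch of the row-wise view: each row is one fixed-shape record, so
# slicing by rows operates on a homogeneous sequence even though the
# columns inside a row have different types.
frame = [("AAPL", 120.5), ("MSFT", 54.3), ("GOOG", 724.9), ("IBM", 135.2)]

window = frame[1:3]   # a contiguous slice of rows, all one record type
print([sym for sym, _ in window])  # ['MSFT', 'GOOG']
```

Column-wise operations are then projections across the slice, which is the trade-off against the dictionary-of-columns layout discussed elsewhere in the thread.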
Re: dataframe implementations
On Monday, 2 November 2015 at 13:54:09 UTC, Jay Norwood wrote: I saw someone posting that they were working on DataFrame implementation here, but haven't been able to locate any code in github, and was wondering what implementation decisions are being made here. Thanks. My sense is that any data frame implementation should try to build on the work that's being done with n-dimensional slices.
Re: dataframe implementations
One more discussion link on the NA subject. This one is on the R implementation of NA using a single encoding of NaN, as well as its treatment of a selected integer value as NA. http://rsnippets.blogspot.com/2013/12/gnu-r-vs-julia-is-it-only-matter-of.html
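The "selected integer value as NA" scheme from the link can be sketched as follows (R is reported to sacrifice INT_MIN for its 32-bit integer NA; the helper names here are hypothetical):

```python
# Sketch of the reserved-sentinel scheme for integer NA: one value of the
# type's range is given up to mean "missing", so no extra mask is needed.
NA_INTEGER = -2**31   # -2147483648, the reserved sentinel

def is_na_int(x):
    return x == NA_INTEGER

column = [3, NA_INTEGER, 7]
clean = [v for v in column if not is_na_int(v)]
print(clean)  # [3, 7]
```

The cost is that ordinary arithmetic can silently produce or consume the sentinel, which is one of the problems the Julia byte-vector approach avoids.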
Re: dataframe implementations
I looked through the dataframe code and a couple of comments... I had thought perhaps an app could read in the header info and type info from hdf5, and generate D struct definitions with column headers as symbol names. That would enable faster processing than with the associative arrays, as well as support the auto-completion that would be helpful in writing expressions. The csv type info for columns could be inferred, or else stated in the reader call, as is done as an option in Julia. In both cases the column names would have to be valid symbol names for this to work. I believe Julia also expects this, or else does some conversion on your column names to make them valid symbols. I think the D csv processing would also need to check that the column names are valid symbols. The jupyter interactive environment supports python pandas and Julia dataframe column names in the autocompletion, and so I think the D debugging environment would need to provide similar capability if it is to be considered as a fast-recompile substitute for interactive dataframe exploration. It seems to me that your particular examples of stock data would eventually need to handle missing data, as supported in Julia dataframes and python pandas. They both provide ways to drop or fill missing values. Did you want to support that?
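The header-to-symbol conversion described above (the Julia rules from earlier in the thread: replace invalid characters with underscores, prefix a leading digit with 'x') can be sketched in a few lines; the function name is hypothetical:

```python
# Sketch of Julia-style header sanitization: turn arbitrary CSV column
# headers into valid identifier / symbol names.
import re

def sanitize(name):
    name = re.sub(r'\W', '_', name)   # '.', spaces, etc. -> '_'
    if name and name[0].isdigit():
        name = 'x' + name             # identifiers cannot start with a digit
    return name

print(sanitize("adj.close"))   # adj_close
print(sanitize("52wk high"))   # x52wk_high
```

A D reader would apply the same normalization before emitting struct field names, so generated symbols always compile and tab-complete.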
Re: dataframe implementations
On Monday, 2 November 2015 at 13:54:09 UTC, Jay Norwood wrote: I was reading about the Julia dataframe implementation yesterday, trying to understand their decisions and how D might implement. From my notes:
1. They are currently using a dictionary of column vectors.
2. For NA (not available) they are currently using an array of bytes, effectively as a Boolean flag, rather than a bit vector, for performance reasons.
3. They are not currently implementing hierarchical headers.
4. They are transforming non-valid symbol header strings (read from csv, for example) to valid symbols by replacing '.' with underscore and prefixing numbers with 'x', as examples. This allows use in expressions.
5. Along with 4, they currently have @with for DataVector, to allow expressions to use, for example, :symbol_name instead of dv[:symbol_name].
6. They have operation symbols for per-element operations on two vectors; for example, a ./ b applies the operation element-wise.
7. They currently only have row indexes, no row names or symbols.
I saw someone posting that they were working on a DataFrame implementation here, but I haven't been able to locate any code on github, and was wondering what implementation decisions are being made here. Thanks.
Hi Jay. That may have been me. I have implemented something very basic, but you can read and write my proto dataframe to/from CSV and HDF5. The code is up here: https://github.com/Laeeth/d_dataframes You should think of it as a crude prototype that has nonetheless been useful for me; it's done more in the old-school hacker spirit of getting something working first rather than being designed properly. The reason for that is I have a lot on my plate at the moment, and technology is only one of many of those things, although an important one. In time I may get someone else to work on dataframes and open-source the results, but that may be some months away. So I'd welcome any assistance, or even someone taking it over.
I haven't really done a good job of having idiomatic access, but it's something and a start. Laeeth.
Re: dataframe implementations
On Monday, 2 November 2015 at 15:33:34 UTC, Laeeth Isharc wrote: Hi Jay. That may have been me. I have implemented something very basic, but you can read and write my proto dataframe to/from CSV and HDF5. The code is up here: https://github.com/Laeeth/d_dataframes Yes, thanks. I believe I did see your comments previously. It's great that you've already got support for hdf5. I'll take a look.