Hii, I think replacing boost spirit can be the base step for creating the new data frame. I believe that this is a great chance to implement this as we can even design the new CSV parser keeping in mind the new Dataframe class. As I went through this issue <https://github.com/mlpack/mlpack/issues/2646>, I can see that there are issues that we have some groundwork done but a major point still remains undecided, should we implement an in-house parser or use some already existing parse. The discussion in the issues talks about the advantages and disadvantages of both methods.
I think making an in-house parser(as a single header file) can be an option. I am currently going through both the options that Ryan mentioned in the issue: - https://github.com/ben-strasser/fast-cpp-csv-parser - https://github.com/d99kris/rapidcsv I am working on making a small prototype to load a basic CSV file considering we want to go with an in-house parser. I will update you by the end of the day regarding the prototype. Would love to hear your thoughts on whether to go with an already implemented parser or build a new one. Also if we are planning to build a data frame here then maybe going with an in-house parser would be better as we will have the ability to design it in such a way that it can extend maximum support to the new data frame which we are planning to build ahead. Thanks, Gopi On Sun, Mar 28, 2021 at 9:08 PM Omar Shrit <[email protected]> wrote: > Hello Gopi, > > On 03/28, Gopi Manohar Tatiraju wrote: > > Hey Omar, > > > > Sorry to disturb you again. > > Did you saw the last mail by Ryan? > > How shall we proceed after that input? > > Yes of course, knowing this information will provide you with one step > ahead. > > Also, if you want to proceed with what Ryan said, you should know that > replacing boost spirit is itself a GSoC project, and you may want to > focus on it for your proposal if this what you want to do. > > Let me know what do you think. > > > Also, I have some more thoughts regarding the idea. > > I think currently we can divide the datasets we generally use into 3 or 4 > > categories: > > > > - Purely Numeric > > - Numeric + Categorical(include strings) > > - NLP datasets > > - Images > > > > And we already have many functionalities already implemented to use with > > different types of datasets. > > > > *How should we use this dataframe? * > > So currently, our DatasetMapper is doing somewhat what we are planning to > > do. I think we should follow the same > > convention for this dataframe as well. While loading the dataset just as > we > > are using the DatasetMapper object we > > can also create a Dataframe object and populate it with the needed > > properties which will depend on the type of data > > we are obtaining. Now I have a doubt here > > > > What I said above will create certainly two things, one the *arma::mat* > > itself which the use creates and then the > > *Dataframe object*. What we can also do is write a new load method that > > wraps arma::mat and the new data frame > > utility under one roof and we don't have to deal with 2 different > objects. > > > > > > *What I am planning to do?* > > Currently, we have many things that are already implemented that we can > > use, functionalities like *one_hot_encoding* > > or *binarizing* a whole column of the dataset based on a condition. We > > should wrap these features in this data frame like > > class. What I suggest it, for each type of dataset let's pick what we > > already have and make it easier for the user to > > access this method. I can see that we have methods regarding all the > > dataset categories that I've mentioned above. > > > > Keeping in mind the timeline of GSoC this time, I think we should > > concentrate on using the already implemented > > functionalities and make a nice API for users to take advantage of these. > > We can edit the examples and show basic > > data-processing using this new class. Once all the already implemented > > functionalities are utilized then we can go > > for implementing the new ones. > > > > Sorry for such a long mail, but I think these are some points that we > > should ponder about to go ahead with the idea. > > > > Egar to hear comments and suggestions. > > > > Thank you. > > Regards, > > Gopi > > > > On Tue, Mar 23, 2021 at 1:08 AM Gopi Manohar Tatiraju < > [email protected]> > > wrote: > > > > > Hey Ryan, > > > > > > Thanks for the feedback, I too looked at that issue you mentioned but > > > somehow I overlooked it and didn't mention it here. > > > > > > If that's the case then do you have any suggestions regarding how to > make > > > mlpack header only? > > > As pointed out by Omar the data load/save part from mlpack core is the > > > only part that has implementation files (.cpp) > > > while all methods of mlpack are header-only. So we were planning to > remove > > > these .cpp files. Currently, there are: > > > > > > 1. detect_file_type.cpp > > > 2. load.cpp > > > 3. load_csv.cpp > > > 4. load_image.cpp > > > 5. save_image.cpp > > > > > > Currently, these are .cpp files we have and we want to change them to > .hpp > > > to make mlpack header only. > > > All this is kinda pre-work for implementing data frame like class for > > > mlpack, which I've explained above. > > > > > > Thanks for the suggestions > > > Gopi > > > > > > On Mon, Mar 22, 2021 at 9:58 PM Ryan Curtin <[email protected]> wrote: > > > > > >> (I have not read the rest of the thread.) > > >> > > >> No need to check the compilation time---I can already tell you it will > > >> be disastrously large. We can't put load_csv.cpp into load_csv.hpp, > as > > >> we would then be including the boost::spirit headers in everything > that > > >> included mlpack, which we really need to avoid (because boost::spirit > is > > >> huge). The idea is to force templates to be instantiated in > > >> load_csv.cpp, so that they get compiled into libmlpack.so. > > >> > > >> Really we need to replace our custom CSV parser with something else so > > >> that we no longer have a boost::spirit dependency. See here: > > >> > > >> https://github.com/mlpack/mlpack/issues/2646 > > >> > > >> Thanks, > > >> > > >> Ryan > > >> > > >> On Mon, Mar 22, 2021 at 09:10:47PM +0530, Gopi Manohar Tatiraju wrote: > > >> > Hey, > > >> > > > >> > Thank you for your quick response. > > >> > > > >> > So I can see that we have boost::spirit in load_csv.cpp > > >> > > > >> > Currently we have 2 files for load_csv: > > >> > > > >> > 1. load_csv.hpp > > >> > 2. load_csv.cpp > > >> > > > >> > I will move the function implementations for .cpp file to .hpp file > and > > >> > will declare them as inline. So we will only have one .hpp file for > > >> > load_csv. > > >> > > > >> > I will also check the compilation time before making these changes. > > >> > > > >> > Sounds good? > > >> > > > >> > Thanks, > > >> > Gopi > > >> > > > >> > On Mon, Mar 22, 2021 at 8:30 PM Omar Shrit <[email protected]> wrote: > > >> > > > >> > > On 03/22, Gopi Manohar Tatiraju wrote: > > >> > > > Hey Omar, > > >> > > > > > >> > > > Thank you for the reply... > > >> > > > So what I understood is to make files _impl.hpp we need to > declare > > >> class > > >> > > > templates. > > >> > > > > > >> > > > But taking the example of *detect_file *we can see that we only > have > > >> > > > functions declared here. One file for declaring the functions > and > > >> one for > > >> > > > the implementation. > > >> > > > > > >> > > > So maybe, I need to structure these functions into classes, > right? > > >> Coz if > > >> > > > we want to put functions directly into the header files we need > to > > >> > > declare > > >> > > > them as inline to avoid the violation of C++'s one definition > rule. > > >> > > > > >> > > inline should be fine for now. > > >> > > > > >> > > > Also, how should I compare the compilation time? > > >> > > > > >> > > Use the time command with make in order to get the compilation > time > > >> > > > > >> > > > Also about the prototype and benchmark you mentioned, can you > > >> explain a > > >> > > bit > > >> > > > more about that. I really want to make this project work and I > am > > >> willing > > >> > > > to put my time in learning new things. > > >> > > > > >> > > You can start by moving class implementation (those using boost > > >> spirit) to > > >> > > header files > > >> > > and recompile mlpack. You need to measure the compilation time > before > > >> > > and after changing implementation into header, that will gives us > an > > >> > > idea about the increase related to boost spirit > > >> > > > > >> > > > Thank you for the guidance. > > >> > > > > > >> > > > PS. Sorry, I was busy with my college project so it took some > time > > >> to > > >> > > reply. > > >> > > > > > >> > > > Regards, > > >> > > > Gopi > > >> > > > > > >> > > > On Fri, Mar 19, 2021 at 12:42 AM Omar Shrit <[email protected]> > wrote: > > >> > > > > > >> > > > > Hello Gopi, > > >> > > > > > > >> > > > > Sorry, I am forgot to answer you earlier (A little bit busy), > > >> thanks > > >> > > for > > >> > > > > the ping. > > >> > > > > > > >> > > > > On 03/18, Gopi Manohar Tatiraju wrote: > > >> > > > > > Hey Omar, > > >> > > > > > > > >> > > > > > Thank you soo much for such detailed replies and inputs. > > >> > > > > > > > >> > > > > > I see that there are many things that we need to get in > order > > >> before > > >> > > we > > >> > > > > > start working on this idea. > > >> > > > > > I think we should start from this point: > > >> > > > > > > > >> > > > > > 1) It would be nicer to have mlpack as header-only by moving > > >> these > > >> > > > > > > implementations into header files. > > >> > > > > > > > >> > > > > > > > >> > > > > > I went through the mlpack/core/data/ directory and List of > .cpp > > >> > > files in > > >> > > > > > core/data > > >> > > > > > > > >> > > > > > 1. detect_file_type.cpp > > >> > > > > > 2. load.cpp > > >> > > > > > 3. load_csv.cpp > > >> > > > > > 4. load_image.cpp > > >> > > > > > 5. save_image.cpp > > >> > > > > > > > >> > > > > > The convention we follow in mlpack is having two .hpp files, > > >> one for > > >> > > > > > declarations and one for implementations. > > >> > > > > > > >> > > > > The convention is related to classes that are defined as > > >> > > > > class-template. I do not think it is possible to convert an > > >> > > implementation > > >> > > > > file to `_impl.hpp` if the class is not a class template. > > >> > > > > > > >> > > > > Therefore, you can put the entire implementation in the > header, > > >> where > > >> > > > > the class is defined. There are several disadvantages of > putting > > >> > > everything > > >> > > > > in one place, for instance, increase compilation time and > binary > > >> > > footprint. > > >> > > > > > > >> > > > > It would be nice to have a small prototype and a benchmark to > see > > >> how > > >> > > > > much time the compilation time will increase, since we are > trying > > >> to > > >> > > > > reduce these numbers. > > >> > > > > > > >> > > > > I did not look into the code yet, but I do not see anything > that > > >> might > > >> > > > > prevent from having everything in the headers. > > >> > > > > > > >> > > > > Also, do not worry about details such as convention, these > can be > > >> seen > > >> > > > > later especially if a prototype comes back with good results. > > >> > > > > > > >> > > > > > I just want to clarify some things so that I can start > writing > > >> some > > >> > > code > > >> > > > > > regarding this: > > >> > > > > > > > >> > > > > > - Let's consider detect_file_type, we have for now: > > >> > > > > > - detect_file_type.hpp > > >> > > > > > < > > >> > > > > > > >> > > > > >> > https://github.com/mlpack/mlpack/blob/master/src/mlpack/core/data/detect_file_type.hpp > > >> > > > > > > > >> > > > > > - detect_file_type.cpp > > >> > > > > > < > > >> > > > > > > >> > > > > >> > https://github.com/mlpack/mlpack/blob/master/src/mlpack/core/data/detect_file_type.cpp > > >> > > > > > > > >> > > > > > > > >> > > > > > What we want in the future: > > >> > > > > > > > >> > > > > > > > >> > > > > > - detect_file_type.hpp > > >> > > > > > - detect_file_type_impl.hpp > > >> > > > > > > > >> > > > > > Can you just give me one example or reference or just a > list of > > >> > > points > > >> > > > > that > > >> > > > > > we need to consider while restructuring these files? What > will > > >> be the > > >> > > > > major > > >> > > > > > differences after converting them to .hpp files? > > >> > > > > > > > >> > > > > > > > >> > > > > > It feels great getting feedback on my idea and I get to > learn > > >> soo > > >> > > many > > >> > > > > new > > >> > > > > > things on the way. > > >> > > > > > > > >> > > > > > Thank you, > > >> > > > > > Gopi. > > >> > > > > > > > >> > > > > > > > >> > > > > > On Sat, Mar 13, 2021 at 4:12 AM Omar Shrit <[email protected]> > > >> wrote: > > >> > > > > > > > >> > > > > > > Hello Gopi, > > >> > > > > > > > > >> > > > > > > The data frame class project is indeed a good idea, we > have > > >> thought > > >> > > > > > > about that, but as Ryan said, it can be a big project for > GSoC > > >> > > given > > >> > > > > > > the limited period of time this year. > > >> > > > > > > > > >> > > > > > > I have several ideas to add on what Ryan said. The > objective > > >> is to > > >> > > make > > >> > > > > > > the project lighter and more fit for a GSoC. > > >> > > > > > > > > >> > > > > > > Knowing that, the data load/save part from mlpack core is > the > > >> only > > >> > > part > > >> > > > > > > that > > >> > > > > > > has implementation files (.cpp) while all methods of > mlpack > > >> are > > >> > > > > > > header-only, > > >> > > > > > > therefore: > > >> > > > > > > > > >> > > > > > > 1) It would be nicer to have mlpack as header-only by > moving > > >> these > > >> > > > > > > implementations into header files. > > >> > > > > > > > > >> > > > > > > 2) I would avoid re-implementing things that have already > > >> > > implemented, > > >> > > > > > > especially that these parts of code (Loading, Saving, > Matrix > > >> > > > > > > manipulation, and conversion) need a lot of > > >> > > > > > > optimization, which requires years of work to have > something > > >> > > feasible. > > >> > > > > > > However, looking at Xtensor library seems to be similar to > > >> Pandas > > >> > > > > providing > > >> > > > > > > what is in need, in C++, with a good performance. > > >> > > > > > > > > >> > > > > > > 3) Xtensor integration can be realized by adding a mlpack > > >> wrapper > > >> > > > > > > (a small light wrapper) for Xtensor functionalities. This > > >> wrapper > > >> > > can > > >> > > > > be > > >> > > > > > > integrated into mlpack source code, or can be kept > separately > > >> (as > > >> > > > > > > ensmallen) > > >> > > > > > > allowing to be added when needed, therefore only link with > > >> library > > >> > > that > > >> > > > > > > we use (avoiding dependencies). > > >> > > > > > > > > >> > > > > > > Knowing that the above steps will require more than one > GSoC > > >> to > > >> > > > > > > complete, but they can be done independently. You can > choose > > >> what > > >> > > you > > >> > > > > > > find the most suitable and build a proposal upon it > allowing > > >> to > > >> > > have > > >> > > > > > > the most possible decoupling between the tasks in order to > > >> > > maximize the > > >> > > > > > > possible > > >> > > > > > > feasibility of the project. > > >> > > > > > > > > >> > > > > > > I hope you find this helpful ! > > >> > > > > > > > > >> > > > > > > Thanks, > > >> > > > > > > > > >> > > > > > > Omar > > >> > > > > > > > > >> > > > > > > > > >> > > > > > > On 03/13, Gopi Manohar Tatiraju wrote: > > >> > > > > > > > Hey Ryan, > > >> > > > > > > > > > >> > > > > > > > Thanks for the feedback. > > >> > > > > > > > I agree that this can be a very big project considering > the > > >> time > > >> > > > > span of > > >> > > > > > > > GSoC this year, if we decide to go ahead with this > project > > >> it > > >> > > will be > > >> > > > > > > very > > >> > > > > > > > important to decide on some base features as you already > > >> pointed > > >> > > out. > > >> > > > > > > > > > >> > > > > > > > how will users use this dataframe? > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > We should do it in the same way as we do DatasetInfo, > this > > >> will > > >> > > keep > > >> > > > > it > > >> > > > > > > > separate from the dataset(arma::mat) so that we don't > need > > >> to > > >> > > change > > >> > > > > how > > >> > > > > > > we > > >> > > > > > > > pass data to the agent. > > >> > > > > > > > We will create an object of class mlFrame and pass that > to > > >> the > > >> > > load > > >> > > > > > > > function. But we have to make sure that we don't end up > > >> making > > >> > > > > another > > >> > > > > > > copy > > >> > > > > > > > of the dataset here as well, might use a bit of help > here to > > >> > > create > > >> > > > > the > > >> > > > > > > > skeleton of the class. > > >> > > > > > > > > > >> > > > > > > > How will the dataframe integrate with mlpack's existing > > >> methods? > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > If we could follow the way I mentioned above we won't > need > > >> to > > >> > > change > > >> > > > > any > > >> > > > > > > > existing implementations to access or use the data. > > >> > > > > > > > > > >> > > > > > > > Let's discuss point by point, let me know what you think > > >> about > > >> > > the > > >> > > > > > > > above-mentioned way to implement it or if I need to > clear > > >> > > anything > > >> > > > > more > > >> > > > > > > > regarding this, I will address other questions soon as > we > > >> get a > > >> > > basic > > >> > > > > > > idea > > >> > > > > > > > of the project. > > >> > > > > > > > > > >> > > > > > > > Regarding the image, there is an ImageInfo class we can > > >> extend > > >> > > its > > >> > > > > > > > functionality to work on a directory of images, but I > have > > >> not > > >> > > yet > > >> > > > > > > figured > > >> > > > > > > > out if we need a way to display the methods, I mean the > info > > >> > > > > regarding > > >> > > > > > > the > > >> > > > > > > > images should be fine right? > > >> > > > > > > > > > >> > > > > > > > Also, I was thinking of adding some stats to DatasetInfo > > >> class, > > >> > > > > methods > > >> > > > > > > to > > >> > > > > > > > show the numerical summary of the dataset which can > include > > >> mean, > > >> > > > > std, > > >> > > > > > > min, > > >> > > > > > > > max, etc. These are the same methods that I suggested to > > >> > > implement in > > >> > > > > > > this > > >> > > > > > > > PR <https://github.com/mlpack/mlpack/pull/2727>. As > most > > >> of our > > >> > > > > datasets > > >> > > > > > > > are numerical for now, I think we should first implement > > >> > > > > functionality > > >> > > > > > > that > > >> > > > > > > > can be utilized by numeric datasets. Let me know what > you > > >> think. > > >> > > > > > > > > > >> > > > > > > > Thank you, > > >> > > > > > > > Gopi > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > On Fri, Mar 12, 2021 at 9:35 PM Ryan Curtin < > [email protected] > > >> > > > >> > > wrote: > > >> > > > > > > > > > >> > > > > > > > > On Wed, Mar 10, 2021 at 01:25:04AM +0530, Gopi Manohar > > >> Tatiraju > > >> > > > > wrote: > > >> > > > > > > > > > I am planning to contribute to mlapck under GSoC > 2021. > > >> > > > > Currently, I > > >> > > > > > > am > > >> > > > > > > > > > working on creating a pandas *dataframe-like class* > > >> that can > > >> > > be > > >> > > > > used > > >> > > > > > > to > > >> > > > > > > > > > analyze the datasets in a better way. > > >> > > > > > > > > > > > >> > > > > > > > > > Having a class like this would help in working with > > >> datasets > > >> > > as > > >> > > > > ml > > >> > > > > > > is not > > >> > > > > > > > > > only about the model but about data as well. > > >> > > > > > > > > > > > >> > > > > > > > > > I have a pr already open for this: > > >> > > > > > > > > > https://github.com/mlpack/mlpack/pull/2727 > > >> > > > > > > > > > > > >> > > > > > > > > > I wanted to know if I can work on this in GSoC? As > it > > >> was not > > >> > > > > listed > > >> > > > > > > on > > >> > > > > > > > > the > > >> > > > > > > > > > idea page, but I think this would be a start to > > >> something > > >> > > useful > > >> > > > > and > > >> > > > > > > big. > > >> > > > > > > > > > > >> > > > > > > > > Hey Gopi, > > >> > > > > > > > > > > >> > > > > > > > > Thanks for working on that PR. Personally I think > this > > >> kind of > > >> > > > > support > > >> > > > > > > > > would be really wonderful---right now, all non-numeric > > >> data in > > >> > > > > mlpack > > >> > > > > > > > > has to be represented via both a `data::DatasetInfo` > and > > >> an > > >> > > > > `arma::mat` > > >> > > > > > > > > (since all the internal methods operate on an > > >> `arma::mat`). > > >> > > > > > > > > > > >> > > > > > > > > The internal methods definitely can't change (they > simply > > >> > > aren't > > >> > > > > > > > > designed to work on other types of data), but we > certainly > > >> > > could > > >> > > > > > > improve > > >> > > > > > > > > the wrapper class. You are right that it probably > should > > >> look > > >> > > > > like a > > >> > > > > > > > > Pandas dataframe or similar. > > >> > > > > > > > > > > >> > > > > > > > > There is a lot of support and functionality already > in the > > >> > > > > > > > > `data::DatasetInfo` class (it's a typedef of the > > >> > > > > > > `data::DatasetMapper<>` > > >> > > > > > > > > class), and we should definitely build on top of that. > > >> > > > > > > > > > > >> > > > > > > > > If you wanted to work on this project, it would be > best to > > >> > > start > > >> > > > > with > > >> > > > > > > > > the top-level design: how will users use this > dataframe? > > >> How > > >> > > will > > >> > > > > the > > >> > > > > > > > > dataframe integrate with mlpack's existing methods? > If I > > >> had, > > >> > > > > e.g., a > > >> > > > > > > > > text dataset I wanted to use TF-IDF with, what would > that > > >> look > > >> > > > > like? > > >> > > > > > > > > What would loading a set of images look like? How > could > > >> we > > >> > > make > > >> > > > > sure > > >> > > > > > > > > that whether I had a dataframe or just an `arma::mat` > of > > >> > > numerical > > >> > > > > > > data, > > >> > > > > > > > > it "felt the same" to work with either? > > >> > > > > > > > > > > >> > > > > > > > > Those are just some of the questions that need to be > > >> answered. > > >> > > > > > > > > > > >> > > > > > > > > Take a look at this too: > > >> > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > >> > > > > > > >> > > > > >> > https://medium.com/@johan.mabille/xframe-towards-a-c-dataframe-26e1ccde211b > > >> > > > > > > > > > > >> > > > > > > > > I wonder if it would be possible to convert an xframe > > >> > > > > representation > > >> > > > > > > > > into an `arma::mat` without copying any memory at the > > >> point at > > >> > > > > which an > > >> > > > > > > > > mlpack method is called. (In fact, this even leads to > > >> another > > >> > > > > > > question: > > >> > > > > > > > > could we seamlessly support Eigen matrices by > converting > > >> an > > >> > > Eigen > > >> > > > > > > matrix > > >> > > > > > > > > to an Armadillo matrix without any conversion? Or > other > > >> matrix > > >> > > > > > > > > libraries?) > > >> > > > > > > > > > > >> > > > > > > > > Given that GSoC is a shorter period of work this year, > > >> spending > > >> > > > > lots of > > >> > > > > > > > > time implementing custom converters and things like > this > > >> is > > >> > > > > probably > > >> > > > > > > too > > >> > > > > > > > > much work---we should see how much we can wrap > existing > > >> work > > >> > > and > > >> > > > > the > > >> > > > > > > > > GSoC project should probably focus on making sure > that the > > >> > > > > interfaces > > >> > > > > > > > > all work right. > > >> > > > > > > > > > > >> > > > > > > > > I hope this is helpful! > > >> > > > > > > > > > > >> > > > > > > > > Thanks, > > >> > > > > > > > > > > >> > > > > > > > > Ryan > > >> > > > > > > > > > > >> > > > > > > > > -- > > >> > > > > > > > > Ryan Curtin | "I am." > > >> > > > > > > > > [email protected] | - Joe > > >> > > > > > > > > > > >> > > > > > > > > >> > > > > > > > _______________________________________________ > > >> > > > > > > > mlpack mailing list > > >> > > > > > > > [email protected] > > >> > > > > > > > > http://knife.lugatgt.org/cgi-bin/mailman/listinfo/mlpack > > >> > > > > > > > > >> > > > > > > > > >> > > > > > > >> > > > > > _______________________________________________ > > >> > > > > > mlpack mailing list > > >> > > > > > [email protected] > > >> > > > > > http://knife.lugatgt.org/cgi-bin/mailman/listinfo/mlpack > > >> > > > > > > >> > > > > > > >> > > > > >> > > >> > _______________________________________________ > > >> > mlpack mailing list > > >> > [email protected] > > >> > http://knife.lugatgt.org/cgi-bin/mailman/listinfo/mlpack > > >> > > >> > > >> -- > > >> Ryan Curtin | "Closets bring order to a chaotic world!" > > >> [email protected] | - Jay > > >> > > > >
_______________________________________________ mlpack mailing list [email protected] http://knife.lugatgt.org/cgi-bin/mailman/listinfo/mlpack
