Sounds good to me. Omar
On 03/22, Gopi Manohar Tatiraju wrote: > Hey, > > Thank you for your quick response. > > So I can see that we have boost::spirit in load_csv.cpp > > Currently we have 2 files for load_csv: > > 1. load_csv.hpp > 2. load_csv.cpp > > I will move the function implementations for .cpp file to .hpp file and > will declare them as inline. So we will only have one .hpp file for > load_csv. > > I will also check the compilation time before making these changes. > > Sounds good? > > Thanks, > Gopi > > On Mon, Mar 22, 2021 at 8:30 PM Omar Shrit <[email protected]> wrote: > > > On 03/22, Gopi Manohar Tatiraju wrote: > > > Hey Omar, > > > > > > Thank you for the reply... > > > So what I understood is to make files _impl.hpp we need to declare class > > > templates. > > > > > > But taking the example of *detect_file *we can see that we only have > > > functions declared here. One file for declaring the functions and one for > > > the implementation. > > > > > > So maybe, I need to structure these functions into classes, right? Coz if > > > we want to put functions directly into the header files we need to > > declare > > > them as inline to avoid the violation of C++'s one definition rule. > > > > inline should be fine for now. > > > > > Also, how should I compare the compilation time? > > > > Use the time command with make in order to get the compilation time > > > > > Also about the prototype and benchmark you mentioned, can you explain a > > bit > > > more about that. I really want to make this project work and I am willing > > > to put my time in learning new things. > > > > You can start by moving class implementation (those using boost spirit) to > > header files > > and recompile mlpack. You need to measure the compilation time before > > and after changing implementation into header, that will gives us an > > idea about the increase related to boost spirit > > > > > Thank you for the guidance. > > > > > > PS. Sorry, I was busy with my college project so it took some time to > > reply. > > > > > > Regards, > > > Gopi > > > > > > On Fri, Mar 19, 2021 at 12:42 AM Omar Shrit <[email protected]> wrote: > > > > > > > Hello Gopi, > > > > > > > > Sorry, I am forgot to answer you earlier (A little bit busy), thanks > > for > > > > the ping. > > > > > > > > On 03/18, Gopi Manohar Tatiraju wrote: > > > > > Hey Omar, > > > > > > > > > > Thank you soo much for such detailed replies and inputs. > > > > > > > > > > I see that there are many things that we need to get in order before > > we > > > > > start working on this idea. > > > > > I think we should start from this point: > > > > > > > > > > 1) It would be nicer to have mlpack as header-only by moving these > > > > > > implementations into header files. > > > > > > > > > > > > > > > I went through the mlpack/core/data/ directory and List of .cpp > > files in > > > > > core/data > > > > > > > > > > 1. detect_file_type.cpp > > > > > 2. load.cpp > > > > > 3. load_csv.cpp > > > > > 4. load_image.cpp > > > > > 5. save_image.cpp > > > > > > > > > > The convention we follow in mlpack is having two .hpp files, one for > > > > > declarations and one for implementations. > > > > > > > > The convention is related to classes that are defined as > > > > class-template. I do not think it is possible to convert an > > implementation > > > > file to `_impl.hpp` if the class is not a class template. > > > > > > > > Therefore, you can put the entire implementation in the header, where > > > > the class is defined. There are several disadvantages of putting > > everything > > > > in one place, for instance, increase compilation time and binary > > footprint. > > > > > > > > It would be nice to have a small prototype and a benchmark to see how > > > > much time the compilation time will increase, since we are trying to > > > > reduce these numbers. > > > > > > > > I did not look into the code yet, but I do not see anything that might > > > > prevent from having everything in the headers. > > > > > > > > Also, do not worry about details such as convention, these can be seen > > > > later especially if a prototype comes back with good results. > > > > > > > > > I just want to clarify some things so that I can start writing some > > code > > > > > regarding this: > > > > > > > > > > - Let's consider detect_file_type, we have for now: > > > > > - detect_file_type.hpp > > > > > < > > > > > > https://github.com/mlpack/mlpack/blob/master/src/mlpack/core/data/detect_file_type.hpp > > > > > > > > > > - detect_file_type.cpp > > > > > < > > > > > > https://github.com/mlpack/mlpack/blob/master/src/mlpack/core/data/detect_file_type.cpp > > > > > > > > > > > > > > > What we want in the future: > > > > > > > > > > > > > > > - detect_file_type.hpp > > > > > - detect_file_type_impl.hpp > > > > > > > > > > Can you just give me one example or reference or just a list of > > points > > > > that > > > > > we need to consider while restructuring these files? What will be the > > > > major > > > > > differences after converting them to .hpp files? > > > > > > > > > > > > > > > It feels great getting feedback on my idea and I get to learn soo > > many > > > > new > > > > > things on the way. > > > > > > > > > > Thank you, > > > > > Gopi. > > > > > > > > > > > > > > > On Sat, Mar 13, 2021 at 4:12 AM Omar Shrit <[email protected]> wrote: > > > > > > > > > > > Hello Gopi, > > > > > > > > > > > > The data frame class project is indeed a good idea, we have thought > > > > > > about that, but as Ryan said, it can be a big project for GSoC > > given > > > > > > the limited period of time this year. > > > > > > > > > > > > I have several ideas to add on what Ryan said. The objective is to > > make > > > > > > the project lighter and more fit for a GSoC. > > > > > > > > > > > > Knowing that, the data load/save part from mlpack core is the only > > part > > > > > > that > > > > > > has implementation files (.cpp) while all methods of mlpack are > > > > > > header-only, > > > > > > therefore: > > > > > > > > > > > > 1) It would be nicer to have mlpack as header-only by moving these > > > > > > implementations into header files. > > > > > > > > > > > > 2) I would avoid re-implementing things that have already > > implemented, > > > > > > especially that these parts of code (Loading, Saving, Matrix > > > > > > manipulation, and conversion) need a lot of > > > > > > optimization, which requires years of work to have something > > feasible. > > > > > > However, looking at Xtensor library seems to be similar to Pandas > > > > providing > > > > > > what is in need, in C++, with a good performance. > > > > > > > > > > > > 3) Xtensor integration can be realized by adding a mlpack wrapper > > > > > > (a small light wrapper) for Xtensor functionalities. This wrapper > > can > > > > be > > > > > > integrated into mlpack source code, or can be kept separately (as > > > > > > ensmallen) > > > > > > allowing to be added when needed, therefore only link with library > > that > > > > > > we use (avoiding dependencies). > > > > > > > > > > > > Knowing that the above steps will require more than one GSoC to > > > > > > complete, but they can be done independently. You can choose what > > you > > > > > > find the most suitable and build a proposal upon it allowing to > > have > > > > > > the most possible decoupling between the tasks in order to > > maximize the > > > > > > possible > > > > > > feasibility of the project. > > > > > > > > > > > > I hope you find this helpful ! > > > > > > > > > > > > Thanks, > > > > > > > > > > > > Omar > > > > > > > > > > > > > > > > > > On 03/13, Gopi Manohar Tatiraju wrote: > > > > > > > Hey Ryan, > > > > > > > > > > > > > > Thanks for the feedback. > > > > > > > I agree that this can be a very big project considering the time > > > > span of > > > > > > > GSoC this year, if we decide to go ahead with this project it > > will be > > > > > > very > > > > > > > important to decide on some base features as you already pointed > > out. > > > > > > > > > > > > > > how will users use this dataframe? > > > > > > > > > > > > > > > > > > > > > We should do it in the same way as we do DatasetInfo, this will > > keep > > > > it > > > > > > > separate from the dataset(arma::mat) so that we don't need to > > change > > > > how > > > > > > we > > > > > > > pass data to the agent. > > > > > > > We will create an object of class mlFrame and pass that to the > > load > > > > > > > function. But we have to make sure that we don't end up making > > > > another > > > > > > copy > > > > > > > of the dataset here as well, might use a bit of help here to > > create > > > > the > > > > > > > skeleton of the class. > > > > > > > > > > > > > > How will the dataframe integrate with mlpack's existing methods? > > > > > > > > > > > > > > > > > > > > > If we could follow the way I mentioned above we won't need to > > change > > > > any > > > > > > > existing implementations to access or use the data. > > > > > > > > > > > > > > Let's discuss point by point, let me know what you think about > > the > > > > > > > above-mentioned way to implement it or if I need to clear > > anything > > > > more > > > > > > > regarding this, I will address other questions soon as we get a > > basic > > > > > > idea > > > > > > > of the project. > > > > > > > > > > > > > > Regarding the image, there is an ImageInfo class we can extend > > its > > > > > > > functionality to work on a directory of images, but I have not > > yet > > > > > > figured > > > > > > > out if we need a way to display the methods, I mean the info > > > > regarding > > > > > > the > > > > > > > images should be fine right? > > > > > > > > > > > > > > Also, I was thinking of adding some stats to DatasetInfo class, > > > > methods > > > > > > to > > > > > > > show the numerical summary of the dataset which can include mean, > > > > std, > > > > > > min, > > > > > > > max, etc. These are the same methods that I suggested to > > implement in > > > > > > this > > > > > > > PR <https://github.com/mlpack/mlpack/pull/2727>. As most of our > > > > datasets > > > > > > > are numerical for now, I think we should first implement > > > > functionality > > > > > > that > > > > > > > can be utilized by numeric datasets. Let me know what you think. > > > > > > > > > > > > > > Thank you, > > > > > > > Gopi > > > > > > > > > > > > > > > > > > > > > On Fri, Mar 12, 2021 at 9:35 PM Ryan Curtin <[email protected]> > > wrote: > > > > > > > > > > > > > > > On Wed, Mar 10, 2021 at 01:25:04AM +0530, Gopi Manohar Tatiraju > > > > wrote: > > > > > > > > > I am planning to contribute to mlapck under GSoC 2021. > > > > Currently, I > > > > > > am > > > > > > > > > working on creating a pandas *dataframe-like class* that can > > be > > > > used > > > > > > to > > > > > > > > > analyze the datasets in a better way. > > > > > > > > > > > > > > > > > > Having a class like this would help in working with datasets > > as > > > > ml > > > > > > is not > > > > > > > > > only about the model but about data as well. > > > > > > > > > > > > > > > > > > I have a pr already open for this: > > > > > > > > > https://github.com/mlpack/mlpack/pull/2727 > > > > > > > > > > > > > > > > > > I wanted to know if I can work on this in GSoC? As it was not > > > > listed > > > > > > on > > > > > > > > the > > > > > > > > > idea page, but I think this would be a start to something > > useful > > > > and > > > > > > big. > > > > > > > > > > > > > > > > Hey Gopi, > > > > > > > > > > > > > > > > Thanks for working on that PR. Personally I think this kind of > > > > support > > > > > > > > would be really wonderful---right now, all non-numeric data in > > > > mlpack > > > > > > > > has to be represented via both a `data::DatasetInfo` and an > > > > `arma::mat` > > > > > > > > (since all the internal methods operate on an `arma::mat`). > > > > > > > > > > > > > > > > The internal methods definitely can't change (they simply > > aren't > > > > > > > > designed to work on other types of data), but we certainly > > could > > > > > > improve > > > > > > > > the wrapper class. You are right that it probably should look > > > > like a > > > > > > > > Pandas dataframe or similar. > > > > > > > > > > > > > > > > There is a lot of support and functionality already in the > > > > > > > > `data::DatasetInfo` class (it's a typedef of the > > > > > > `data::DatasetMapper<>` > > > > > > > > class), and we should definitely build on top of that. > > > > > > > > > > > > > > > > If you wanted to work on this project, it would be best to > > start > > > > with > > > > > > > > the top-level design: how will users use this dataframe? How > > will > > > > the > > > > > > > > dataframe integrate with mlpack's existing methods? If I had, > > > > e.g., a > > > > > > > > text dataset I wanted to use TF-IDF with, what would that look > > > > like? > > > > > > > > What would loading a set of images look like? How could we > > make > > > > sure > > > > > > > > that whether I had a dataframe or just an `arma::mat` of > > numerical > > > > > > data, > > > > > > > > it "felt the same" to work with either? > > > > > > > > > > > > > > > > Those are just some of the questions that need to be answered. > > > > > > > > > > > > > > > > Take a look at this too: > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://medium.com/@johan.mabille/xframe-towards-a-c-dataframe-26e1ccde211b > > > > > > > > > > > > > > > > I wonder if it would be possible to convert an xframe > > > > representation > > > > > > > > into an `arma::mat` without copying any memory at the point at > > > > which an > > > > > > > > mlpack method is called. (In fact, this even leads to another > > > > > > question: > > > > > > > > could we seamlessly support Eigen matrices by converting an > > Eigen > > > > > > matrix > > > > > > > > to an Armadillo matrix without any conversion? Or other matrix > > > > > > > > libraries?) > > > > > > > > > > > > > > > > Given that GSoC is a shorter period of work this year, spending > > > > lots of > > > > > > > > time implementing custom converters and things like this is > > > > probably > > > > > > too > > > > > > > > much work---we should see how much we can wrap existing work > > and > > > > the > > > > > > > > GSoC project should probably focus on making sure that the > > > > interfaces > > > > > > > > all work right. > > > > > > > > > > > > > > > > I hope this is helpful! > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > > > Ryan > > > > > > > > > > > > > > > > -- > > > > > > > > Ryan Curtin | "I am." > > > > > > > > [email protected] | - Joe > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > mlpack mailing list > > > > > > > [email protected] > > > > > > > http://knife.lugatgt.org/cgi-bin/mailman/listinfo/mlpack > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > mlpack mailing list > > > > > [email protected] > > > > > http://knife.lugatgt.org/cgi-bin/mailman/listinfo/mlpack > > > > > > > > > >
signature.asc
Description: PGP signature
_______________________________________________ mlpack mailing list [email protected] http://knife.lugatgt.org/cgi-bin/mailman/listinfo/mlpack
