Hi Ryan,

Thanks for your feedback. Sorry for the late reply; I was thinking about how I would approach the project and about various other details.
I apologize in advance if this mail gets too long.

Brief Overview:

The overall interface will be the same as the one you described in issue thread #2421 <https://github.com/mlpack/mlpack/issues/2421>. It is best explained with an example, so let me take the "RandomForest" class. In C++ we have defined various methods on it, such as "Train", "Predict", etc. For the bindings, instead of a single function in one "_main.cpp" file, we can define multiple functions in separate files, each performing a separate task. The directory structure could look something like this:

random_forest/
------random_forest.hpp
------random_forest_impl.hpp
------bindings/
------------random_forest_train.cpp /* has the train function for bindings */
------------random_forest_predict.cpp /* has the predict function for bindings */

Each file in the "bindings/" directory contains a separate function that can be wrapped inside a method of a class/struct in the target programming language. So, for Python, we can have a "RandomForestPy" class with methods like "train" and "predict" that internally call these functions. After surveying the languages that mlpack has bindings for, I think this kind of interface can be supported using either structs (in Go, Julia) or classes (in Python, R).

I have also thought about the questions you raised in the issue description. For the sake of clarity, I will use the following terms:

"mlpack_methods": the methods implemented in mlpack (e.g. LinearRegression, RandomForest, etc.).
"member_methods": the methods belonging to a class/struct (e.g. Train(), Predict(), etc.).
"functionality": the function corresponding to each member_method (the functions defined in random_forest_train.cpp, random_forest_predict.cpp, etc.).

So, each functionality would be wrapped inside a member_method defined in a class/struct in the target programming language.
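To make the wrapping concrete, here is a minimal Python sketch of how "RandomForestPy" could delegate to the per-file functionalities. Everything in it is an assumption for illustration: the two module-level functions are pure-Python placeholders for whatever the compiled bindings of random_forest_train.cpp and random_forest_predict.cpp would export, and their signatures and the toy "model" are hypothetical, not mlpack's actual API.

```python
from collections import Counter
from typing import Any, List, Sequence


def random_forest_train(data: Sequence[Sequence[float]],
                        labels: Sequence[int]) -> Any:
    """Placeholder for the functionality in bindings/random_forest_train.cpp.

    Hypothetical: a real binding would call into C++ and return a handle to
    a trained model. This stand-in only remembers the most common label.
    """
    if len(data) != len(labels):
        raise ValueError("data and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("cannot train on an empty data set")
    return {"majority_label": Counter(labels).most_common(1)[0][0]}


def random_forest_predict(model: Any,
                          data: Sequence[Sequence[float]]) -> List[int]:
    """Placeholder for the functionality in bindings/random_forest_predict.cpp."""
    return [model["majority_label"] for _ in data]


class RandomForestPy:
    """Wrapper class: each member_method calls exactly one functionality."""

    def __init__(self) -> None:
        self._model = None  # opaque model handle, set by train()

    def train(self, data, labels) -> "RandomForestPy":
        self._model = random_forest_train(data, labels)
        return self

    def predict(self, data) -> List[int]:
        if self._model is None:
            raise RuntimeError("call train() before predict()")
        return random_forest_predict(self._model, data)


if __name__ == "__main__":
    rf = RandomForestPy().train([[0.0], [1.0], [2.0]], [1, 1, 0])
    print(rf.predict([[5.0], [6.0]]))
```

The point of the sketch is only the shape: the class keeps the model handle, and each member_method is a thin call into a single functionality, so the same functions could be wrapped by a struct in Go or Julia.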
Q1) "Does it make sense to revamp the mlpack bindings into separate bindings for model training and model prediction?"

In answer to this question, I have prepared a list of advantages that this interface might provide:

1) It will make the current interface less rigid while keeping it fundamentally sound.
2) It will make the bindings more comfortable to use and give the user access to more of each mlpack_method.
3) It will make mlpack compatible with other popular libraries. I am not fully familiar with the other languages, but for Python we can make the mlpack_methods compatible with scikit-learn (similar to what the "catboost" and "xgboost" libraries have done). This may not be possible over the summer due to limited time, but it can be a future plan.
4) It will make it easier for contributors working on the bindings of the same mlpack_method to collaborate, as each function would live in a separate file.

Q2) "If so, what restrictions can we place on the bindings so that they fit this format? i.e. only one output parameter?"

I could not completely understand what you meant here by "one output parameter", but I have a plan to formalize this idea. We can sort the mlpack_methods into categories. Each category will have a set of basic functionalities that should be provided to the bindings through the member_methods. We could use the following categories (these are picked from the mlpack docs page, and we can edit the list as needed; maybe you can suggest some changes?):

1) Transformations
2) Regression
3) Classification
4) Clustering
5) Preprocessing
6) Geometry
7) Others

For the "Regression" category we can have some basic member_methods such as "fit", "predict", "score", "get_params". For the "Classification" category we can have "fit", "predict", "predict_proba", "score", "get_params".
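One way to picture the per-category member_methods is as an abstract interface per category. The following is a rough Python sketch under stated assumptions: the class names "RegressionMethod" and "ClassificationMethod", the type hints, and the toy "MeanRegressor" are all hypothetical and exist only for illustration; only the member_method names come from the lists above.

```python
import math
from abc import ABC, abstractmethod
from typing import Dict, List, Sequence


class RegressionMethod(ABC):
    """Hypothetical 'Regression' category: fit, predict, score, get_params."""

    @abstractmethod
    def fit(self, data: Sequence[float], responses: Sequence[float]) -> None: ...

    @abstractmethod
    def predict(self, data: Sequence[float]) -> List[float]: ...

    @abstractmethod
    def get_params(self) -> Dict[str, float]: ...

    def score(self, data: Sequence[float], responses: Sequence[float]) -> float:
        # Shared default: root mean squared error of the predictions.
        predictions = self.predict(data)
        squared = [(p - r) ** 2 for p, r in zip(predictions, responses)]
        return math.sqrt(sum(squared) / len(squared))


class ClassificationMethod(ABC):
    """Hypothetical 'Classification' category; adds predict_proba."""

    @abstractmethod
    def fit(self, data, labels) -> None: ...

    @abstractmethod
    def predict(self, data) -> List[int]: ...

    @abstractmethod
    def predict_proba(self, data) -> List[List[float]]: ...

    @abstractmethod
    def score(self, data, labels) -> float: ...

    @abstractmethod
    def get_params(self) -> Dict[str, float]: ...


class MeanRegressor(RegressionMethod):
    """Toy stand-in, not an mlpack_method: always predicts the training mean."""

    def __init__(self) -> None:
        self._mean = 0.0

    def fit(self, data, responses) -> None:
        self._mean = sum(responses) / len(responses)

    def predict(self, data) -> List[float]:
        return [self._mean for _ in data]

    def get_params(self) -> Dict[str, float]:
        return {"mean": self._mean}


if __name__ == "__main__":
    model = MeanRegressor()
    model.fit([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
    print(model.get_params(), model.score([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

With this layout, a binding that forgets one of its category's basic member_methods cannot be instantiated, which is one possible way to enforce the per-category contract.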
I am still working on an exhaustive list of the basic member_methods that we should provide to the user and that are common to all mlpack_methods in the same category (it would be great to have some suggestions here).

Once these basic member_methods have been provided, each mlpack_method might have some special/unique functionality that we would like to expose. For example, in AdaBoost we might want to give the user the weights corresponding to each weak learner. For that, we can add a member_method called "weights" to the existing basic member_methods and create a corresponding functionality that is called through it. In this way, we can make mlpack more accessible and get the most out of every mlpack_method while keeping the process automatic.

Q3) "What do we do with bindings that don't fit that format? Do we need a couple more abstractions? e.g., NMF doesn't fit cleanly into train/predict ... it's just a transformation."

I think this is answered in the previous part: the issue can be tackled by categorizing each mlpack_method.

Q4) "Is there a way that we could manage to avoid the multiple loading cost issue for the command-line bindings by somehow "combining" bindings that are marked as "grouped" in the CMake configuration or something?"

Classes/structs cannot be accessed from the command line, so the best option there may be to combine the functionalities into a single function that can be used from the command line (as in the current implementation).

Q5) "If we could "group" bindings together in CMake, could this then be used to generate, e.g., a Python class for each set of bindings? So we could actually have a RandomForest object that isn't just an opaque pointer but actually has functions that return something?"

Although the class/struct that we provide will still be a wrapper, it will have functions that return useful values. For example, the "score" member_method can return the RMSE for regression and the F1 score for classification.

Q6) "...how would we restructure our binding documentation?"

This might be the biggest task, because all the examples would also have to be changed. To keep the documentation up to date with the most recent interface, we can update it alongside the code instead of leaving the task for later (this would require help from the community, as it would be difficult for a single person to go over all the documentation in all languages).

I hope that I was able to convey my ideas clearly and provide satisfactory answers. Though I have been contributing to mlpack for a while now, there may still be things that I do not understand, so please correct me if I got anything wrong.

I am still working on an exhaustive list of basic member_methods for each category. After that, I will create bindings for a single mlpack_method as a proof of concept.

Please let me know what you think about this. Feedback from everyone is welcome.

Regards,
Nippun Sharma
Github: NippunSharma
