The key complaint here is mainly about the clarity of the documentation itself. Maybe it is time to focus on a single, useful flavor of the API (Gluon) and center all the docs on it.
Tianqi

On Wed, Sep 19, 2018 at 11:04 AM Qing Lan <[email protected]> wrote:

> Hi all,
>
> There was a trending topic<https://www.zhihu.com/question/293996867> on
> Zhihu (a famous Chinese Stack Overflow + Quora) recently, asking about the
> status of MXNet in 2018. Mu replied to the thread and received more than
> 300 likes.
> However, a few concerns were raised in the comments of this thread. I have
> done a simple translation from Chinese to English:
>
> 1. Documentation! Until now, the online docs still contain:
>      1. Deprecated docs that were never updated
>      2. Wrong documentation with poor descriptions
>      3. Documents in alpha stage, e.g. ones where you must install with
>         `pip --pre` in order to run.
>
> 2. Examples! For Gluon specifically, many examples still mix Gluon/MXNet
> APIs. The mixture of mx.sym, mx.nd and mx.gluon leaves users confused about
> which one to choose in order to get their model to work. As an example,
> although Gluon made data encapsulation possible, there are still examples
> using mx.io.ImageRecordIter with tens of params (it feels like the Gluon
> examples are simply copies of the old Python examples).
>
> 3. Examples again! Compared to PyTorch, there are a few kinds of examples
> I don't like in Gluon:
>      1. Examples that run but whose code structure is still very
>         complicated, such as example/image-classification/cifar10.py. It
>         looks like code concatenated end to end. In fact, it is just a
>         series of layers mixed with model.fit. This makes it very hard
>         for users to modify/extend the model.
>      2. Examples that only run with certain settings. If users change the
>         model even a little, it crashes. For example, in the multi-GPU
>         example on the Gluon website, MXNet hides the logic that uses the
>         batch size to change the learning rate in the optimizer. A lot of
>         newbies don't know this and only find that the model stops
>         converging when the batch size changes.
>      3. The worst case is that the model itself simply doesn't work.
>         Maintainers in the MXNet community didn't run the model (there
>         isn't even an integration test) and merged the code directly. The
>         script stays broken until somebody raises an issue and fixes it.
>
> 4. The community problem. The core advantage of MXNet is its scalability
> and efficiency. However, the documentation of some tools is confusing.
> Here are two examples:
>
>      1. im2rec comes in 2 versions, C++ (binary) and Python. But nobody
>         would have thought that the argument parsing in these tools is
>         different (meanwhile, there are no appropriate examples to compare
>         against, so users can only guess the usage).
>
>      2. How do we combine the MXNet distributed platform with
>         supercomputing tools such as Slurm? How do we do profiling, and
>         how do we debug? A couple of companies I know considered using
>         MXNet for distributed training. Due to the lack of examples and
>         poor support from the community, they had to move their models to
>         TensorFlow and Horovod.
>
> 5. The heavy code base. Most of the MXNet examples/source
> code/documentation/language bindings are in a single repo. A git clone
> operation costs tens of MB. A new-feature PR takes longer than expected.
> The poor review response / rules keep new contributors away from the
> community. I remember there was a call for document improvement last year.
> In total, it took one user 3 months to get a change merged into master.
> That is almost equal to a PyTorch release interval.
>
> 6. To developers. Very few people in the community discuss the
> improvements we can make to MXNet to make it more user-friendly. It is so
> easy to trigger tens of stack issues during coding. Again, is it a
> requirement for MXNet users to be familiar with C++? The connection between
> Python and C lacks IDE lint support (maybe MXNet assumes every developer is
> a Vim master). The API/underlying implementation changes frequently. People
> have to release their code with an archived version of MXNet (such as
> TuSimple and MSRA). Take a look at PyTorch: even an API used to move a
> tensor to a device prompts a thorough discussion.
>
> There will be more comments translated to English and I will keep this
> thread updated…
> Thanks,
> Qing
>
