Re: [MXNET 2.0 Wishlist] [DISCUSS] Backend choices during runtime

2019-04-12 Thread Tianqi Chen
+1.

While I like Slack, personally I don't think we should treat Slack as a
public archive: "everything that happens (also) happens in dev@".

Tianqi



On Fri, Apr 12, 2019 at 1:19 AM Marco de Abreu 
wrote:

> I'd prefer if we keep discussions on the dev-list instead of slack - feel
> free to open another thread.
>
> -Marco
>
> On Fri, Apr 12, 2019 at 02:24, Pedro Larroy wrote:
>
> > I will respond in slack, so we don't derail the original thread's
> > topic with my points.
> >
> > Looking forward to your proposal.
> >
> > On Thu, Apr 11, 2019 at 1:00 PM Junru Shao 
> > wrote:
> > >
> > > > I don't have any ideas about the following issues:
> > > >
> > > > 1) Reducing the abuse of inlined code by moving more logic to
> > > > implementation files and improving scoping, which will also speed up
> > > > compilation
> > > 2) Reduce runtime of some unit tests
> > > 3) Improve MXNet startup time
> > >
> > > Will be super interested to hear about your ideas :-)
> > >
> > >
> > > On Thu, Apr 11, 2019 at 12:52 PM Junru Shao 
> > wrote:
> > >
> > > > > We have a systematic solution that avoids the ABI headache. I am
> > > > > struggling with some errands, and will share our proposal here as
> > > > > soon as I can. This will be a very interesting topic to discuss.
> > > > > Let's work hard together and make it perfect :-)
> > > >
> > > > On Thu, Apr 11, 2019 at 12:43 PM Pedro Larroy <
> > > > pedro.larroy.li...@gmail.com> wrote:
> > > >
> > > > >> Thanks Marco for raising this issue. I think we can certainly make
> > > > >> some improvements in modularization and build. At the same time,
> > > > >> Tianqi's point of view is important to consider and on point. I see
> > > > >> a high risk of overengineering in such an endeavor.
> > > >>
> > > >> I also see increased complexity, difficulty debugging, C++ ABI
> > > >> headaches, API compatibility, crashes inside a binary module, etc.
> > > >> which I don't want to deal with as a developer or even as an MXNet
> > > >> user. Does somebody have answers to these problems?
> > > >>
> > > > >> If somebody thinks they have a good solution, by all means propose a
> > > > >> design in the wiki; I think we are all open. Personally I see
> > > > >> several other lower-hanging fruits that need our attention:
> > > > >>  * Simplifying our build logic,
> > > > >>  * CUDA selection in CMake,
> > > > >>  * Reducing the abuse of inlined code by moving more logic to
> > > > >> implementation files and improving scoping, which will also speed up
> > > > >> compilation (some units take more than 5 minutes to build and lots
> > > > >> of RAM on a top-of-the-line CPU core),
> > > > >>  * Reducing the runtime of some unit tests,
> > > > >>  * Improving MXNet startup time,
> > > > >>  * Thread safety,
> > > > >> and other improvements in our codebase that would bring immediate
> > > > >> benefits without the risks of overengineering a plugin system. I
> > > > >> also question our bandwidth for such an endeavor.
> > > >>
> > > > >> I would say, let's apply the KISS principle: let's make the project
> > > > >> fast to build, easy to work on, well documented and easy to
> > > > >> contribute to before building the next Netscape browser. Otherwise
> > > > >> we could save ourselves this exercise and switch to Rust directly.
> > > >>
> > > >> Pedro.
> > > >>
> > > >>
> > > >>
> > > >> On Mon, Apr 8, 2019 at 9:42 AM Tianqi Chen <
> tqc...@cs.washington.edu>
> > > >> wrote:
> > > >> >
> > > > >> > Just to clarify: I am not questioning the usefulness of the
> > > > >> > separation. I just want to highlight the technical challenges
> > > > >> > here, based on our past experiences.
> > > >> >
> > > > >> > Crossing DLL boundaries in C++ can create quite a lot of problems,
> > > > >> > especially when some of the dependencies use a different version
> > > > >> > of the compiler or follow static packaging, or simply because of
> > > > >> > the dynamic linking differences on Windows. These problems could
> > > > >> > make this direction less appealing compared to focusing effort on
> > > > >> > other things.
> > > >> >
> > > > >> > Technically, as a first step, it is possible to make dependency
> > > > >> > changes go through registration instead of changing the global
> > > > >> > header files, so that changing a certain component won't trigger a
> > > > >> > global recompile in CMake. This is also a required step toward
> > > > >> > some modularity.
> > > >> >
> > > > >> > For plugins, solutions that use a C ABI can be used for certain
> > > > >> > plugin modules.
> > > >> >
> > > > >> > Some of the discussion has been tied to what the interface should
> > > > >> > look like. I think we should use different threads for these and
> > > > >> > put more thought into them.
> > > >> >
> > > >> > Tianqi
> > > >> >
> > > >> >
> > > >> >
> > > >> > On Sun, Apr 7, 2019 at 4:39 PM kellen sunderland <
> > > >> > kellen.sunderl...@gmail.com> wrote:
> > > >> >
> > > >> > > I think we can make some incremental progress.  My thoughts were
> > > >> along the
> > > >> > > lines of plugins (thinking about what happ

Re: Benchmarking MXNet with different compilers and different OpenMP implementations (results)

2019-04-12 Thread Pedro Larroy
Are there any updates on this?

This is still affecting multiprocessing; some tests hang:

rces. For information on submitting this issue, please see
https://bugs.llvm.org/.
[INFO] Setting test np/mx/python random seeds, use
MXNET_TEST_SEED=2124604270 to reproduce.
Assertion failure at kmp_runtime.cpp(6479): __kmp_thread_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6479).
OMP: Hint: Please submit a bug report with this message, compile and
run commands used, and machine configuration info including native
compiler and operating system versions. Faster response will be
obtained by including all program sources. For information on
submitting this issue, please see https://bugs.llvm.org/.
Assertion failure at kmp_runtime.cpp(6479): __kmp_thread_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6479).
OMP: Hint: Please submit a bug report with this message, compile and
run commands used, and machine configuration info including native
compiler and operating system versions. Faster response will be
obtained by including all program sources. For information on
submitting this issue, please see https://bugs.llvm.org/.
^CException ignored in: >
Traceback (most recent call last):
  File "/home/piotr/mxnet_other/python/mxnet/gluon/data/dataloader.py", line 595, in __del__
    self._worker_pool.terminate()
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 567, in terminate
    self._terminate()
  File "/usr/lib/python3.6/multiprocessing/util.py", line 186, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 597, in _terminate_pool
    cls._help_stuff_finish(inqueue, task_handler, len(pool))
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 582, in _help_stuff_finish
    inqueue._rlock.acquire()
KeyboardInterrupt

Pedro.
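
To check whether more than one OpenMP runtime ends up loaded in the same
process (the situation discussed further down in this thread), something
like the following could be used. This is only a minimal sketch: it assumes
Linux, where /proc/<pid>/maps lists the mapped shared objects, and it assumes
the usual library names libgomp (GNU), libomp (LLVM) and libiomp5 (Intel).
The function names are made up for illustration and are not part of MXNet.

```python
import re

# Assumed file-name prefixes of common OpenMP runtimes (GNU, LLVM, Intel).
OPENMP_RUNTIME_PATTERN = re.compile(r"(libgomp|libomp|libiomp5)[^/\s]*\.so")


def find_openmp_runtimes(maps_text):
    """Return the sorted OpenMP runtime families named in a maps-style dump."""
    found = set()
    for line in maps_text.splitlines():
        match = OPENMP_RUNTIME_PATTERN.search(line)
        if match:
            found.add(match.group(1))
    return sorted(found)


def loaded_openmp_runtimes(pid="self"):
    """Inspect a live process; Linux only, since it reads /proc/<pid>/maps."""
    with open("/proc/{}/maps".format(pid)) as maps_file:
        return find_openmp_runtimes(maps_file.read())


# Illustrative input only; the paths below are invented for the example.
sample = (
    "7f00 r-xp 0 08:01 1 /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0\n"
    "7f01 r-xp 0 08:01 2 /opt/build/lib/libomp.so\n"
    "7f02 r-xp 0 08:01 3 /lib/x86_64-linux-gnu/libc.so.6\n"
)
runtimes = find_openmp_runtimes(sample)
if len(runtimes) > 1:
    print("more than one OpenMP runtime mapped:", runtimes)
```

Calling loaded_openmp_runtimes() after `import mxnet` in the test process
would show what is actually mapped. Finding two runtimes does not by itself
prove that this causes the hang above; it only narrows down where to look.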

On Thu, Feb 14, 2019 at 6:30 AM Tsukrov, Stanislav
 wrote:
>
> Thanks Aaron for the feedback.
>
> > As for your next steps, would you propose that cmake be brought up to 
> > parity?
> Yes. sse2 in cmake vs sse3 in make is a minor example without high impact. 
> There are others.
>
> > It seems strange that it causes slowness and if so, it shouldn't be 
> > recommended for now.
> There are some issues in the CMake files that should be fixed. Some of
> them are worked around for the benchmark.
>
> Best Regards
>
> Stas
>
> On 14.02.19, 14:09, "Anton Chernov"  wrote:
>
> Thank you, Aaron, for your interest in the topic.
>
> My main previous proposal still stands: remove the bundled OpenMP submodule
> and use the OpenMP provided by the environment [1]. This might lead to
> performance degradation in some cases where an old OpenMP library is used or
> thread affinity wasn't set properly. But that would be a problem of the
> environment, not MXNet.
>
> I described some alternative solutions in [1] as part of this [2] thread.
> Tricking the linker with symlinks in both cases should make it possible to
> avoid having multiple OpenMP implementations linked into MXNet at the same
> time. Windows questions would still be open.
>
> Best
> Anton
>
> [1] https://github.com/apache/incubator-mxnet/pull/12160
> [2] https://lists.apache.org/thread.html/007d8db15a1782e1b20896a4050b62710d4ff0908c67b94af7cb0f8b@%3Cdev.mxnet.apache.org%3E
> [3] https://lists.apache.org/thread.html/4827f0f742b6e7e070da350ea81226d059401527f3072ce8b33c1fdf@%3Cdev.mxnet.apache.org%3E
>
>
> On Tue, Feb 12, 2019 at 16:39, Aaron Markham wrote:
>
> > This is really great research. I've often wondered what the difference
> > really is, and why it has to be so complicated. It seems the answer is
> > there isn't much difference and it shouldn't be as complex.
> > As for your next steps, would you propose that cmake be brought up to
> > parity? It seems strange that it causes slowness and if so, it shouldn't
> > be recommended for now.
> > Also, testing for Windows compilers might be quite important as install
> > stats suggest a significant portion of Windows users. Wouldn't this nudge
> > the decision of what to use as a rule going forward?
> > I ran into this submodule OpenMP issue on Windows myself. How does that
> > get fixed? Do we have to repackage all of the submodules to make sure they
> > use the recommended implementation or they use what the system expects?
> >
> > Cheers,
> > Aaron
> >
> > On Tue, Feb 12, 2019, 04:37 Anton Chernov  wrote:
> >
> > > Dear MXNet community,
> > >
> > > Due to multiple problems related to OpenMP and the stale proposed change
> > > [1], we have been working on gathering performance data on the impact of
> > > using different OpenMP implementations with MXNet (great thanks to
> > > Stanislav Tsukrov for the hard work). The results can be found here [2].
> > >
> > > As a short summary of the investigation: T

Re: duplicated nnvm code

2019-04-12 Thread Pedro Larroy
I would think that if we are using nnvm from tvm we should not have
duplicated code in our repository. I think we should either use the
subrepository as a 3rdparty dependency or assimilate the code into the
codebase, as is planned with mshadow. But I guess TVM is making heavy use
of nnvm, and in this case it might make sense to reuse it across projects.
@Tianqi?

On Thu, Apr 11, 2019 at 10:16 PM Junru Shao  wrote:
>
> We should remove 3rdparty/tvm/nnvm/gradient.cc.o imo
>
> On Thu, Apr 11, 2019 at 6:44 PM Pedro Larroy 
> wrote:
>
> > Hi
> >
> > I found that src/nnvm and 3rdparty/tvm/nnvm/src/pass/ have duplicated
> > code that we are linking in:
> >
> > ./CMakeFiles/mxnet_static.dir/3rdparty/tvm/nnvm/src/pass/gradient.cc.o
> > ./CMakeFiles/mxnet_static.dir/src/nnvm/gradient.cc.o
> >
> > This can potentially cause problems when linking. The symbol that will
> > be used is left as an exercise to the readers.
> >
> > Is this intentional?  Should we address this?
> >
> > Pedro.
> >
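
A quick way to look for further candidates like the two gradient.cc.o files
quoted above is to group the object files of a build tree by file name. This
is a minimal sketch, not an existing MXNet tool, and the helper name is
hypothetical. Note the assumption: a shared file name is only a hint that
code may be duplicated. Confirming it would require comparing the defined
symbols of the objects, for example with nm.

```python
import posixpath
from collections import defaultdict


def duplicate_object_names(paths):
    """Group object file paths by file name and keep names seen more than once."""
    groups = defaultdict(list)
    for path in paths:
        groups[posixpath.basename(path)].append(path)
    return {name: sorted(found) for name, found in groups.items() if len(found) > 1}


# The first two paths are the ones quoted in this thread; the third is an
# invented entry, added only so the example contains a non-duplicate.
objects = [
    "./CMakeFiles/mxnet_static.dir/3rdparty/tvm/nnvm/src/pass/gradient.cc.o",
    "./CMakeFiles/mxnet_static.dir/src/nnvm/gradient.cc.o",
    "./CMakeFiles/mxnet_static.dir/src/example/other.cc.o",
]

for name, found in duplicate_object_names(objects).items():
    print(name)
    for path in found:
        print("   ", path)
```

In a real build directory the list of paths could come from walking the tree
for files ending in .o instead of being written out by hand.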