Re: [DISCUSS] Current Publish problems

2019-01-28 Thread Hagay Lupesko
Good stuff Zach and Qing!
Great feedback Edison - would be great if you leave it in the wiki, so that
it is saved in the context of the doc with other feedback, such as Kellen's.

On Mon, Jan 28, 2019 at 1:58 AM edisongust...@gmail.com <
edisongust...@gmail.com> wrote:

> Hello all,
>
> First let me introduce myself:
>
> My name is Edison Gustavo Muenz. I have worked most of my career with C++,
> Windows and Linux. I am a big fan of machine learning and now I joined
> Amazon in Berlin to work on MXNet.
>
> I would like to give some comments on the document posted:
>
> # change publish OS (Severe)
>
> As a rule of thumb, when providing your own binaries on Linux, we should
> always try to compile against the oldest glibc possible. Using CentOS 7 for
> this (if the CUDA issues allow it) is the way to go.
>
> # Using CentOS 7
>
> > However, all of the current GPU build scripts would be unavailable since
> > NVIDIA does not provide the corresponding rpm packages. In this case, we
> > may need to go with NVIDIA Docker for CentOS 7, which only provides a
> > limited set of CUDA versions.
>
> > List of CUDA versions that NVIDIA supports for CentOS 7:
> > CUDA 10, 9.2, 9.1, 9.0, 8.0, 7.5
>
> From what I saw in the link provided (
> https://hub.docker.com/r/nvidia/cuda/), this list of versions is even
> bigger than the list of versions supported on Ubuntu 16.04.
>
> What am I missing?
>
> > Another problem we may see is the performance and stability difference
> on the backend we built since we downgrade libc from 2.19 to 2.17
>
> I would like to first give a brief intro so that we're all on the same
> page. If you already know how libc versioning works, you can skip this
> part.
>
> ## Brief intro on how libc versioning works
>
> Each symbol provided by libc has two components:
> - symbol name
> - version
>
> This can be seen with:
>
> ```
> $ objdump -T /lib/x86_64-linux-gnu/libc.so.6 | grep memcpy
> 000bd4a0  w   DF .text  0009  GLIBC_2.2.5 wmemcpy
> 001332f0 g    DF .text  0019  GLIBC_2.4   __wmemcpy_chk
> 0009f0e0 g   iD  .text  00ca  GLIBC_2.14  memcpy
> 000bb460 g    DF .text  0028 (GLIBC_2.2.5) memcpy
> 001318a0 g   iD  .text  00ca  GLIBC_2.3.4 __memcpy_chk
> ```
>
> As you can see, each version of memcpy lives at a different memory
> address.
>
> When linking a binary, the linker will always choose the most recent
> version of the libc symbol.
>
> An example:
> - your program uses the `memcpy` symbol
> - when linking, the linker will choose `memcpy` at version 2.14
> (latest)
>
> When the binary is executed, the libc provided on your system must have a
> memcpy at version 2.14; otherwise you get an error like the following:
>
> /lib/x86_64-linux-gnu/libm.so.6: version `GLIBC_2.23' not found
> (required by /tmp/mxnet6145590735071079280/libmxnet.so)
>
> Also, a symbol's version is only bumped when there is a breaking change,
> i.e. when any of its inputs/outputs changed in an incompatible way (e.g.
> changing the type of a field to an incompatible type, like int -> short).
>
> ## Performance difference between versions 2.17 and 2.19
>
> This website is really handy for this:
> https://abi-laboratory.pro/?view=timeline=glibc
>
> If we look at the links:
>
> -
> https://abi-laboratory.pro/index.php?view=objects_report=glibc=2.18=2.19
> -
> https://abi-laboratory.pro/index.php?view=objects_report=glibc=2.17=2.18
>
> You can see that binary compatibility between these versions is fine, and
> no significant changes were made that could compromise performance.
>
> Finally, I want to thank everyone for letting me be part of this community.
>
> On 2019/01/23 21:48:48, kellen sunderland 
> wrote:
> > Hey Qing, thanks for the summary and to everyone for automating the
> > deployment process.  I've left a few comments on the doc.
> >
> > On Wed, Jan 23, 2019 at 11:46 AM Qing Lan  wrote:
> >
> > > Hi all,
> > >
> > > Recently Zach announced the availability for MXNet Maven publishing
> > > pipeline and general static-build instructions. In order to make it
> better,
> > > I drafted a document that includes the problems we have for this
> pipeline:
> > >
> https://cwiki.apache.org/confluence/display/MXNET/Outstanding+problems+with+publishing
> .
> > > Some of them may need to be addressed very soon.
> > >
> > > Please kindly review and leave any comments you may have in this
> thread or
> > > in the document.
> > >
> > > thanks,
> > > Qing
> > >
> > >
> >
>


website, docs & related features updates

2019-01-28 Thread Aaron Markham
Hello dev folks! I thought I'd let you all know some of the recent
updates related to the website, docs & related features.

MXNet CI and website build pipeline

We had a few blips last week where GitHub was unreachable. The red
block in the trend graph [1] is from the previous week’s subprocess
issue. I see that the general MKL-DNN and subprocess problem is still
not resolved [2]. Please let me know if there's something that should
be clarified in the instructions, because the error messaging is really
not great, and someone else running into this issue might waste a lot
of time troubleshooting it.


MXNet Website

* Error handling PR merged [3] – the redirects have been updated and I
reimplemented the error pages. They now provide a real 404 and I
imagine we’ll start getting broken link reports now. (So, I’ll be
looking at that this week, plus I will track/scope out and possibly
implement a versions dropdown or cross-versions search for these
pages.)
* C++ API docs improvement [4] – some functions and features were not
getting documented because op.h is only generated when you build MXNet
with the CPP flag. This has now been added to the docs build, so you
will be able to see the docs for these. Note that if you run a docs
build, you’ll see a ton of warnings [5] coming from this file’s Doxygen
processing. This needs some attention; maybe someone who knows this
part of the codebase can take a look?


MXNet Beta Website

I’m scoping out some of the remaining work to accelerate the
production launch for the beta website [6]. I’ll be looking at this
more this week and would appreciate feedback and suggestions.


MXNet Features

Looks like the lipreading example [7] from @seujung is ready. If you
haven't tried it out yet, take a look. I put the preprocessed training
set on S3 to save you time. (aws s3 sync s3://mxnet-public/lipnet/). I
think a blog post about it would be a nice follow-up to having this
merged. What would be really cool is another example using soccer
footage (ref and/or player exchanges) or other such things with
amusing or interesting outputs. Making it turnkey for people writing
accessibility apps would be great.


[1] Trend graph for website publishing job:
http://jenkins.mxnet-ci.amazon-ml.com/job/restricted-website-build/buildTimeTrend
[2] MKL-DNN subprocess issue:
https://github.com/apache/incubator-mxnet/issues/13875
[3] Error handling PR: https://github.com/apache/incubator-mxnet/pull/13963
[4] C++ API docs improvement:
https://github.com/apache/incubator-mxnet/pull/13983
[5] Comment about op.h errors:
https://github.com/apache/incubator-mxnet/issues/13920#issuecomment-455653127
[6] Beta website: http://beta.mxnet.io
[7] Lipreading example: https://github.com/apache/incubator-mxnet/pull/13647


Re: [DISCUSS] Current Publish problems

2019-01-28 Thread edisongustavo
Hello all,

First let me introduce myself:

My name is Edison Gustavo Muenz. I have worked most of my career with C++, 
Windows and Linux. I am a big fan of machine learning and now I joined Amazon 
in Berlin to work on MXNet.

I would like to give some comments on the document posted:

# change publish OS (Severe)

As a rule of thumb, when providing your own binaries on Linux, we should always
try to compile against the oldest glibc possible. Using CentOS 7 for this (if
the CUDA issues allow it) is the way to go.

# Using CentOS 7

> However, all of the current GPU build scripts would be unavailable since
> NVIDIA does not provide the corresponding rpm packages. In this case, we may
> need to go with NVIDIA Docker for CentOS 7, which only provides a limited
> set of CUDA versions.

> List of CUDA versions that NVIDIA supports for CentOS 7:
> CUDA 10, 9.2, 9.1, 9.0, 8.0, 7.5

From what I saw in the link provided (https://hub.docker.com/r/nvidia/cuda/),
this list of versions is even bigger than the list of versions supported on
Ubuntu 16.04.

What am I missing?

> Another problem we may see is the performance and stability difference on the 
> backend we built since we downgrade libc from 2.19 to 2.17

I would like to first give a brief intro so that we're all on the same page. If
you already know how libc versioning works, you can skip this part.

## Brief intro on how libc versioning works

Each symbol provided by libc has two components:
- symbol name
- version

This can be seen with:

```
$ objdump -T /lib/x86_64-linux-gnu/libc.so.6 | grep memcpy
000bd4a0  w   DF .text  0009  GLIBC_2.2.5 wmemcpy
001332f0 g    DF .text  0019  GLIBC_2.4   __wmemcpy_chk
0009f0e0 g   iD  .text  00ca  GLIBC_2.14  memcpy
000bb460 g    DF .text  0028 (GLIBC_2.2.5) memcpy
001318a0 g   iD  .text  00ca  GLIBC_2.3.4 __memcpy_chk
```

As you can see, each version of memcpy lives at a different memory address.

When linking a binary, the linker will always choose the most recent version of 
the libc symbol.

An example:
- your program uses the `memcpy` symbol
- when linking, the linker will choose `memcpy` at version 2.14 (latest)

When the binary is executed, the libc provided on your system must have a
memcpy at version 2.14; otherwise you get an error like the following:

/lib/x86_64-linux-gnu/libm.so.6: version `GLIBC_2.23' not found (required by
/tmp/mxnet6145590735071079280/libmxnet.so)

Also, a symbol's version is only bumped when there is a breaking change, i.e.
when any of its inputs/outputs changed in an incompatible way (e.g. changing
the type of a field to an incompatible type, like int -> short).
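
To put this into practice, the minimum glibc that a prebuilt libmxnet.so will
demand at runtime can be found by scanning its dynamic symbols for the highest
GLIBC_x.y version referenced. A small sketch (the objdump lines below are taken
from the sample listing above; in practice you would feed in the real output of
`objdump -T libmxnet.so`):

```python
import re

# Sample `objdump -T` output (from the listing above). In practice,
# substitute the real output for the library you plan to ship.
objdump_output = """
000bd4a0  w   DF .text  0009  GLIBC_2.2.5 wmemcpy
0009f0e0 g   iD  .text  00ca  GLIBC_2.14  memcpy
001318a0 g   iD  .text  00ca  GLIBC_2.3.4 __memcpy_chk
"""

def max_glibc_version(text):
    """Return the highest GLIBC_x.y(.z) version referenced in the text."""
    versions = re.findall(r"GLIBC_(\d+(?:\.\d+)+)", text)
    # Compare versions numerically component by component, not as strings.
    return max(versions, key=lambda v: tuple(map(int, v.split("."))))

print(max_glibc_version(objdump_output))  # -> 2.14
```

Any system whose glibc is at least that version can run the binary, which is
why building against the oldest practical glibc maximizes the set of systems
the published artifacts can run on.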

## Performance difference between versions 2.17 and 2.19

This website is really handy for this: 
https://abi-laboratory.pro/?view=timeline=glibc

If we look at the links:

- 
https://abi-laboratory.pro/index.php?view=objects_report=glibc=2.18=2.19
- 
https://abi-laboratory.pro/index.php?view=objects_report=glibc=2.17=2.18

You can see that binary compatibility between these versions is fine, and no
significant changes were made that could compromise performance.
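
Conceptually, those ABI reports diff the exported (symbol, version) sets of the
two glibc builds. A toy illustration of that check (the symbol sets here are
invented for the example, not real glibc exports):

```python
# Toy ABI diff: compare the exported (symbol, version) pairs of two
# library builds. Anything present in `old` but missing from `new`
# would break binaries that were linked against the old build.
old = {("memcpy", "GLIBC_2.2.5"), ("memcpy", "GLIBC_2.14"),
       ("gets", "GLIBC_2.2.5")}
new = {("memcpy", "GLIBC_2.2.5"), ("memcpy", "GLIBC_2.14"),
       ("getrandom", "GLIBC_2.25")}

removed = sorted(old - new)  # breaks backward compatibility
added = sorted(new - old)    # harmless: old binaries never reference these

print("removed:", removed)
print("added:", added)
```

If `removed` is empty between two versions, binaries built against the older
glibc keep working on the newer one, which is what the reports above show for
2.17 through 2.19.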

Finally, I want to thank everyone for letting me be part of this community.

On 2019/01/23 21:48:48, kellen sunderland  wrote: 
> Hey Qing, thanks for the summary and to everyone for automating the
> deployment process.  I've left a few comments on the doc.
> 
> On Wed, Jan 23, 2019 at 11:46 AM Qing Lan  wrote:
> 
> > Hi all,
> >
> > Recently Zach announced the availability for MXNet Maven publishing
> > pipeline and general static-build instructions. In order to make it better,
> > I drafted a document that includes the problems we have for this pipeline:
> > https://cwiki.apache.org/confluence/display/MXNET/Outstanding+problems+with+publishing.
> > Some of them may need to be addressed very soon.
> >
> > Please kindly review and leave any comments you may have in this thread or
> > in the document.
> >
> > thanks,
> > Qing
> >
> >
>