The different modes make sense. One disagreement: I think the default should be skip. I imagine few users would want to put licenses in their own images.
On Tue, Apr 28, 2020 at 5:36 PM Hannah Jiang <[email protected]> wrote: > The above 1,2,3,4 are merged to master. > Here I would like to change behaviors with *docker-pull-licenses* tag and > would like to collect feedbacks from community before finalizing my PR. > > docker-pull-licenses can be set to one of ['add', 'pull', 'skip']. > > - *docker-pull-licenses=add* pulls licenses/notices and the files are > added to docker images. This tag is used to create release docker images. > - *docker-pull-licenses=pull* pulls licenses/notices but do not add > the files to docker images so that create lightweight images. This is the > default mode, and docker images created by Jenkins test use this mode. This > will make sure the files are all pullable and detect issues with the tool, > except the ADD part at DockerFile, which is rarely likely to be an issue. > It is better than the checking URLs approach mentioned above, because it is > a more similar process like creating release images and code can be > simplified so that make it easier to maintain. > - *docker-pull-licenses=skip* skips all license pulling related tasks. > This tag is provided for users who customize their images with DockerFile > provided by us and would not like to deal with license stuff. Without this > tag, the users should change the .gradle file to get rid of it. It also can > be used for local development when developers want to solve license related > issues later. > > I would like to see this tag is adopted by all docker images released by > Beam, including Flink and Spark job server images, and share the same > default mode. > Does this sound good? Are there any concerns? > > Please let me know what you think. > Hannah > > > > On Thu, Apr 16, 2020 at 11:46 AM Hannah Jiang <[email protected]> > wrote: > >> Ok, so the fianl approach is >> 1. Using multi threading with 16 threads. >> 2. Introduce docker-pull-license tag, by default it is disabled. >> 3. By default, only check if urls return 200, not actually pull the files >> to reduce image size. Licenses add 85MB of image size as of today. >> 4. Add retries to avoid flakiness. Initially it was added to download >> part only, assuming urlib is stable when validates urls, but it is not true. >> >> For caching, it is definitely useful. It will mitigate flakiness further, >> like when github is down (many urls point to github). It also reduces >> unnecessary internet traffic. >> Let's reconsider it after the current work is merged to 2.21.0. >> >> Thank you all for providing feedback and ideas. >> >> >> On Thu, Apr 16, 2020 at 10:24 AM Kyle Weaver <[email protected]> wrote: >> >>> > I tried with multi processing and it improved the performance a lot. >>> >>> Great! Though it won't help with flakes, so as you said we should still >>> look into caching as well. >>> >>> > You could try using a threadpool >>> >>> +1 >>> >>> > However, we would like to include this work as part of 2.21.0 >>> >>> I had marked the jira as a blocker for 2.21.0 because I was afraid >>> something was broken, but now it looks like the failures were just flakes. >>> So BEAM-9764 <https://issues.apache.org/jira/browse/BEAM-9764> should >>> not be a release blocker. >>> >>> On Thu, Apr 16, 2020 at 1:00 PM Thomas Weise <[email protected]> wrote: >>> >>>> Hi Hannah, >>>> >>>> Thanks for investigating! >>>> >>>> I think it would be great to eliminate the overhead for local builds >>>> (by default turn off the license assembly) and make it as lightweight >>>> as possible in the frequent CI path. >>>> >>>> Thomas >>>> >>>> >>>> On Thu, Apr 16, 2020 at 1:37 AM Hannah Jiang <[email protected]> >>>> wrote: >>>> >>>>> I tried to check if urls are valid instead of pulling the files and it >>>>> reduced only 1 min of running time. So, it's not an option here. >>>>> >>>>> I tried with multi processing and it improved the performance a lot. >>>>> With 12 subprocesses, the running time reduced to 49 seconds, and with >>>>> 16 cores, it reduced to 18 seconds. >>>>> The number of subprocesses is defined by the number of cores, and >>>>> Jenkins machine has 16 cores. >>>>> FYI: with my local machine (12 cores) and home network, it takes 1min >>>>> 40 secs to create a Java docker image. >>>>> >>>>> The caching approach mentioned by Robert brings many benefits, not >>>>> only to this use case. >>>>> However, we would like to include this work as part of 2.21.0, so I >>>>> will move with the multi processing approach this time. >>>>> >>>>> Please let me know if you have objections. >>>>> >>>>> >>>>> On Wed, Apr 15, 2020 at 4:01 PM Robert Bradshaw <[email protected]> >>>>> wrote: >>>>> >>>>>> Is the cost primarily in pulling these remote licenses/sources? I'd >>>>>> guess that 99.9% of the URLs remain the same from run to run. Would a >>>>>> simple cache, or caching proxy, be sufficient? >>>>>> >>>>>> Otherwise, a tag to check that licenses can be pulled, but not really >>>>>> pull them, might be sufficient. (Making sure the default is cheap but we >>>>>> don't accidentally omit them when it matters is the tricky bit I see >>>>>> here.) >>>>>> >>>>>> >>>>>> On Wed, Apr 15, 2020 at 3:38 PM Hannah Jiang <[email protected]> >>>>>> wrote: >>>>>> >>>>>>> Thanks for providing feedback. >>>>>>> >>>>>>> Here is what happending now and I would discuss when to run the job. >>>>>>> >>>>>>> *Why it takes 7-8 mins for Java?* >>>>>>> When we list dependencies both in runtime and compile environment, >>>>>>> there are almost 1400 third party dependencies and we need to pull >>>>>>> licenses/notices for all of them. >>>>>>> In addition, we need to pull source code if license is CDDL, MPL, >>>>>>> GLP or LGPL. 69 of the dependencies need to pull the source code as of >>>>>>> 4/14/2020. >>>>>>> Getting dependency list + pulling licenses/notices/source code takes >>>>>>> 7-8 minutes. >>>>>>> >>>>>>> Now I see there are *two patterns of failures*. >>>>>>> 1. In valid URLs. In fact, the urls are not invalid, but >>>>>>> occationally, it returns URLError. This can be resolved by adding >>>>>>> retries. >>>>>>> However, it will add runtime to the job. >>>>>>> 2. No artifacts available. Sometimes, when a new version of package >>>>>>> is released and the plugin still looks for staging location. For >>>>>>> example, >>>>>>> new zetasql packages were released on 4/14, and today I saw several >>>>>>> failures with looking for staging repo. The behavior is not consistent, >>>>>>> sometimes it scans correct location, sometimes not. This can be >>>>>>> resolved by >>>>>>> running the job again. >>>>>>> >>>>>>> *When the job is running?* >>>>>>> generateThirdPartyLicenses is added to :sdks:java:container and it >>>>>>> is an upstream of the docker task. As such, whenever a docker is >>>>>>> created, >>>>>>> the job is triggered. >>>>>>> :sdks:java:container:docker is added to Java PreSubmit job. >>>>>>> >>>>>>> *How to improve it?* >>>>>>> According to some ideas provided above, how about doing this? >>>>>>> Introduce a tag (ie: pull-licenses) to docker job to decide if pull >>>>>>> the files. Default tag is NOT setting pull-licenses. >>>>>>> When pull-licenses is not set, it checks if the >>>>>>> licenses/notices/source code can be pull automaticall or they have urls >>>>>>> to >>>>>>> pull from, but don't really pull. >>>>>>> When pull-license is set, files are pulled. >>>>>>> >>>>>>> For each PR (Presubmit): applying default option. The test would >>>>>>> fail if the files cannot be pulled, so committers still need to fix >>>>>>> dependency errors. I believe it would reduce the running time. >>>>>>> For release: set the tag and pull the files and source code. Since >>>>>>> it is checked for each PR, pulling should finish without problems. >>>>>>> >>>>>>> Please let me know what you think and if there are other things can >>>>>>> be improved. >>>>>>> >>>>>>> Hannah >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Wed, Apr 15, 2020 at 2:30 PM Kyle Weaver <[email protected]> >>>>>>> wrote: >>>>>>> >>>>>>>> Looks like the same error as this Jira: >>>>>>>> https://issues.apache.org/jira/browse/BEAM-9764 >>>>>>>> >>>>>>>> Even if/when we are able to fix this particular issue, I agree it >>>>>>>> is best not to run this job except for releases because of the inherent >>>>>>>> network cost and possible reliability issues. +Hannah Jiang >>>>>>>> <[email protected]> What do you think? >>>>>>>> >>>>>>>> On Wed, Apr 15, 2020 at 5:20 PM Thomas Weise <[email protected]> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> The new feature to assemble licenses is very useful but appears to >>>>>>>>> add several minutes (7-8?) build time to jobs that need to build a >>>>>>>>> container. >>>>>>>>> >>>>>>>>> Does it also seem to cause occasional build failures? >>>>>>>>> >>>>>>>>> >>>>>>>>> https://builds.apache.org/job/beam_PreCommit_Python2_PVR_Flink_Phrase/131/ >>>>>>>>> >>>>>>>>> Would it be possible to perform this task only during release >>>>>>>>> builds? >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> Thomas >>>>>>>>> >>>>>>>>>
