The different modes make sense. One disagreement: I think the default
should be skip. I imagine few users would want to put licenses in their own
images.

On Tue, Apr 28, 2020 at 5:36 PM Hannah Jiang <[email protected]> wrote:

> The above 1,2,3,4 are merged to master.
> Here I would like to change behaviors with *docker-pull-licenses* tag and
> would like to collect feedbacks from community before finalizing my PR.
>
> docker-pull-licenses can be set to one of ['add', 'pull', 'skip'].
>
>    - *docker-pull-licenses=add* pulls licenses/notices and the files are
>    added to docker images. This tag is used to create release docker images.
>    - *docker-pull-licenses=pull* pulls licenses/notices but do not add
>    the files to docker images so that create lightweight images. This is the
>    default mode, and docker images created by Jenkins test use this mode. This
>    will make sure the files are all pullable and detect issues with the tool,
>    except the ADD part at DockerFile, which is rarely likely to be an issue.
>    It is better than the checking URLs approach mentioned above, because it is
>    a more similar process like creating release images and code can be
>    simplified so that make it easier to maintain.
>    - *docker-pull-licenses=skip* skips all license pulling related tasks.
>    This tag is provided for users who customize their images with DockerFile
>    provided by us and would not like to deal with license stuff. Without this
>    tag, the users should change the .gradle file to get rid of it. It also can
>    be used for local development when developers want to solve license related
>    issues later.
>
> I would like to see this tag is adopted by all docker images released by
> Beam, including Flink and Spark job server images, and share the same
> default mode.
> Does this sound good? Are there any concerns?
>
> Please let me know what you think.
> Hannah
>
>
>
> On Thu, Apr 16, 2020 at 11:46 AM Hannah Jiang <[email protected]>
> wrote:
>
>> Ok, so the fianl approach is
>> 1. Using multi threading with 16 threads.
>> 2. Introduce docker-pull-license tag, by default it is disabled.
>> 3. By default, only check if urls return 200, not actually pull the files
>> to reduce image size. Licenses add 85MB of image size as of today.
>> 4. Add retries to avoid flakiness. Initially it was added to download
>> part only, assuming urlib is stable when validates urls, but it is not true.
>>
>> For caching, it is definitely useful. It will mitigate flakiness further,
>> like when github is down (many urls point to github). It also reduces
>> unnecessary internet traffic.
>> Let's reconsider it after the current work is merged to 2.21.0.
>>
>> Thank you all for providing feedback and ideas.
>>
>>
>> On Thu, Apr 16, 2020 at 10:24 AM Kyle Weaver <[email protected]> wrote:
>>
>>> > I tried with multi processing and it improved the performance a lot.
>>>
>>> Great! Though it won't help with flakes, so as you said we should still
>>> look into caching as well.
>>>
>>> > You could try using a threadpool
>>>
>>> +1
>>>
>>> > However, we would like to include this work as part of 2.21.0
>>>
>>> I had marked the jira as a blocker for 2.21.0 because I was afraid
>>> something was broken, but now it looks like the failures were just flakes.
>>> So BEAM-9764 <https://issues.apache.org/jira/browse/BEAM-9764> should
>>> not be a release blocker.
>>>
>>> On Thu, Apr 16, 2020 at 1:00 PM Thomas Weise <[email protected]> wrote:
>>>
>>>> Hi Hannah,
>>>>
>>>> Thanks for investigating!
>>>>
>>>> I think it would be great to eliminate the overhead for local builds
>>>> (by default turn off the license assembly) and make it as lightweight
>>>> as possible in the frequent CI path.
>>>>
>>>> Thomas
>>>>
>>>>
>>>> On Thu, Apr 16, 2020 at 1:37 AM Hannah Jiang <[email protected]>
>>>> wrote:
>>>>
>>>>> I tried to check if urls are valid instead of pulling the files and it
>>>>> reduced only 1 min of running time. So, it's not an option here.
>>>>>
>>>>> I tried with multi processing and it improved the performance a lot.
>>>>> With 12 subprocesses, the running time reduced to 49 seconds, and with
>>>>> 16 cores, it reduced to 18 seconds.
>>>>> The number of subprocesses is defined by the number of cores, and
>>>>> Jenkins machine has 16 cores.
>>>>> FYI: with my local machine (12 cores) and home network, it takes 1min
>>>>> 40 secs to create a Java docker image.
>>>>>
>>>>> The caching approach mentioned by Robert brings many benefits, not
>>>>> only to this use case.
>>>>> However, we would like to include this work as part of 2.21.0, so I
>>>>> will move with the multi processing approach this time.
>>>>>
>>>>> Please let me know if you have objections.
>>>>>
>>>>>
>>>>> On Wed, Apr 15, 2020 at 4:01 PM Robert Bradshaw <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Is the cost primarily in pulling these remote licenses/sources? I'd
>>>>>> guess that 99.9% of the URLs remain the same from run to run. Would a
>>>>>> simple cache, or caching proxy, be sufficient?
>>>>>>
>>>>>> Otherwise, a tag to check that licenses can be pulled, but not really
>>>>>> pull them, might be sufficient. (Making sure the default is cheap but we
>>>>>> don't accidentally omit them when it matters is the tricky bit I see 
>>>>>> here.)
>>>>>>
>>>>>>
>>>>>> On Wed, Apr 15, 2020 at 3:38 PM Hannah Jiang <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks for providing feedback.
>>>>>>>
>>>>>>> Here is what happending now and I would discuss when to run the job.
>>>>>>>
>>>>>>> *Why it takes 7-8 mins for Java?*
>>>>>>> When we list dependencies both in runtime and compile environment,
>>>>>>> there are almost 1400 third party dependencies and we need to pull
>>>>>>> licenses/notices for all of them.
>>>>>>> In addition, we need to pull source code if license is CDDL, MPL,
>>>>>>> GLP or LGPL. 69 of the dependencies need to pull the source code as of
>>>>>>> 4/14/2020.
>>>>>>> Getting dependency list + pulling licenses/notices/source code takes
>>>>>>> 7-8 minutes.
>>>>>>>
>>>>>>> Now I see there are *two patterns of failures*.
>>>>>>> 1. In valid URLs. In fact, the urls are not invalid, but
>>>>>>> occationally, it returns URLError. This can be resolved by adding 
>>>>>>> retries.
>>>>>>> However, it will add runtime to the job.
>>>>>>> 2. No artifacts available. Sometimes, when a new version of package
>>>>>>> is released  and the plugin still looks for staging location. For 
>>>>>>> example,
>>>>>>> new zetasql packages were released on 4/14, and today I saw several
>>>>>>> failures with looking for staging repo. The behavior is not consistent,
>>>>>>> sometimes it scans correct location, sometimes not. This can be 
>>>>>>> resolved by
>>>>>>> running the job again.
>>>>>>>
>>>>>>> *When the job is running?*
>>>>>>> generateThirdPartyLicenses is added to :sdks:java:container and it
>>>>>>> is an upstream of the docker task. As such, whenever a docker is 
>>>>>>> created,
>>>>>>> the job is triggered.
>>>>>>> :sdks:java:container:docker is added to Java PreSubmit job.
>>>>>>>
>>>>>>> *How to improve it?*
>>>>>>> According to some ideas provided above, how about doing this?
>>>>>>> Introduce a tag (ie: pull-licenses) to docker job to decide if pull
>>>>>>> the files. Default tag is NOT setting pull-licenses.
>>>>>>> When pull-licenses is not set, it checks if the
>>>>>>> licenses/notices/source code can be pull automaticall or they have urls 
>>>>>>> to
>>>>>>> pull from, but don't really pull.
>>>>>>> When pull-license is set, files are pulled.
>>>>>>>
>>>>>>> For each PR (Presubmit): applying default option. The test would
>>>>>>> fail if the files cannot be pulled, so committers still need to fix
>>>>>>> dependency errors. I believe it would reduce the running time.
>>>>>>> For release: set the tag and pull the files and source code. Since
>>>>>>> it is checked for each PR, pulling should finish without problems.
>>>>>>>
>>>>>>> Please let me know what you think and if there are other things can
>>>>>>> be improved.
>>>>>>>
>>>>>>> Hannah
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Apr 15, 2020 at 2:30 PM Kyle Weaver <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Looks like the same error as this Jira:
>>>>>>>> https://issues.apache.org/jira/browse/BEAM-9764
>>>>>>>>
>>>>>>>> Even if/when we are able to fix this particular issue, I agree it
>>>>>>>> is best not to run this job except for releases because of the inherent
>>>>>>>> network cost and possible reliability issues. +Hannah Jiang
>>>>>>>> <[email protected]> What do you think?
>>>>>>>>
>>>>>>>> On Wed, Apr 15, 2020 at 5:20 PM Thomas Weise <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> The new feature to assemble licenses is very useful but appears to
>>>>>>>>> add several minutes (7-8?)  build time to jobs that need to build a
>>>>>>>>> container.
>>>>>>>>>
>>>>>>>>> Does it also seem to cause occasional build failures?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> https://builds.apache.org/job/beam_PreCommit_Python2_PVR_Flink_Phrase/131/
>>>>>>>>>
>>>>>>>>> Would it be possible to perform this task only during release
>>>>>>>>> builds?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Thomas
>>>>>>>>>
>>>>>>>>>

Reply via email to