Can this be solved by enabling the license magic for the Jenkins jobs?

I also think the default should be off for better local development
experience.

On Tue, Apr 28, 2020 at 3:05 PM Hannah Jiang <[email protected]> wrote:

> We need to make sure the release images contains licenses/notices in order
> to avoid potential legal issues. The purpose of setting the default to pull
> is checking PRs which introduce new dependencies include thier licenses as
> well, in a way of auto pulling or tool pulling, to make sure all licenses
> are included when we create release images. If setting default to skip, we
> only check licenses when we create release images and may see many issues
> with the licenses, and the release would be delayed, maybe a lot.
>
>
> On Tue, Apr 28, 2020 at 2:50 PM Kyle Weaver <[email protected]> wrote:
>
>> The different modes make sense. One disagreement: I think the default
>> should be skip. I imagine few users would want to put licenses in their own
>> images.
>>
>> On Tue, Apr 28, 2020 at 5:36 PM Hannah Jiang <[email protected]>
>> wrote:
>>
>>> The above 1,2,3,4 are merged to master.
>>> Here I would like to change behaviors with *docker-pull-licenses* tag
>>> and would like to collect feedbacks from community before finalizing my PR.
>>>
>>> docker-pull-licenses can be set to one of ['add', 'pull', 'skip'].
>>>
>>>    - *docker-pull-licenses=add* pulls licenses/notices and the files
>>>    are added to docker images. This tag is used to create release docker
>>>    images.
>>>    - *docker-pull-licenses=pull* pulls licenses/notices but do not add
>>>    the files to docker images so that create lightweight images. This is the
>>>    default mode, and docker images created by Jenkins test use this mode. 
>>> This
>>>    will make sure the files are all pullable and detect issues with the 
>>> tool,
>>>    except the ADD part at DockerFile, which is rarely likely to be an issue.
>>>    It is better than the checking URLs approach mentioned above, because it 
>>> is
>>>    a more similar process like creating release images and code can be
>>>    simplified so that make it easier to maintain.
>>>    - *docker-pull-licenses=skip* skips all license pulling related
>>>    tasks. This tag is provided for users who customize their images with
>>>    DockerFile provided by us and would not like to deal with license stuff.
>>>    Without this tag, the users should change the .gradle file to get rid of
>>>    it. It also can be used for local development when developers want to 
>>> solve
>>>    license related issues later.
>>>
>>> I would like to see this tag is adopted by all docker images released by
>>> Beam, including Flink and Spark job server images, and share the same
>>> default mode.
>>> Does this sound good? Are there any concerns?
>>>
>>> Please let me know what you think.
>>> Hannah
>>>
>>>
>>>
>>> On Thu, Apr 16, 2020 at 11:46 AM Hannah Jiang <[email protected]>
>>> wrote:
>>>
>>>> Ok, so the fianl approach is
>>>> 1. Using multi threading with 16 threads.
>>>> 2. Introduce docker-pull-license tag, by default it is disabled.
>>>> 3. By default, only check if urls return 200, not actually pull the
>>>> files to reduce image size. Licenses add 85MB of image size as of today.
>>>> 4. Add retries to avoid flakiness. Initially it was added to download
>>>> part only, assuming urlib is stable when validates urls, but it is not 
>>>> true.
>>>>
>>>> For caching, it is definitely useful. It will mitigate flakiness
>>>> further, like when github is down (many urls point to github). It also
>>>> reduces unnecessary internet traffic.
>>>> Let's reconsider it after the current work is merged to 2.21.0.
>>>>
>>>> Thank you all for providing feedback and ideas.
>>>>
>>>>
>>>> On Thu, Apr 16, 2020 at 10:24 AM Kyle Weaver <[email protected]>
>>>> wrote:
>>>>
>>>>> > I tried with multi processing and it improved the performance a lot.
>>>>>
>>>>> Great! Though it won't help with flakes, so as you said we should
>>>>> still look into caching as well.
>>>>>
>>>>> > You could try using a threadpool
>>>>>
>>>>> +1
>>>>>
>>>>> > However, we would like to include this work as part of 2.21.0
>>>>>
>>>>> I had marked the jira as a blocker for 2.21.0 because I was afraid
>>>>> something was broken, but now it looks like the failures were just flakes.
>>>>> So BEAM-9764 <https://issues.apache.org/jira/browse/BEAM-9764> should
>>>>> not be a release blocker.
>>>>>
>>>>> On Thu, Apr 16, 2020 at 1:00 PM Thomas Weise <[email protected]> wrote:
>>>>>
>>>>>> Hi Hannah,
>>>>>>
>>>>>> Thanks for investigating!
>>>>>>
>>>>>> I think it would be great to eliminate the overhead for local builds
>>>>>> (by default turn off the license assembly) and make it as lightweight
>>>>>> as possible in the frequent CI path.
>>>>>>
>>>>>> Thomas
>>>>>>
>>>>>>
>>>>>> On Thu, Apr 16, 2020 at 1:37 AM Hannah Jiang <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> I tried to check if urls are valid instead of pulling the files and
>>>>>>> it reduced only 1 min of running time. So, it's not an option here.
>>>>>>>
>>>>>>> I tried with multi processing and it improved the performance a lot.
>>>>>>> With 12 subprocesses, the running time reduced to 49 seconds, and
>>>>>>> with 16 cores, it reduced to 18 seconds.
>>>>>>> The number of subprocesses is defined by the number of cores, and
>>>>>>> Jenkins machine has 16 cores.
>>>>>>> FYI: with my local machine (12 cores) and home network, it takes
>>>>>>> 1min 40 secs to create a Java docker image.
>>>>>>>
>>>>>>> The caching approach mentioned by Robert brings many benefits, not
>>>>>>> only to this use case.
>>>>>>> However, we would like to include this work as part of 2.21.0, so I
>>>>>>> will move with the multi processing approach this time.
>>>>>>>
>>>>>>> Please let me know if you have objections.
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Apr 15, 2020 at 4:01 PM Robert Bradshaw <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Is the cost primarily in pulling these remote licenses/sources? I'd
>>>>>>>> guess that 99.9% of the URLs remain the same from run to run. Would a
>>>>>>>> simple cache, or caching proxy, be sufficient?
>>>>>>>>
>>>>>>>> Otherwise, a tag to check that licenses can be pulled, but not
>>>>>>>> really pull them, might be sufficient. (Making sure the default is 
>>>>>>>> cheap
>>>>>>>> but we don't accidentally omit them when it matters is the tricky bit 
>>>>>>>> I see
>>>>>>>> here.)
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Apr 15, 2020 at 3:38 PM Hannah Jiang <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Thanks for providing feedback.
>>>>>>>>>
>>>>>>>>> Here is what happending now and I would discuss when to run the
>>>>>>>>> job.
>>>>>>>>>
>>>>>>>>> *Why it takes 7-8 mins for Java?*
>>>>>>>>> When we list dependencies both in runtime and compile environment,
>>>>>>>>> there are almost 1400 third party dependencies and we need to pull
>>>>>>>>> licenses/notices for all of them.
>>>>>>>>> In addition, we need to pull source code if license is CDDL, MPL,
>>>>>>>>> GLP or LGPL. 69 of the dependencies need to pull the source code as of
>>>>>>>>> 4/14/2020.
>>>>>>>>> Getting dependency list + pulling licenses/notices/source code
>>>>>>>>> takes 7-8 minutes.
>>>>>>>>>
>>>>>>>>> Now I see there are *two patterns of failures*.
>>>>>>>>> 1. In valid URLs. In fact, the urls are not invalid, but
>>>>>>>>> occationally, it returns URLError. This can be resolved by adding 
>>>>>>>>> retries.
>>>>>>>>> However, it will add runtime to the job.
>>>>>>>>> 2. No artifacts available. Sometimes, when a new version of
>>>>>>>>> package is released  and the plugin still looks for staging location. 
>>>>>>>>> For
>>>>>>>>> example, new zetasql packages were released on 4/14, and today I saw
>>>>>>>>> several failures with looking for staging repo. The behavior is not
>>>>>>>>> consistent, sometimes it scans correct location, sometimes not. This 
>>>>>>>>> can be
>>>>>>>>> resolved by running the job again.
>>>>>>>>>
>>>>>>>>> *When the job is running?*
>>>>>>>>> generateThirdPartyLicenses is added to :sdks:java:container and it
>>>>>>>>> is an upstream of the docker task. As such, whenever a docker is 
>>>>>>>>> created,
>>>>>>>>> the job is triggered.
>>>>>>>>> :sdks:java:container:docker is added to Java PreSubmit job.
>>>>>>>>>
>>>>>>>>> *How to improve it?*
>>>>>>>>> According to some ideas provided above, how about doing this?
>>>>>>>>> Introduce a tag (ie: pull-licenses) to docker job to decide if
>>>>>>>>> pull the files. Default tag is NOT setting pull-licenses.
>>>>>>>>> When pull-licenses is not set, it checks if the
>>>>>>>>> licenses/notices/source code can be pull automaticall or they have 
>>>>>>>>> urls to
>>>>>>>>> pull from, but don't really pull.
>>>>>>>>> When pull-license is set, files are pulled.
>>>>>>>>>
>>>>>>>>> For each PR (Presubmit): applying default option. The test would
>>>>>>>>> fail if the files cannot be pulled, so committers still need to fix
>>>>>>>>> dependency errors. I believe it would reduce the running time.
>>>>>>>>> For release: set the tag and pull the files and source code. Since
>>>>>>>>> it is checked for each PR, pulling should finish without problems.
>>>>>>>>>
>>>>>>>>> Please let me know what you think and if there are other things
>>>>>>>>> can be improved.
>>>>>>>>>
>>>>>>>>> Hannah
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Apr 15, 2020 at 2:30 PM Kyle Weaver <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Looks like the same error as this Jira:
>>>>>>>>>> https://issues.apache.org/jira/browse/BEAM-9764
>>>>>>>>>>
>>>>>>>>>> Even if/when we are able to fix this particular issue, I agree it
>>>>>>>>>> is best not to run this job except for releases because of the 
>>>>>>>>>> inherent
>>>>>>>>>> network cost and possible reliability issues. +Hannah Jiang
>>>>>>>>>> <[email protected]> What do you think?
>>>>>>>>>>
>>>>>>>>>> On Wed, Apr 15, 2020 at 5:20 PM Thomas Weise <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> The new feature to assemble licenses is very useful but
>>>>>>>>>>> appears to add several minutes (7-8?)  build time to jobs that need 
>>>>>>>>>>> to
>>>>>>>>>>> build a container.
>>>>>>>>>>>
>>>>>>>>>>> Does it also seem to cause occasional build failures?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> https://builds.apache.org/job/beam_PreCommit_Python2_PVR_Flink_Phrase/131/
>>>>>>>>>>>
>>>>>>>>>>> Would it be possible to perform this task only during release
>>>>>>>>>>> builds?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Thomas
>>>>>>>>>>>
>>>>>>>>>>>

Reply via email to