Ok, so the fianl approach is
1. Using multi threading with 16 threads.
2. Introduce docker-pull-license tag, by default it is disabled.
3. By default, only check if urls return 200, not actually pull the files
to reduce image size. Licenses add 85MB of image size as of today.
4. Add retries to avoid flakiness. Initially it was added to download part
only, assuming urlib is stable when validates urls, but it is not true.

For caching, it is definitely useful. It will mitigate flakiness further,
like when github is down (many urls point to github). It also reduces
unnecessary internet traffic.
Let's reconsider it after the current work is merged to 2.21.0.

Thank you all for providing feedback and ideas.


On Thu, Apr 16, 2020 at 10:24 AM Kyle Weaver <[email protected]> wrote:

> > I tried with multi processing and it improved the performance a lot.
>
> Great! Though it won't help with flakes, so as you said we should still
> look into caching as well.
>
> > You could try using a threadpool
>
> +1
>
> > However, we would like to include this work as part of 2.21.0
>
> I had marked the jira as a blocker for 2.21.0 because I was afraid
> something was broken, but now it looks like the failures were just flakes.
> So BEAM-9764 <https://issues.apache.org/jira/browse/BEAM-9764> should not
> be a release blocker.
>
> On Thu, Apr 16, 2020 at 1:00 PM Thomas Weise <[email protected]> wrote:
>
>> Hi Hannah,
>>
>> Thanks for investigating!
>>
>> I think it would be great to eliminate the overhead for local builds (by
>> default turn off the license assembly) and make it as lightweight
>> as possible in the frequent CI path.
>>
>> Thomas
>>
>>
>> On Thu, Apr 16, 2020 at 1:37 AM Hannah Jiang <[email protected]>
>> wrote:
>>
>>> I tried to check if urls are valid instead of pulling the files and it
>>> reduced only 1 min of running time. So, it's not an option here.
>>>
>>> I tried with multi processing and it improved the performance a lot.
>>> With 12 subprocesses, the running time reduced to 49 seconds, and with
>>> 16 cores, it reduced to 18 seconds.
>>> The number of subprocesses is defined by the number of cores, and
>>> Jenkins machine has 16 cores.
>>> FYI: with my local machine (12 cores) and home network, it takes 1min 40
>>> secs to create a Java docker image.
>>>
>>> The caching approach mentioned by Robert brings many benefits, not only
>>> to this use case.
>>> However, we would like to include this work as part of 2.21.0, so I will
>>> move with the multi processing approach this time.
>>>
>>> Please let me know if you have objections.
>>>
>>>
>>> On Wed, Apr 15, 2020 at 4:01 PM Robert Bradshaw <[email protected]>
>>> wrote:
>>>
>>>> Is the cost primarily in pulling these remote licenses/sources? I'd
>>>> guess that 99.9% of the URLs remain the same from run to run. Would a
>>>> simple cache, or caching proxy, be sufficient?
>>>>
>>>> Otherwise, a tag to check that licenses can be pulled, but not really
>>>> pull them, might be sufficient. (Making sure the default is cheap but we
>>>> don't accidentally omit them when it matters is the tricky bit I see here.)
>>>>
>>>>
>>>> On Wed, Apr 15, 2020 at 3:38 PM Hannah Jiang <[email protected]>
>>>> wrote:
>>>>
>>>>> Thanks for providing feedback.
>>>>>
>>>>> Here is what happending now and I would discuss when to run the job.
>>>>>
>>>>> *Why it takes 7-8 mins for Java?*
>>>>> When we list dependencies both in runtime and compile environment,
>>>>> there are almost 1400 third party dependencies and we need to pull
>>>>> licenses/notices for all of them.
>>>>> In addition, we need to pull source code if license is CDDL, MPL, GLP
>>>>> or LGPL. 69 of the dependencies need to pull the source code as of
>>>>> 4/14/2020.
>>>>> Getting dependency list + pulling licenses/notices/source code takes
>>>>> 7-8 minutes.
>>>>>
>>>>> Now I see there are *two patterns of failures*.
>>>>> 1. In valid URLs. In fact, the urls are not invalid, but occationally,
>>>>> it returns URLError. This can be resolved by adding retries. However, it
>>>>> will add runtime to the job.
>>>>> 2. No artifacts available. Sometimes, when a new version of package is
>>>>> released  and the plugin still looks for staging location. For example, 
>>>>> new
>>>>> zetasql packages were released on 4/14, and today I saw several failures
>>>>> with looking for staging repo. The behavior is not consistent, sometimes 
>>>>> it
>>>>> scans correct location, sometimes not. This can be resolved by running the
>>>>> job again.
>>>>>
>>>>> *When the job is running?*
>>>>> generateThirdPartyLicenses is added to :sdks:java:container and it is
>>>>> an upstream of the docker task. As such, whenever a docker is created, the
>>>>> job is triggered.
>>>>> :sdks:java:container:docker is added to Java PreSubmit job.
>>>>>
>>>>> *How to improve it?*
>>>>> According to some ideas provided above, how about doing this?
>>>>> Introduce a tag (ie: pull-licenses) to docker job to decide if pull
>>>>> the files. Default tag is NOT setting pull-licenses.
>>>>> When pull-licenses is not set, it checks if the
>>>>> licenses/notices/source code can be pull automaticall or they have urls to
>>>>> pull from, but don't really pull.
>>>>> When pull-license is set, files are pulled.
>>>>>
>>>>> For each PR (Presubmit): applying default option. The test would fail
>>>>> if the files cannot be pulled, so committers still need to fix dependency
>>>>> errors. I believe it would reduce the running time.
>>>>> For release: set the tag and pull the files and source code. Since it
>>>>> is checked for each PR, pulling should finish without problems.
>>>>>
>>>>> Please let me know what you think and if there are other things can be
>>>>> improved.
>>>>>
>>>>> Hannah
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Apr 15, 2020 at 2:30 PM Kyle Weaver <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Looks like the same error as this Jira:
>>>>>> https://issues.apache.org/jira/browse/BEAM-9764
>>>>>>
>>>>>> Even if/when we are able to fix this particular issue, I agree it is
>>>>>> best not to run this job except for releases because of the inherent
>>>>>> network cost and possible reliability issues. +Hannah Jiang
>>>>>> <[email protected]> What do you think?
>>>>>>
>>>>>> On Wed, Apr 15, 2020 at 5:20 PM Thomas Weise <[email protected]> wrote:
>>>>>>
>>>>>>> The new feature to assemble licenses is very useful but appears to
>>>>>>> add several minutes (7-8?)  build time to jobs that need to build a
>>>>>>> container.
>>>>>>>
>>>>>>> Does it also seem to cause occasional build failures?
>>>>>>>
>>>>>>>
>>>>>>> https://builds.apache.org/job/beam_PreCommit_Python2_PVR_Flink_Phrase/131/
>>>>>>>
>>>>>>> Would it be possible to perform this task only during release builds?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Thomas
>>>>>>>
>>>>>>>

Reply via email to