That sounds very promising.

You could try using a thread pool rather than the (more heavyweight)
multiprocessing, as this is almost certainly an IO-bound task. The number
of workers then doesn't need to be bound by the number of cores.
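
A minimal sketch of the thread pool idea, assuming the license script is
Python and downloads with urllib. The names fetch and pull_all, the worker
count, and the retry count are all hypothetical, not taken from the actual
script:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import urllib.request
from urllib.error import URLError


def fetch(url, retries=3, timeout=30):
    """Download one URL, retrying on URLError (hypothetical helper).

    Note that HTTPError is a subclass of URLError, so HTTP error
    responses are retried as well.
    """
    for attempt in range(1, retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except URLError:
            if attempt == retries:
                raise


def pull_all(urls, fetch_fn=fetch, max_workers=32):
    """Fetch all URLs concurrently; return (results, errors) keyed by URL."""
    results, errors = {}, {}
    # Threads spend most of their time waiting on the network, so
    # max_workers can be well above the number of cores.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_fn, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # Collect failures instead of aborting the whole run.
                errors[url] = exc
    return results, errors
```

Collecting errors per URL rather than failing on the first one means a
single run reports every dependency that needs attention.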

On Thu, Apr 16, 2020 at 10:00 AM Thomas Weise <t...@apache.org> wrote:

> Hi Hannah,
>
> Thanks for investigating!
>
> I think it would be great to eliminate the overhead for local builds (by
> default turn off the license assembly) and make it as lightweight
> as possible in the frequent CI path.
>
> Thomas
>
>
> On Thu, Apr 16, 2020 at 1:37 AM Hannah Jiang <hannahji...@google.com>
> wrote:
>
>> I tried checking whether the URLs are valid instead of pulling the files,
>> but that cut only 1 min from the running time. So, it's not an option here.
>>
>> I tried multiprocessing and it improved the performance a lot.
>> With 12 subprocesses, the running time dropped to 49 seconds, and with 16
>> cores, it dropped to 18 seconds.
>> The number of subprocesses is set by the number of cores, and the Jenkins
>> machine has 16 cores.
>> FYI: on my local machine (12 cores) and home network, it takes 1 min 40
>> secs to create a Java docker image.
>>
>> The caching approach mentioned by Robert brings many benefits, not only
>> to this use case.
>> However, we would like to include this work as part of 2.21.0, so I will
>> move forward with the multiprocessing approach this time.
>>
>> Please let me know if you have objections.
>>
>>
>> On Wed, Apr 15, 2020 at 4:01 PM Robert Bradshaw <rober...@google.com>
>> wrote:
>>
>>> Is the cost primarily in pulling these remote licenses/sources? I'd
>>> guess that 99.9% of the URLs remain the same from run to run. Would a
>>> simple cache, or caching proxy, be sufficient?
>>>
>>> Otherwise, a tag to check that licenses can be pulled, but not really
>>> pull them, might be sufficient. (Making sure the default is cheap but we
>>> don't accidentally omit them when it matters is the tricky bit I see here.)
>>>
>>>
>>> On Wed, Apr 15, 2020 at 3:38 PM Hannah Jiang <hannahji...@google.com>
>>> wrote:
>>>
>>>> Thanks for providing feedback.
>>>>
>>>> Here is what is happening now, and I would like to discuss when to run
>>>> the job.
>>>>
>>>> *Why it takes 7-8 mins for Java?*
>>>> When we list dependencies in both the runtime and compile environments,
>>>> there are almost 1400 third party dependencies and we need to pull
>>>> licenses/notices for all of them.
>>>> In addition, we need to pull source code if the license is CDDL, MPL, GPL
>>>> or LGPL. As of 4/14/2020, 69 of the dependencies require pulling the
>>>> source code.
>>>> Getting dependency list + pulling licenses/notices/source code takes
>>>> 7-8 minutes.
>>>>
>>>> Now I see there are *two patterns of failures*.
>>>> 1. Invalid URLs. In fact, the URLs are not invalid, but occasionally a
>>>> request returns URLError. This can be resolved by adding retries. However,
>>>> that will add runtime to the job.
>>>> 2. No artifacts available. Sometimes, when a new version of a package is
>>>> released, the plugin still looks in the staging location. For example, new
>>>> zetasql packages were released on 4/14, and today I saw several failures
>>>> from looking in the staging repo. The behavior is not consistent: sometimes
>>>> it scans the correct location, sometimes not. This can be resolved by
>>>> running the job again.
>>>>
>>>> *When does the job run?*
>>>> generateThirdPartyLicenses is added to :sdks:java:container and it is
>>>> upstream of the docker task. As such, whenever a docker image is created,
>>>> the job is triggered.
>>>> :sdks:java:container:docker is added to the Java PreSubmit job.
>>>>
>>>> *How to improve it?*
>>>> Based on some of the ideas provided above, how about doing this?
>>>> Introduce a tag (i.e. pull-licenses) to the docker job to decide whether
>>>> to pull the files. By default, pull-licenses is NOT set.
>>>> When pull-licenses is not set, the job checks whether the
>>>> licenses/notices/source code can be pulled automatically or have URLs to
>>>> pull from, but doesn't actually pull them.
>>>> When pull-licenses is set, the files are pulled.
>>>>
>>>> For each PR (Presubmit): apply the default option. The test would fail
>>>> if the files cannot be pulled, so committers still need to fix dependency
>>>> errors. I believe it would reduce the running time.
>>>> For release: set the tag and pull the files and source code. Since this
>>>> is checked for each PR, pulling should finish without problems.
>>>>
>>>> Please let me know what you think and whether there are other things
>>>> that can be improved.
>>>>
>>>> Hannah
>>>>
>>>>
>>>>
>>>> On Wed, Apr 15, 2020 at 2:30 PM Kyle Weaver <kcwea...@google.com>
>>>> wrote:
>>>>
>>>>> Looks like the same error as this Jira:
>>>>> https://issues.apache.org/jira/browse/BEAM-9764
>>>>>
>>>>> Even if/when we are able to fix this particular issue, I agree it is
>>>>> best not to run this job except for releases because of the inherent
>>>>> network cost and possible reliability issues. +Hannah Jiang
>>>>> <hannahji...@google.com> What do you think?
>>>>>
>>>>> On Wed, Apr 15, 2020 at 5:20 PM Thomas Weise <t...@apache.org> wrote:
>>>>>
>>>>>> The new feature to assemble licenses is very useful but appears to
>>>>>> add several minutes (7-8?) of build time to jobs that need to build a
>>>>>> container.
>>>>>>
>>>>>> Does it also seem to cause occasional build failures?
>>>>>>
>>>>>>
>>>>>> https://builds.apache.org/job/beam_PreCommit_Python2_PVR_Flink_Phrase/131/
>>>>>>
>>>>>> Would it be possible to perform this task only during release builds?
>>>>>>
>>>>>> Thanks,
>>>>>> Thomas
>>>>>>
>>>>>>
