That sounds very promising. You could try using a thread pool rather than the (more heavyweight) multiprocessing, as this is almost certainly an I/O-bound task. With threads, the number of workers doesn't need to be bound by the number of cores.
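As a rough illustration of the thread-pool idea, here is a minimal sketch. It is not the actual license-pulling script: `fetch`, `fetch_with_retries`, `fetch_all`, and the `max_workers=32` / `attempts=3` defaults are all hypothetical names and values chosen for the example. The retry wrapper is only there because the thread below mentions occasional URLError failures.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.error import URLError
from urllib.request import urlopen


def fetch(url, timeout=30):
    """Download one license/notice file and return its bytes (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_with_retries(url, fetch_fn=fetch, attempts=3):
    """Call fetch_fn(url), retrying on URLError; re-raise after the last attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch_fn(url)
        except URLError:
            if attempt == attempts:
                raise


def fetch_all(urls, fetch_fn=fetch, max_workers=32):
    """Fetch every URL on a thread pool; results are returned in input order.

    max_workers is an assumed value: because the work is I/O-bound, it can
    exceed the number of CPU cores.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda u: fetch_with_retries(u, fetch_fn), urls))
```

Passing `fetch_fn` in makes it possible to try the pool with a stub instead of real network calls. The right pool size would need to be measured on the Jenkins machines.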
On Thu, Apr 16, 2020 at 10:00 AM Thomas Weise <t...@apache.org> wrote:
> Hi Hannah,
>
> Thanks for investigating!
>
> I think it would be great to eliminate the overhead for local builds (by
> default turn off the license assembly) and make it as lightweight
> as possible in the frequent CI path.
>
> Thomas
>
>
> On Thu, Apr 16, 2020 at 1:37 AM Hannah Jiang <hannahji...@google.com>
> wrote:
>
>> I tried to check if URLs are valid instead of pulling the files and it
>> reduced the running time by only 1 min. So, it's not an option here.
>>
>> I tried with multiprocessing and it improved the performance a lot.
>> With 12 subprocesses, the running time was reduced to 49 seconds, and with 16
>> cores, it was reduced to 18 seconds.
>> The number of subprocesses is defined by the number of cores, and the Jenkins
>> machine has 16 cores.
>> FYI: with my local machine (12 cores) and home network, it takes 1 min 40
>> secs to create a Java docker image.
>>
>> The caching approach mentioned by Robert brings many benefits, not only
>> to this use case.
>> However, we would like to include this work as part of 2.21.0, so I will
>> go with the multiprocessing approach this time.
>>
>> Please let me know if you have objections.
>>
>>
>> On Wed, Apr 15, 2020 at 4:01 PM Robert Bradshaw <rober...@google.com>
>> wrote:
>>
>>> Is the cost primarily in pulling these remote licenses/sources? I'd
>>> guess that 99.9% of the URLs remain the same from run to run. Would a
>>> simple cache, or caching proxy, be sufficient?
>>>
>>> Otherwise, a tag to check that licenses can be pulled, but not really
>>> pull them, might be sufficient. (Making sure the default is cheap but we
>>> don't accidentally omit them when it matters is the tricky bit I see here.)
>>>
>>>
>>> On Wed, Apr 15, 2020 at 3:38 PM Hannah Jiang <hannahji...@google.com>
>>> wrote:
>>>
>>>> Thanks for providing feedback.
>>>>
>>>> Here is what is happening now, and I would like to discuss when to run the job.
>>>>
>>>> *Why does it take 7-8 mins for Java?*
>>>> When we list dependencies in both the runtime and compile environments,
>>>> there are almost 1400 third-party dependencies and we need to pull
>>>> licenses/notices for all of them.
>>>> In addition, we need to pull source code if the license is CDDL, MPL, GPL
>>>> or LGPL. 69 of the dependencies need their source code pulled as of
>>>> 4/14/2020.
>>>> Getting the dependency list + pulling licenses/notices/source code takes
>>>> 7-8 minutes.
>>>>
>>>> Now I see there are *two patterns of failures*.
>>>> 1. Invalid URLs. In fact, the URLs are not invalid, but occasionally
>>>> a request returns URLError. This can be resolved by adding retries. However, it
>>>> will add runtime to the job.
>>>> 2. No artifacts available. Sometimes, when a new version of a package is
>>>> released, the plugin still looks for the staging location. For example, new
>>>> zetasql packages were released on 4/14, and today I saw several failures
>>>> while looking for the staging repo. The behavior is not consistent: sometimes it
>>>> scans the correct location, sometimes not. This can be resolved by running the
>>>> job again.
>>>>
>>>> *When does the job run?*
>>>> generateThirdPartyLicenses is added to :sdks:java:container and it is
>>>> upstream of the docker task. As such, whenever a docker image is created, the
>>>> job is triggered.
>>>> :sdks:java:container:docker is added to the Java PreSubmit job.
>>>>
>>>> *How to improve it?*
>>>> Based on some of the ideas provided above, how about doing this?
>>>> Introduce a tag (i.e. pull-licenses) on the docker job to decide whether to pull the
>>>> files. By default, pull-licenses is NOT set.
>>>> When pull-licenses is not set, it checks whether the licenses/notices/source
>>>> code can be pulled automatically or have URLs to pull from, but doesn't
>>>> really pull them.
>>>> When pull-licenses is set, files are pulled.
>>>>
>>>> For each PR (Presubmit): apply the default option.
>>>> The test would fail
>>>> if the files cannot be pulled, so committers still need to fix dependency
>>>> errors. I believe it would reduce the running time.
>>>> For release: set the tag and pull the files and source code. Since it
>>>> is checked for each PR, pulling should finish without problems.
>>>>
>>>> Please let me know what you think and if there are other things that can be
>>>> improved.
>>>>
>>>> Hannah
>>>>
>>>>
>>>>
>>>> On Wed, Apr 15, 2020 at 2:30 PM Kyle Weaver <kcwea...@google.com>
>>>> wrote:
>>>>
>>>>> Looks like the same error as this Jira:
>>>>> https://issues.apache.org/jira/browse/BEAM-9764
>>>>>
>>>>> Even if/when we are able to fix this particular issue, I agree it is
>>>>> best not to run this job except for releases because of the inherent
>>>>> network cost and possible reliability issues. +Hannah Jiang
>>>>> <hannahji...@google.com> What do you think?
>>>>>
>>>>> On Wed, Apr 15, 2020 at 5:20 PM Thomas Weise <t...@apache.org> wrote:
>>>>>
>>>>>> The new feature to assemble licenses is very useful but appears to
>>>>>> add several minutes (7-8?) build time to jobs that need to build a
>>>>>> container.
>>>>>>
>>>>>> Does it also seem to cause occasional build failures?
>>>>>>
>>>>>>
>>>>>> https://builds.apache.org/job/beam_PreCommit_Python2_PVR_Flink_Phrase/131/
>>>>>>
>>>>>> Would it be possible to perform this task only during release builds?
>>>>>>
>>>>>> Thanks,
>>>>>> Thomas
>>>>>>
>>>>>>