tvalentyn commented on a change in pull request #16633:
URL: https://github.com/apache/beam/pull/16633#discussion_r794221581



##########
File path: sdks/python/apache_beam/options/pipeline_options.py
##########
@@ -1053,7 +1053,21 @@ def _add_argparse_args(cls, parser):
         default=None,
         help=(
             'Path to a folder to cache the packages specified in '
-            'the requirements file using the --requirements_file option.'))
+            'the requirements file using the --requirements_file option.'
+            'If you want to skip populating requirements cache, please '
+            'specify --requirements_cache skip. This would install all'

Review comment:
       I would remove the last sentence.

##########
File path: sdks/python/apache_beam/runners/portability/stager.py
##########
@@ -663,10 +683,21 @@ def remove_dependency_from_requirements(
 
     return tmp_requirements_filename
 
+  @staticmethod
+  def _get_manylinux_distribution():
+    # TODO(anandinguva): When https://github.com/pypa/pip/issues/10760 is
+    # addressed download wheel based on glib version in Beam's Python Base 
image
+    pip_version = pkg_resources.get_distribution('pip').version
+    if float(pip_version[0:4]) >= 19.3:

Review comment:
       This is not reliable, for example it would fail on pip < 10; let's use 
an off-the-shelf helper that knows how to parse 
    and compare versions, see: `pkg_resources.parse_version` or  
`packaging.version.parse`

##########
File path: sdks/python/apache_beam/runners/portability/stager_test.py
##########
@@ -723,6 +765,65 @@ def test_remove_dependency_from_requirements(self):
     self.assertEqual(['apache_beam\n', 'avro-python3\n', 'numpy\n'],
                      sorted(lines))
 
+  def _create_file(

Review comment:
       ```suggestion
     def _populate_requitements_cache_fake(
   ```

##########
File path: sdks/python/apache_beam/options/pipeline_options.py
##########
@@ -1053,7 +1053,21 @@ def _add_argparse_args(cls, parser):
         default=None,
         help=(
             'Path to a folder to cache the packages specified in '
-            'the requirements file using the --requirements_file option.'))
+            'the requirements file using the --requirements_file option.'
+            'If you want to skip populating requirements cache, please '
+            'specify --requirements_cache skip. This would install all'

Review comment:
       ```suggestion
               'specify --requirements_cache="skip". This would install all'
   ```

##########
File path: sdks/python/apache_beam/runners/portability/stager.py
##########
@@ -663,10 +683,21 @@ def remove_dependency_from_requirements(
 
     return tmp_requirements_filename
 
+  @staticmethod
+  def _get_manylinux_distribution():
+    # TODO(anandinguva): When https://github.com/pypa/pip/issues/10760 is

Review comment:
       Add a comment explaining that: 
   pip is still expected to download compatible wheels for older platforms 
(such as manylinux1) if newer wheels (say, manylinux2014) are not available. 
You can also link a resource that describes this.

##########
File path: sdks/python/apache_beam/options/pipeline_options.py
##########
@@ -1053,7 +1053,21 @@ def _add_argparse_args(cls, parser):
         default=None,
         help=(
             'Path to a folder to cache the packages specified in '
-            'the requirements file using the --requirements_file option.'))
+            'the requirements file using the --requirements_file option.'
+            'If you want to skip populating requirements cache, please '
+            'specify --requirements_cache skip. This would install all'
+            'the packages from requirements file on the worker.'))
+    parser.add_argument(
+        '--requirements_cache_only_sources',
+        action='store_true',
+        help=(
+            'Enable this flag to populate requirements cache with Source'

Review comment:
       ```suggestion
               'Enable this flag to populate requirements cache only with 
Source'
   ```

##########
File path: sdks/python/apache_beam/runners/portability/stager.py
##########
@@ -663,10 +683,21 @@ def remove_dependency_from_requirements(
 
     return tmp_requirements_filename
 
+  @staticmethod
+  def _get_manylinux_distribution():

Review comment:
       possible naming suggestion:
   ```suggestion
     def _get_platform_for_default_sdk_container():
   ```

##########
File path: sdks/python/apache_beam/runners/portability/stager_test.py
##########
@@ -723,6 +765,65 @@ def test_remove_dependency_from_requirements(self):
     self.assertEqual(['apache_beam\n', 'avro-python3\n', 'numpy\n'],
                      sorted(lines))
 
+  def _create_file(
+      self, requirements_file, temp_dir, populate_cache_with_sdists):
+    if not populate_cache_with_sdists:
+      self.create_temp_file(os.path.join(temp_dir, 'nothing.whl'), 'Fake whl')
+      self.create_temp_file(
+          os.path.join(temp_dir, 'nothing.tar.gz'), 'Fake tarball')
+    else:
+      self.create_temp_file(
+          os.path.join(temp_dir, 'nothing.tar.gz'), 'Fake tarball')

Review comment:
       appears in both branches

##########
File path: sdks/python/apache_beam/runners/portability/stager.py
##########
@@ -205,8 +210,9 @@ def create_job_resources(options,  # type: PipelineOptions
     # if we know we are using a dependency pre-installed sdk container image.
     if not skip_prestaged_dependencies:
       requirements_cache_path = (
-          os.path.join(tempfile.gettempdir(), 'dataflow-requirements-cache')
-          if setup_options.requirements_cache is None else
+          os.path.join(tempfile.gettempdir(), 'dataflow-requirements-cache') if
+          (setup_options.requirements_cache is None) and
+          (setup_options.requirements_cache != SKIP_REQUIREMENTS_CACHE) else

Review comment:
       `and...` part seems not necessary

##########
File path: sdks/python/container/piputil.go
##########
@@ -51,7 +51,7 @@ func pipInstallRequirements(files []string, dir, name string) 
error {
                        // used without following their dependencies.
                        args := []string{"install", "-r", filepath.Join(dir, 
name), "--disable-pip-version-check", "--no-index", "--no-deps", 
"--find-links", dir}
                        if err := execx.Execute(pip, args...); err != nil {
-                               return err
+                       fmt.Println("Requirements cache " + dir + " is empty. 
Downloading packages from PyPI")

Review comment:
       Also, let's see if we can have a simple integration test for the common 
(binary), scenario that will also exercise boot.go code, a close example is 
https://github.com/apache/beam/blob/0830a02f8984383028a74ca9b89b9e3dc9f1597c/sdks/python/test-suites/portable/common.gradle#L78.
 
   

##########
File path: sdks/python/container/piputil.go
##########
@@ -51,7 +51,7 @@ func pipInstallRequirements(files []string, dir, name string) 
error {
                        // used without following their dependencies.
                        args := []string{"install", "-r", filepath.Join(dir, 
name), "--disable-pip-version-check", "--no-index", "--no-deps", 
"--find-links", dir}
                        if err := execx.Execute(pip, args...); err != nil {
-                               return err
+                       fmt.Println("Requirements cache " + dir + " is empty. 
Downloading packages from PyPI")

Review comment:
       not necessarily empty 

##########
File path: sdks/python/apache_beam/runners/portability/stager_test.py
##########
@@ -342,6 +343,47 @@ def test_with_requirements_file_and_cache(self):
     self.assertTrue(os.path.isfile(os.path.join(staging_dir, 'abc.txt')))
     self.assertTrue(os.path.isfile(os.path.join(staging_dir, 'def.txt')))
 
+  def test_with_requirements_file_skipping_cache(self):
+    staging_dir = self.make_temp_dir()
+    source_dir = self.make_temp_dir()
+
+    options = PipelineOptions()
+    self.update_options(options)
+    options.view_as(SetupOptions).requirements_file = os.path.join(
+        source_dir, stager.REQUIREMENTS_FILE)
+    options.view_as(
+        SetupOptions).requirements_cache = stager.SKIP_REQUIREMENTS_CACHE
+    self.create_temp_file(
+        os.path.join(source_dir, stager.REQUIREMENTS_FILE), 'nothing')
+
+    resources = self.stager.create_and_stage_job_resources(
+        options,
+        populate_requirements_cache=self.populate_requirements_cache,

Review comment:
       alternatively, you could pass a mock method and verify that it was never 
called.

##########
File path: sdks/python/apache_beam/options/pipeline_options.py
##########
@@ -1053,7 +1053,21 @@ def _add_argparse_args(cls, parser):
         default=None,
         help=(
             'Path to a folder to cache the packages specified in '
-            'the requirements file using the --requirements_file option.'))
+            'the requirements file using the --requirements_file option.'
+            'If you want to skip populating requirements cache, please '
+            'specify --requirements_cache skip. This would install all'
+            'the packages from requirements file on the worker.'))
+    parser.add_argument(
+        '--requirements_cache_only_sources',
+        action='store_true',
+        help=(
+            'Enable this flag to populate requirements cache with Source'
+            'distributions(sdists) of the dependencies mentioned in the '
+            '--requirements_file'
+            'Note: This step would slow down the worker startup time'

Review comment:
       Note (BEAM-XXXX): This flag may significantly slow down the pipeline 
submission. It is added to preserve the requirements cache behavior prior to 
2.37.0 and will likely be removed in future releases.

##########
File path: sdks/python/apache_beam/runners/portability/stager_test.py
##########
@@ -342,6 +343,47 @@ def test_with_requirements_file_and_cache(self):
     self.assertTrue(os.path.isfile(os.path.join(staging_dir, 'abc.txt')))
     self.assertTrue(os.path.isfile(os.path.join(staging_dir, 'def.txt')))
 
+  def test_with_requirements_file_skipping_cache(self):

Review comment:
       it's possible to rename the test method to show what's being tested and 
expected outcome, this makes test cases easier to follow. for example:
    
   test_requirements_cache_not_populated_when_cache_disabled

##########
File path: sdks/python/apache_beam/runners/portability/stager.py
##########
@@ -663,10 +683,21 @@ def remove_dependency_from_requirements(
 
     return tmp_requirements_filename
 
+  @staticmethod
+  def _get_manylinux_distribution():
+    # TODO(anandinguva): When https://github.com/pypa/pip/issues/10760 is
+    # addressed download wheel based on glib version in Beam's Python Base 
image

Review comment:
       ```suggestion
       # addressed, download wheel based on glibc version in Beam's Python Base 
image
   ```

##########
File path: sdks/python/container/piputil.go
##########
@@ -51,7 +51,7 @@ func pipInstallRequirements(files []string, dir, name string) 
error {
                        // used without following their dependencies.
                        args := []string{"install", "-r", filepath.Join(dir, 
name), "--disable-pip-version-check", "--no-index", "--no-deps", 
"--find-links", dir}
                        if err := execx.Execute(pip, args...); err != nil {
-                               return err
+                       fmt.Println("Requirements cache " + dir + " is empty. 
Downloading packages from PyPI")

Review comment:
       You could say: Some packages could not be installed from the 
requirements cache and may now be downloaded from PyPI.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to