[GitHub] [airflow] vincbeck commented on a diff in pull request #22737: Deprecate S3PrefixSensor

GitBox Wed, 06 Apr 2022 10:07:36 -0700


vincbeck commented on code in PR #22737:
URL: https://github.com/apache/airflow/pull/22737#discussion_r844186593



##########
airflow/providers/amazon/aws/sensors/s3.py:
##########
@@ -78,27 +80,32 @@ def __init__(
     ):
         super().__init__(**kwargs)
         self.bucket_name = bucket_name
-        self.bucket_key = bucket_key
+        self.bucket_key = [bucket_key] if isinstance(bucket_key, str) else bucket_key
         self.wildcard_match = wildcard_match
         self.aws_conn_id = aws_conn_id
         self.verify = verify
         self.hook: Optional[S3Hook] = None
 
-    def _resolve_bucket_and_key(self):
+    def _resolve_bucket_and_key(self, key):
         """If key is URI, parse bucket"""
         if self.bucket_name is None:
-            self.bucket_name, self.bucket_key = S3Hook.parse_s3_url(self.bucket_key)
+            return S3Hook.parse_s3_url(key)
         else:
-            parsed_url = urlparse(self.bucket_key)
+            parsed_url = urlparse(key)
             if parsed_url.scheme != '' or parsed_url.netloc != '':
                 raise AirflowException('If bucket_name provided, bucket_key must be relative path, not URI.')
+            return self.bucket_name, key
 
-    def poke(self, context: 'Context'):
-        self._resolve_bucket_and_key()
-        self.log.info('Poking for key : s3://%s/%s', self.bucket_name, self.bucket_key)
+    def _key_exists(self, key):
+        bucket_name, key = self._resolve_bucket_and_key(key)
+        self.log.info('Poking for key : s3://%s/%s', bucket_name, key)
         if self.wildcard_match:
-            return self.get_hook().check_for_wildcard_key(self.bucket_key, self.bucket_name)
-        return self.get_hook().check_for_key(self.bucket_key, self.bucket_name)
+            return self.get_hook().check_for_wildcard_key(key, bucket_name)
+
+        return self.get_hook().check_for_key(key, bucket_name)
+
+    def poke(self, context: 'Context'):
+        return all(self._key_exists(key) for key in self.bucket_key)

Review Comment:
   The only way to optimize this would be to use `ListObjects` instead, but there 
are some caveats:
   1. `ListObjects` works on a single bucket. So if you want to check 300 
different keys, each in a different bucket, it would make the problem worse.
   2. It would make the code more complex: with `ListObjects` you need to 
handle pagination and check that each file is actually returned in the 
`ListObjects` results.
   3. From a quick look at how current users use `S3PrefixSensor` 
(https://github.com/search?q=S3PrefixSensor&type=code), most usages are for a 
single file.
   
   Because of these points, in my opinion, it's not worth using `ListObjects` 
instead of `HeadObject`.
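
   To make caveats 1 and 2 concrete, here is a rough sketch of what a listing-based check might look like. It is not code from this PR: the helper names `group_keys_by_bucket` and `missing_keys_via_listing` are hypothetical, `client` is assumed to behave like a boto3 S3 client, and the sketch uses the `list_objects_v2` paginator rather than the original `ListObjects` call.

```python
import os
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def group_keys_by_bucket(targets: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Group (bucket, key) pairs by bucket.

    A listing call covers exactly one bucket, so N keys spread over
    N buckets still cost at least N listing requests (caveat 1).
    """
    grouped: Dict[str, Set[str]] = defaultdict(set)
    for bucket, key in targets:
        grouped[bucket].add(key)
    return dict(grouped)


def missing_keys_via_listing(client, bucket: str, keys: Set[str]) -> Set[str]:
    """Return the keys that a paginated listing of `bucket` did not return.

    `client` is assumed to expose get_paginator("list_objects_v2") the way
    a boto3 S3 client does (hypothetical helper, not part of Airflow).
    """
    remaining = set(keys)
    # Narrow the listing to the longest prefix shared by all wanted keys.
    prefix = os.path.commonprefix(sorted(keys))
    paginator = client.get_paginator("list_objects_v2")
    # Results are paginated, so every page has to be walked (caveat 2).
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            remaining.discard(obj["Key"])
        if not remaining:
            break  # everything was found; no need to fetch more pages
    return remaining
```

   With a real client this would be called once per bucket returned by `group_keys_by_bucket`, and the sensor would succeed only when every call returns an empty set. If the wanted keys share only a short prefix, the listing may have to walk many unrelated objects, whereas `HeadObject` costs one request per key regardless.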
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
