Fokko commented on a change in pull request #3560: [AIRFLOW-2697] Drop snakebite in favour of hdfs3
URL: https://github.com/apache/incubator-airflow/pull/3560#discussion_r210182865
##########
File path: airflow/sensors/hdfs_sensor.py
##########
@@ -17,103 +17,231 @@
 # specific language governing permissions and limitations
 # under the License.
-import re
-import sys
-from builtins import str
+import posixpath
 from airflow import settings
-from airflow.hooks.hdfs_hook import HDFSHook
+from airflow.hooks.hdfs_hook import HdfsHook
 from airflow.sensors.base_sensor_operator import BaseSensorOperator
 from airflow.utils.decorators import apply_defaults
-from airflow.utils.log.logging_mixin import LoggingMixin
-class HdfsSensor(BaseSensorOperator):
-    """
-    Waits for a file or folder to land in HDFS
+class HdfsFileSensor(BaseSensorOperator):
+    """Sensor that waits for files matching a specific (glob) pattern to land in HDFS.
+
+    :param str file_pattern: Glob pattern to match.
+    :param str conn_id: Connection to use.
+    :param Iterable[FilePathFilter] filters: Optional list of filters that can be
+        used to apply further filtering to any file paths matching the glob pattern.
+        Any files that fail a filter are dropped from consideration.
+    :param int min_size: Minimum size (in MB) for files to be considered. Can be used
+        to filter any intermediate files that are below the expected file size.
+    :param Set[str] ignore_exts: File extensions to ignore. By default, files with
+        a '_COPYING_' extension are ignored, as these represent temporary files.

Review comment:
Good point @XD-DENG. We could also trim the prepended `.` from the extension to make both situations work.
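
The suggestion in the review comment (trimming the leading `.` so that both `_COPYING_` and `._COPYING_` are accepted in `ignore_exts`) could look roughly like the sketch below. The helper names `normalize_ext` and `has_ignored_ext` are hypothetical and not part of the PR. Using `posixpath.splitext` to derive the extension is also an assumption about how the sensor would do it.

```python
import posixpath


def normalize_ext(ext):
    """Strip a single leading '.' so that '.tmp' and 'tmp' compare equal.

    Hypothetical helper, not taken from the PR.
    """
    return ext[1:] if ext.startswith(".") else ext


def has_ignored_ext(path, ignore_exts):
    """Return True if the extension of `path` is listed in `ignore_exts`.

    Entries in `ignore_exts` may be given with or without the leading dot.
    Hypothetical helper, not taken from the PR.
    """
    # posixpath.splitext keeps the leading dot, e.g. '._COPYING_'.
    _, ext = posixpath.splitext(path)
    if not ext:
        # Files without an extension are never ignored by this check.
        return False
    ignored = {normalize_ext(e) for e in ignore_exts}
    return normalize_ext(ext) in ignored
```

With this normalization, `has_ignored_ext(path, {"_COPYING_"})` and `has_ignored_ext(path, {"._COPYING_"})` behave the same, which is the "both situations" case raised in the review.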
