[
https://issues.apache.org/jira/browse/BEAM-14314?focusedWorklogId=765142&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-765142
]
ASF GitHub Bot logged work on BEAM-14314:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 02/May/22 21:04
Start Date: 02/May/22 21:04
Worklog Time Spent: 10m
Work Description: pabloem commented on code in PR #17380:
URL: https://github.com/apache/beam/pull/17380#discussion_r863189335
##########
sdks/python/apache_beam/io/azure/blobstorageio.py:
##########
@@ -559,40 +569,54 @@ def _delete_batch(self, container, blobs):
@retry.with_exponential_backoff(
retry_filter=retry.retry_on_beam_io_error_filter)
- def list_prefix(self, path):
+ def list_prefix(self, path, with_metadata=False):
"""Lists files matching the prefix.
Args:
path: Azure Blob Storage file path pattern in the form
azfs://<storage-account>/<container>/[name].
+ with_metadata: Experimental. Specify whether returns file metadata.
Returns:
- Dictionary of file name -> size.
+ If ``with_metadata`` is False: dict of file name -> size; if
+ ``with_metadata`` is True: dict of file name -> tuple(size, timestamp).
"""
storage_account, container, blob = parse_azfs_path(
path, blob_optional=True, get_account=True)
- file_sizes = {}
+ file_info = {}
counter = 0
start_time = time.time()
- logging.info("Starting the size estimation of the input")
+ if with_metadata:
+ logging.info("Starting the file information of the input")
+ else:
+ logging.info("Starting the size estimation of the input")
container_client = self.client.get_container_client(container)
while True:
response = container_client.list_blobs(name_starts_with=blob)
for item in response:
file_name = "azfs://%s/%s/%s" % (storage_account, container, item.name)
- file_sizes[file_name] = item.size
+ if with_metadata:
+ file_info[file_name] = (
+ item.size, self._updated_to_seconds(item.last_modified))
Review Comment:
To be honest, I might prefer that this always return a namedtuple, rather
than a tuple or a bare value depending on the arguments.
We don't need to change this now, since it's not officially a public API, but
it may be worth cleaning up in the future. Maybe add a JIRA issue with a
target version of Beam 3.0.0?
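A minimal sketch of the namedtuple idea, for illustration only. The names `BlobInfo` and `collect_blob_info` are hypothetical and are not part of Beam or this PR. The conversion via `datetime.timestamp()` stands in for `self._updated_to_seconds`, whose implementation is not shown in the diff. The items are assumed to expose `name`, `size`, and `last_modified`, as in the loop above.

```python
import datetime
from types import SimpleNamespace
from typing import Dict, Iterable, NamedTuple


class BlobInfo(NamedTuple):
    """Hypothetical record type: one shape regardless of caller arguments."""
    size: int
    last_updated_in_seconds: float


def collect_blob_info(
    storage_account: str, container: str, items: Iterable) -> Dict[str, BlobInfo]:
    """Builds file name -> BlobInfo from objects with name/size/last_modified.

    Assumption: last_modified is a timezone-aware datetime; timestamp() is
    used here in place of the PR's _updated_to_seconds helper.
    """
    file_info = {}
    for item in items:
        file_name = "azfs://%s/%s/%s" % (storage_account, container, item.name)
        file_info[file_name] = BlobInfo(
            size=item.size,
            last_updated_in_seconds=item.last_modified.timestamp())
    return file_info


if __name__ == "__main__":
    # Stand-in for the objects yielded by container_client.list_blobs().
    fake_items = [
        SimpleNamespace(
            name="data/part-0.txt",
            size=1024,
            last_modified=datetime.datetime(
                2022, 5, 2, 21, 4, tzinfo=datetime.timezone.utc)),
    ]
    for name, info in collect_blob_info("account", "container", fake_items).items():
        print(name, info.size, info.last_updated_in_seconds)
```

Callers that only need sizes would read `info.size`, and a namedtuple still unpacks like a plain tuple, so code written against the `(size, timestamp)` form would keep working.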
Issue Time Tracking
-------------------
Worklog Id: (was: 765142)
Time Spent: 4h 40m (was: 4.5h)
> Add last_updated field in filesystem.FileMetaData
> -------------------------------------------------
>
> Key: BEAM-14314
> URL: https://issues.apache.org/jira/browse/BEAM-14314
> Project: Beam
> Issue Type: New Feature
> Components: io-py-common
> Reporter: Yi Hu
> Assignee: Yi Hu
> Priority: P2
> Time Spent: 4h 40m
> Remaining Estimate: 0h
>
> This will be the Python counterpart of BEAM-5910.
> Per Python naming convention, the field will be named
> "last_updated_in_seconds".
--
This message was sent by Atlassian Jira
(v8.20.7#820007)