[
https://issues.apache.org/jira/browse/BEAM-14314?focusedWorklogId=758051&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-758051
]
ASF GitHub Bot logged work on BEAM-14314:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 18/Apr/22 18:22
Start Date: 18/Apr/22 18:22
Worklog Time Spent: 10m
Work Description: johnjcasey commented on code in PR #17380:
URL: https://github.com/apache/beam/pull/17380#discussion_r852311644
##########
sdks/python/apache_beam/io/azure/blobstorageio.py:
##########
@@ -372,29 +355,47 @@ def last_updated(self, path):
Returns: last updated time of the Azure Blob Storage blob
in seconds.
"""
- container, blob = parse_azfs_path(path)
- blob_to_check = self.client.get_blob_client(container, blob)
- try:
- properties = blob_to_check.get_blob_properties()
- except ResourceNotFoundError as e:
- message = e.reason
- code = e.status_code
- raise BlobStorageError(message, code)
+ return self._updated_to_seconds(self._blob_properties(path).last_modified)
- datatime = properties.last_modified
- return (
- time.mktime(datatime.timetuple()) - time.timezone +
- datatime.microsecond / 1000000.0)
-
- @retry.with_exponential_backoff(
- retry_filter=retry.retry_on_beam_io_error_filter)
def checksum(self, path):
"""Looks up the checksum of an Azure Blob Storage blob.
Args:
path: Azure Blob Storage file path pattern in the form
azfs://<storage-account>/<container>/[name].
"""
+ return self._blob_properties(path).properties.etag
+
+ def _vars(self, path):
Review Comment:
can we improve the name here too?
##########
sdks/python/apache_beam/io/filesystem.py:
##########
@@ -421,27 +421,44 @@ def __exit__(self, exception_type, exception_value,
traceback):
class FileMetadata(object):
- """Metadata about a file path that is the output of FileSystem.match."""
- def __init__(self, path, size_in_bytes):
+ """Metadata about a file path that is the output of FileSystem.match.
+
+ Fields:
+ path: [Required] file path.
+ size_in_bytes: [Required] file size in bytes.
+ last_updated_in_seconds: [Optional] last modified timestamp of the file, or
+ valued 0.0 if not specified.
Review Comment:
Does a timestamp in seconds match what we have in java? I would expect it to
be millis since epoch generally
##########
sdks/python/apache_beam/io/gcp/gcsio.py:
##########
@@ -532,39 +510,83 @@ def last_updated(self, path):
Returns: last updated time of the GCS object in second.
"""
+ return self._updated_to_seconds(self._gcs_object(path).updated)
+
+ def _vars(self, path):
Review Comment:
I may be missing context, but _vars doesn't tell me much about what this
method does. Can we get a better name here?
##########
sdks/python/apache_beam/io/aws/s3io.py:
##########
@@ -464,6 +459,47 @@ def exists(self, path):
# We re-raise all other exceptions
raise
+ def _vars(self, path):
Review Comment:
same here
##########
sdks/python/apache_beam/io/hadoopfilesystem.py:
##########
@@ -399,6 +407,26 @@ def checksum(self, url):
file_checksum[_FILE_CHECKSUM_BYTES],
)
+ def metadata(self, url):
+ """Fetch metadata fields of a file on the FileSystem.
+
+ Args:
+ url: string url of a file.
+
+ Returns:
+ :class:`~apache_beam.io.filesystem.FileMetadata`.
+ Note: last_updated field is not supported yet.
+
+ Raises:
+ ``BeamIOError``: if url doesn't exist.
+ """
+ _, path = self._parse_url(url)
+ status = self._hdfs_client.status(path, strict=False)
+ print(status)
+ if status is None:
+ raise BeamIOError('File not found: %s' % url)
+ return FileMetadata(url, status[_FILE_STATUS_LENGTH])
Review Comment:
does hdfs have a way to get the timestamp?
Issue Time Tracking
-------------------
Worklog Id: (was: 758051)
Time Spent: 1h (was: 50m)
> Add last_updated field in filesystem.FileMetaData
> -------------------------------------------------
>
> Key: BEAM-14314
> URL: https://issues.apache.org/jira/browse/BEAM-14314
> Project: Beam
> Issue Type: New Feature
> Components: io-py-common
> Reporter: Yi Hu
> Assignee: Yi Hu
> Priority: P2
> Time Spent: 1h
> Remaining Estimate: 0h
>
> This will be the python counterpart of BEAM-5910
> Per python naming convention, the field will be named as
> "last_updated_in_seconds".
--
This message was sent by Atlassian Jira
(v8.20.1#820001)