[
https://issues.apache.org/jira/browse/BEAM-14315?focusedWorklogId=774731&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-774731
]
ASF GitHub Bot logged work on BEAM-14315:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 25/May/22 17:55
Start Date: 25/May/22 17:55
Worklog Time Spent: 10m
Work Description: Abacn commented on code in PR #17604:
URL: https://github.com/apache/beam/pull/17604#discussion_r881959947
##########
sdks/python/apache_beam/io/avroio.py:
##########
@@ -176,20 +181,70 @@ def __init__(
name and the value being the actual data. If False, it only returns
the data.
"""
- source_from_file = partial(
+ self._source_from_file = partial(
_create_avro_source, min_bundle_size=min_bundle_size)
- self._read_all_files = filebasedsource.ReadAllFiles(
+ self._desired_bundle_size = desired_bundle_size
+ self._min_bundle_size = min_bundle_size
+ self._with_filename = with_filename
+ self.label = label
+
+ def _set_read_all_files(self):
+ """Helper function to set _read_all_files PTransform in constructor."""
+ return filebasedsource.ReadAllFiles(
True,
CompressionTypes.AUTO,
- desired_bundle_size,
- min_bundle_size,
- source_from_file,
- with_filename)
-
- self.label = label
+ self._desired_bundle_size,
+ self._min_bundle_size,
+ self._source_from_file,
+ self._with_filename)
def expand(self, pvalue):
- return pvalue | self.label >> self._read_all_files
+ return pvalue | self.label >> self._set_read_all_files()
+
+
+class ReadAllFromAvroContinuously(ReadAllFromAvro):
+ """A ``PTransform`` for reading avro files in given file patterns.
+ This PTransform acts as a Source and produces continuously a ``PCollection``
+ of strings.
+
+ For more details, see ``ReadAllFromAvro`` for avro parsing settings;
+ see ``apache_beam.io.fileio.MatchContinuously`` for watching settings.
+
+ ReadAllFromAvroContinuously is experimental. No backwards-compatibility
+ guarantees. Due to the limitation on Reshuffle, current implementation does
+ not scale.
+ """
+ def __init__(self, file_pattern, label='ReadAllFilesContinuously', **kwargs):
+ """Initialize the ``ReadAllFromAvroContinuously`` transform.
+
+ Accepts args for constructor args of both ``ReadAllFromAvro`` and
+ ``apache_beam.io.fileio.MatchContinuously``.
+ """
+ kwargs_for_match = {
+ k: v
+ for (k, v) in kwargs.items()
+ if k in filebasedsource.ReadAllFilesContinuously.ARGS_FOR_MATCH
+ }
+ kwargs_for_read = {
+ k: v
+ for (k, v) in kwargs.items()
+ if k not in filebasedsource.ReadAllFilesContinuously.ARGS_FOR_MATCH
Review Comment:
Yeah, I agree it sounds weird. The intent was to avoid re-declaring the
default argument values of both ReadAllFromAvro and ReadAllFilesContinuously.
Another option is to pass a MatchContinuously instance as a parameter to
ReadAllFromAvroContinuously, but that seems like an anti-pattern in Python
pipeline syntax.
Since ReadAllFromAvroContinuously is merely a combination of
MatchContinuously and ReadAllFromAvro, I think we can avoid adding this new
PTransform to the API at all, and just document how to use these two
transforms together to implement the Read All From Avro Continuously
functionality.
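For readers following the thread, the keyword-splitting approach that this comment calls weird can be sketched in plain Python. This is only an illustration of the technique in the quoted diff, not the PR's code: the contents of ARGS_FOR_MATCH below are a hypothetical stand-in (the real set is not shown in the hunk), and split_kwargs is a made-up helper name.

```python
from typing import Any, Dict, FrozenSet, Tuple

# Hypothetical stand-in for ReadAllFilesContinuously.ARGS_FOR_MATCH.
# The real contents are not visible in the quoted diff.
ARGS_FOR_MATCH: FrozenSet[str] = frozenset(
    {"interval", "has_deduplication", "start_timestamp", "stop_timestamp"})


def split_kwargs(
    kwargs: Dict[str, Any],
    match_keys: FrozenSet[str] = ARGS_FOR_MATCH,
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
  """Partition kwargs into (kwargs_for_match, kwargs_for_read).

  Keys listed in match_keys go to the file-matching transform; every
  other key is forwarded to the reading transform unchanged.
  """
  kwargs_for_match = {k: v for k, v in kwargs.items() if k in match_keys}
  kwargs_for_read = {k: v for k, v in kwargs.items() if k not in match_keys}
  return kwargs_for_match, kwargs_for_read


# Example: one key for the matcher, one for the reader.
match_kw, read_kw = split_kwargs({"interval": 60, "with_filename": True})
```

Note that the split itself validates nothing: any key outside the match set is handed to the reader, which is part of why composing the two existing transforms explicitly may be easier to reason about.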
Issue Time Tracking
-------------------
Worklog Id: (was: 774731)
Time Spent: 1h 50m (was: 1h 40m)
> Update fileio.MatchContinuously to allow reading already read files with a
> new timestamp
> ----------------------------------------------------------------------------------------
>
> Key: BEAM-14315
> URL: https://issues.apache.org/jira/browse/BEAM-14315
> Project: Beam
> Issue Type: New Feature
> Components: io-py-common
> Reporter: Yi Hu
> Assignee: Yi Hu
> Priority: P2
> Time Spent: 1h 50m
> Remaining Estimate: 0h
>
> This will be the Python counterpart of BEAM-14267.
> For fileio.MatchContinuously, we want to add an option to choose to consider
> a file new if it has a different timestamp from an existing file, even if the
> file itself has the same name.
> See the following design doc for more detail:
> https://docs.google.com/document/d/1xnacyLGNh6rbPGgTAh5D1gZVR8rHUBsMMRV3YkvlL08/edit?usp=sharing&resourcekey=0-be0uF-DdmwAz6Vg4Li9FNw
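The behaviour requested in the description above can be sketched in plain Python, independent of Beam. This is a minimal illustration under stated assumptions, not the MatchContinuously implementation: the function name new_files, the flag treat_updated_as_new, and the (path, last_updated) tuple shape are all hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

FileInfo = Tuple[str, float]  # (path, last-updated timestamp); assumed shape


def new_files(
    listing: Iterable[FileInfo],
    seen: Dict[str, float],
    treat_updated_as_new: bool = False,
) -> List[FileInfo]:
  """Return the entries of one directory listing that count as new.

  seen maps each path already emitted to the timestamp it was emitted
  with, and is updated in place. With treat_updated_as_new=False a path
  is emitted only once; with True it is emitted again whenever its
  timestamp differs from the recorded one.
  """
  fresh = []
  for path, last_updated in listing:
    if path not in seen:
      is_new = True
    else:
      is_new = treat_updated_as_new and seen[path] != last_updated
    if is_new:
      seen[path] = last_updated
      fresh.append((path, last_updated))
  return fresh
```

Calling new_files repeatedly with the same seen dictionary mimics successive polling rounds: a file whose name is unchanged but whose timestamp differs is returned again only when the flag is enabled.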
--
This message was sent by Atlassian Jira
(v8.20.7#820007)