damccorm commented on code in PR #34411:
URL: https://github.com/apache/beam/pull/34411#discussion_r2018825431


##########
sdks/python/apache_beam/yaml/yaml_io.py:
##########
@@ -574,3 +578,69 @@ def write_to_iceberg(
 def io_providers():
   return yaml_provider.load_providers(
       yaml_utils.locate_data_file('standard_io.yaml'))
+
+
+def read_from_tfrecord(
+    file_pattern: str,
+    coder: Optional[coders.BytesCoder] = coders.BytesCoder(),
+    compression_type: Optional[CompressionTypes] = None,
+    validate: Optional[bool] = None):
+  """Reads data from TFRecord.
+
+  Args:
+    file_pattern (str): A file glob pattern to read TFRecords from.
+    coder (coders.BytesCoder): Coder used to decode each record.
+    compression_type (CompressionTypes): Used to handle compressed input files.
+      Default value is CompressionTypes.AUTO, in which case the file_path's
+      extension will be used to detect the compression.
+    validate (bool): Boolean flag to verify that the files exist during the 
+      pipeline creation time.
+  """
+  return ReadFromTFRecord(
+      file_pattern=file_pattern,
+      compression_type=compression_type,
+      validate=validate)
+
+
+def write_to_tfrecord(
+    file_path_prefix: str,
+    coder: Optional[coders.BytesCoder] = coders.BytesCoder(),
+    file_name_suffix: Optional[str] = None,
+    num_shards: Optional[int] = None,
+    shard_name_template: Optional[str] = None,
+    compression_type: Optional[str] = None,
+    no_spilling: Optional[bool] = None):
+  """Writes data to TFRecord.
+
+  Args:
+    file_path_prefix: The file path to write to. The files written will begin
+      with this prefix, followed by a shard identifier (see num_shards), and
+      end in a common extension, if given by file_name_suffix.
+    coder: Coder used to encode each record.
+    file_name_suffix: Suffix for the files written.
+    num_shards: The number of files (shards) used for output. If not set, the
+      default value will be used.
+    shard_name_template: A template string containing placeholders for
+      the shard number and shard count. When constructing a filename for a
+      particular shard number, the upper-case letters 'S' and 'N' are
+      replaced with the 0-padded shard number and shard count respectively.
+      This argument can be '' in which case it behaves as if num_shards was
+      set to 1 and only one file will be generated. The default pattern used
+      is '-SSSSS-of-NNNNN' if None is passed as the shard_name_template.
+    compression_type: Used to handle compressed output files. Typical value
+      is CompressionTypes.AUTO, in which case the file_path's extension will
+      be used to detect the compression.
+    no_spilling: Used to skip the spilling of data caused by having
+      maxNumWritersPerBundle.
+
+  Returns:
+    A WriteToTFRecord transform object.
+  """
+  return WriteToTFRecord(

Review Comment:
   Because WriteToTFRecord takes in a PCollection of bytes, but yaml returns a 
PCollection of rows, we need a small conversion layer here - similar to 
https://github.com/apache/beam/blob/d030e3f2e644205705a033bab7275229508b420a/sdks/python/apache_beam/yaml/yaml_io.py#L70
   
   We will probably need a similar construct for reading to map it to rows



##########
sdks/java/core/src/main/java/org/apache/beam/sdk/io/TFRecordReadSchemaTransformConfiguration.java:
##########
@@ -0,0 +1,111 @@
+/*

Review Comment:
   Once this is ready for full review (IMO, that is probably now, if we can
add a few integration tests), I'd recommend doing a separate PR for 
read/write since they're not really tied together at all. That will make it 
easier to review/iterate. It also will hopefully unblock the read review 
while dealing with the issues you're seeing on write



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@beam.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to