robertwb commented on code in PR #23808:
URL: https://github.com/apache/beam/pull/23808#discussion_r1021908251


##########
sdks/python/apache_beam/io/parquetio.py:
##########
@@ -448,6 +451,14 @@ def __init__(
         is '-SSSSS-of-NNNNN' if None is passed as the shard_name_template.
       mime_type: The MIME type to use for the produced files, if the filesystem
         supports specifying MIME types.
+      max_records_per_shard: Maximum number of records to write to any
+        individual shard.
+      max_bytes_per_shard: Target maximum number of bytes to write to any

Review Comment:
   It's impossible to always avoid exceeding the limit, as a single record may 
itself be larger than the bytes-per-shard limit. Even when that is not the 
case, some file formats have a footer/trailer (e.g. checksums, listings, 
indices, delimiters) that would push a file over the limit even if the written 
records stayed under it, so how closely the limit can be honored is really 
file-format-dependent. 
   
   The intent here is that a shard not be "too big", which is a generally 
flexible notion. 
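
   To make the two cases concrete, here is a minimal sketch of a "target" 
byte limit. It is not the code in this PR: `shard_sizes`, `record_sizes`, and 
`footer_bytes` are hypothetical names used only for illustration. It assumes 
record sizes are known up front and that each file pays a fixed footer cost. 

```python
def shard_sizes(record_sizes, max_bytes_per_shard, footer_bytes=0):
    """Return the final size of each shard for a stream of record sizes.

    Hypothetical illustration only; not Beam's implementation.
    """
    sizes = []
    current_bytes = 0
    records_in_shard = 0
    for size in record_sizes:
        # Roll over *before* writing if this record would cross the target,
        # but never emit an empty shard: an oversized record still has to
        # go somewhere.
        if records_in_shard > 0 and current_bytes + size > max_bytes_per_shard:
            sizes.append(current_bytes + footer_bytes)
            current_bytes = 0
            records_in_shard = 0
        current_bytes += size
        records_in_shard += 1
    if records_in_shard > 0:
        # The footer is only added on close, after the limit was checked.
        sizes.append(current_bytes + footer_bytes)
    return sizes


# A single 500-byte record exceeds a 100-byte target on its own.
print(shard_sizes([40, 40, 40, 500, 10], 100))
# Two 48-byte records fit under the target, but an 8-byte footer does not.
print(shard_sizes([48, 48], 100, footer_bytes=8))
```

   Even with the check done before each write, the 500-byte record and the 
footer both produce files above the target, which is why the limit is best 
read as a target rather than a guarantee. 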



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
