pitrou commented on a change in pull request #11031:
URL: https://github.com/apache/arrow/pull/11031#discussion_r699095404



##########
File path: python/pyarrow/_parquet.pxd
##########
@@ -133,7 +133,9 @@ cdef extern from "parquet/api/schema.h" namespace "parquet" nogil:
 
     enum ParquetVersion" parquet::ParquetVersion::type":
         ParquetVersion_V1" parquet::ParquetVersion::PARQUET_1_0"
-        ParquetVersion_V2" parquet::ParquetVersion::PARQUET_2_0"
+        ParquetVersion_V2_0" parquet::ParquetVersion::PARQUET_2_0"
+        ParquetVersion_V2_4" parquet::ParquetVersion::PARQUET_2_4"
+        ParquetVersion_V2_6" parquet::ParquetVersion::PARQUET_2_6"

Review comment:
      For now it's not being used in the PyArrow code. We can expose it when needed.

##########
File path: cpp/src/parquet/column_writer.cc
##########
@@ -1805,6 +1805,8 @@ Status WriteTimestamps(const ::arrow::Array& values, int64_t num_levels,
         maybe_parent_nulls);
   };
 
+  const auto version = writer->properties()->version();

Review comment:
       Done.

##########
File path: cpp/src/parquet/metadata.h
##########
@@ -276,7 +276,11 @@ class PARQUET_EXPORT FileMetaData {
   /// \throws ParquetException if the index is out of bound.
   std::unique_ptr<RowGroupMetaData> RowGroup(int index) const;
 
-  /// \brief Return the version of the file.
+  /// \brief Return the "version" of the file
+  ///
+  /// The Parquet file metadata stores the version as a single integer,
+  /// therefore the returned value can only be an approximation of the
+  /// Parquet format version actually supported by the producer of the file.

Review comment:
      Wow. Thanks for the pointer. We may want to deprecate this API as a separate JIRA, then...
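
      A minimal sketch of reading the stored version from Python, assuming a
      file `example.parquet` exists; `format_version` is pyarrow's accessor
      for this metadata field:

      ```python
      import pyarrow.parquet as pq

      # The file metadata stores a single version integer; pyarrow renders
      # it as a string such as "1.0" or "2.6". As discussed above, this is
      # only an approximation of the features the producer actually used.
      meta = pq.read_metadata("example.parquet")
      print(meta.format_version)
      ```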

##########
File path: cpp/src/parquet/type_fwd.h
##########
@@ -19,8 +19,50 @@
 
 namespace parquet {
 
+/// \brief Feature selection when writing Parquet files
+///
+/// `ParquetVersion::type` governs which data types are allowed and how they
+/// are represented. For example, uint32_t data will be written differently
+/// depending on this value (as INT64 for PARQUET_1_0, as UINT32 for other
+/// versions).
+///
+/// However, some features - such as compression algorithms, encryption,
+/// or the improved "v2" data page format - must be enabled separately in
+/// ArrowWriterProperties.
 struct ParquetVersion {
-  enum type { PARQUET_1_0, PARQUET_2_0 };
+  enum type : int {
+    /// Enable only pre-2.0 Parquet format features when writing
+    ///
+    /// This setting is useful for maximum compatibility with legacy readers.
+    PARQUET_1_0,

Review comment:
      Right. The problem is that, as with PARQUET_2_0, the set of enabled features does not map to a well-defined format version. That said, I may write "pre-2.2" instead.

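      To make the uint32 example concrete, a hedged sketch (it assumes the
      "2.4" value added by this PR is accepted by `pyarrow.parquet.write_table`;
      file names are illustrative):

      ```python
      import pyarrow as pa
      import pyarrow.parquet as pq

      table = pa.table({"x": pa.array([1, 2, 3], type=pa.uint32())})
      pq.write_table(table, "v10.parquet", version="1.0")
      pq.write_table(table, "v24.parquet", version="2.4")

      # PARQUET_1_0 has no unsigned 32-bit annotation, so uint32 data is
      # widened to a signed 64-bit physical column; later versions keep a
      # 32-bit physical column with an unsigned logical annotation.
      print(pq.ParquetFile("v10.parquet").schema.column(0).physical_type)
      # expected: INT64
      print(pq.ParquetFile("v24.parquet").schema.column(0).physical_type)
      # expected: INT32
      ```
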
##########
File path: python/pyarrow/parquet.py
##########
@@ -492,16 +492,18 @@ def _sanitize_table(table, new_schema, flavor):
         return table
 
 
-_parquet_writer_arg_docs = """version : {"1.0", "2.0"}, default "1.0"
+_parquet_writer_arg_docs = """version : {"1.0", "2.4", "2.6"}, default "1.0"
     Determine which Parquet logical types are available for use, whether the
     reduced set from the Parquet 1.x.x format or the expanded logical types
-    added in format version 2.0.0 and after. Note that files written with
-    version='2.0' may not be readable in all Parquet implementations, so
-    version='1.0' is likely the choice that maximizes file compatibility. Some
-    features, such as lossless storage of nanosecond timestamps as INT64
-    physical storage, are only available with version='2.0'. The Parquet 2.0.0
-    format version also introduced a new serialized data page format; this can
-    be enabled separately using the data_page_version option.
+    added in later format versions.
+    Files written with version='2.4' or '2.6' may not be readable in all
+    Parquet implementations, so version='1.0' is likely the choice that
+    maximizes file compatibility.
+    Logical types and UINT32 are only available with version '2.4'.

Review comment:
       Hmm, thanks. This is a mess.
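
      For context, a minimal usage sketch of the `version` argument documented
      above (file names are illustrative; "2.4" and "2.6" assume this PR):

      ```python
      import pyarrow as pa
      import pyarrow.parquet as pq

      table = pa.table({"x": [1, 2, 3]})

      # "1.0" maximizes compatibility with legacy readers; "2.4" and "2.6"
      # opt in to the newer logical types described in the docstring.
      pq.write_table(table, "compat.parquet", version="1.0")
      pq.write_table(table, "modern.parquet", version="2.6")
      ```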

##########
File path: python/pyarrow/_parquet.pxd
##########
@@ -133,7 +133,9 @@ cdef extern from "parquet/api/schema.h" namespace "parquet" nogil:
 
     enum ParquetVersion" parquet::ParquetVersion::type":
         ParquetVersion_V1" parquet::ParquetVersion::PARQUET_1_0"
-        ParquetVersion_V2" parquet::ParquetVersion::PARQUET_2_0"
+        ParquetVersion_V2_0" parquet::ParquetVersion::PARQUET_2_0"
+        ParquetVersion_V2_4" parquet::ParquetVersion::PARQUET_2_4"
+        ParquetVersion_V2_6" parquet::ParquetVersion::PARQUET_2_6"

Review comment:
      I don't know. As you say, it may not be good practice. If people need features from a specific version, they should probably use that version?

##########
File path: python/pyarrow/parquet.py
##########
@@ -511,11 +513,12 @@ def _sanitize_table(table, new_schema, flavor):
 coerce_timestamps : str, default None
     Cast timestamps a particular resolution. The defaults depends on `version`.
     For ``version='1.0'`` (the default), nanoseconds will be cast to
-    microseconds ('us'), and seconds to milliseconds ('ms') by default. For
-    ``version='2.0'``, the original resolution is preserved and no casting
-    is done by default. The casting might result in loss of data, in which
-    case ``allow_truncated_timestamps=True`` can be used to suppress the
-    raised exception.
+    microseconds ('us'), and seconds to milliseconds ('ms') by default.
+    For ``version='2.4'``, nanoseconds will be cast to microseconds.

Review comment:
       Ah, you're right. Parquet doesn't have seconds.
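
      A hedged sketch of the default casting behavior described above,
      assuming the "2.4" value from this PR (file name is illustrative):

      ```python
      import pyarrow as pa
      import pyarrow.parquet as pq

      # Version "2.4" predates the NANOS timestamp unit, so nanosecond
      # data should be cast down to microseconds on write by default.
      table = pa.table({"ts": pa.array([0], type=pa.timestamp("ns"))})
      pq.write_table(table, "ts24.parquet", version="2.4")
      print(pq.ParquetFile("ts24.parquet").schema.column(0).logical_type)
      # expected: a Timestamp logical type with microsecond unit
      ```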

##########
File path: python/pyarrow/parquet.py
##########
@@ -511,11 +513,12 @@ def _sanitize_table(table, new_schema, flavor):
 coerce_timestamps : str, default None
     Cast timestamps a particular resolution. The defaults depends on `version`.
     For ``version='1.0'`` (the default), nanoseconds will be cast to
-    microseconds ('us'), and seconds to milliseconds ('ms') by default. For
-    ``version='2.0'``, the original resolution is preserved and no casting
-    is done by default. The casting might result in loss of data, in which
-    case ``allow_truncated_timestamps=True`` can be used to suppress the
-    raised exception.
+    microseconds ('us'), and seconds to milliseconds ('ms') by default.
+    For ``version='2.4'``, nanoseconds will be cast to microseconds.
+    For ``version='2.6'``, the original resolution is always preserved.

Review comment:
       Indeed, seconds are still cast.
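
      To illustrate both points, a sketch assuming the "2.6" value from this
      PR (file name is illustrative):

      ```python
      import pyarrow as pa
      import pyarrow.parquet as pq

      table = pa.table({
          "s": pa.array([0], type=pa.timestamp("s")),
          "ns": pa.array([0], type=pa.timestamp("ns")),
      })
      pq.write_table(table, "ts26.parquet", version="2.6")

      schema = pq.ParquetFile("ts26.parquet").schema
      # Parquet has no seconds unit, so "s" is still cast (to millis),
      # while nanosecond resolution survives unchanged under "2.6".
      print(schema.column(0).logical_type)  # expected: millisecond unit
      print(schema.column(1).logical_type)  # expected: nanosecond unit
      ```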



