jorisvandenbossche commented on a change in pull request #11031:
URL: https://github.com/apache/arrow/pull/11031#discussion_r699168063
##########
File path: python/pyarrow/parquet.py
##########
@@ -511,11 +513,12 @@ def _sanitize_table(table, new_schema, flavor):
coerce_timestamps : str, default None
Cast timestamps a particular resolution. The defaults depends on `version`.
For ``version='1.0'`` (the default), nanoseconds will be cast to
- microseconds ('us'), and seconds to milliseconds ('ms') by default. For
- ``version='2.0'``, the original resolution is preserved and no casting
- is done by default. The casting might result in loss of data, in which
- case ``allow_truncated_timestamps=True`` can be used to suppress the
- raised exception.
+ microseconds ('us'), and seconds to milliseconds ('ms') by default.
+ For ``version='2.4'``, nanoseconds will be cast to microseconds.
+ For ``version='2.6'``, the original resolution is always preserved.
Review comment:
Maybe clarify that this sentence is only about nanoseconds? (I think
seconds are still cast to milliseconds?)
##########
File path: python/pyarrow/parquet.py
##########
@@ -492,16 +492,18 @@ def _sanitize_table(table, new_schema, flavor):
return table
-_parquet_writer_arg_docs = """version : {"1.0", "2.0"}, default "1.0"
+_parquet_writer_arg_docs = """version : {"1.0", "2.4", "2.6"}, default "1.0"
Determine which Parquet logical types are available for use, whether the
reduced set from the Parquet 1.x.x format or the expanded logical types
- added in format version 2.0.0 and after. Note that files written with
- version='2.0' may not be readable in all Parquet implementations, so
- version='1.0' is likely the choice that maximizes file compatibility. Some
- features, such as lossless storage of nanosecond timestamps as INT64
- physical storage, are only available with version='2.0'. The Parquet 2.0.0
- format version also introduced a new serialized data page format; this can
- be enabled separately using the data_page_version option.
+ added in later format versions.
+ Files written with version='2.4' or '2.6' may not be readable in all
+ Parquet implementations, so version='1.0' is likely the choice that
+ maximizes file compatibility.
+ Logical types and UINT32 are only available with version '2.4'.
Review comment:
We actually write logical types for version 1.0 ..
##########
File path: cpp/src/parquet/metadata.h
##########
@@ -276,7 +276,11 @@ class PARQUET_EXPORT FileMetaData {
/// \throws ParquetException if the index is out of bound.
std::unique_ptr<RowGroupMetaData> RowGroup(int index) const;
- /// \brief Return the version of the file.
+ /// \brief Return the "version" of the file
+ ///
+ /// The Parquet file metadata stores the version as a single integer,
+ /// therefore the returned value can only be an approximation of the
+ /// Parquet format version actually supported by the producer of the file.
Review comment:
Sidenote: It's also not very clear in general what this field is for
(cfr the core features discussion at
https://github.com/apache/parquet-format/pull/164#discussion_r569228238,
parquet-mr always uses 1)
##########
File path: python/pyarrow/parquet.py
##########
@@ -511,11 +513,12 @@ def _sanitize_table(table, new_schema, flavor):
coerce_timestamps : str, default None
Cast timestamps a particular resolution. The defaults depends on `version`.
For ``version='1.0'`` (the default), nanoseconds will be cast to
- microseconds ('us'), and seconds to milliseconds ('ms') by default. For
- ``version='2.0'``, the original resolution is preserved and no casting
- is done by default. The casting might result in loss of data, in which
- case ``allow_truncated_timestamps=True`` can be used to suppress the
- raised exception.
+ microseconds ('us'), and seconds to milliseconds ('ms') by default.
+ For ``version='2.4'``, nanoseconds will be cast to microseconds.
Review comment:
This didn't change compared to version=1.0 (the sentence before)?
##########
File path: cpp/src/parquet/type_fwd.h
##########
@@ -19,8 +19,50 @@
namespace parquet {
+/// \brief Feature selection when writing Parquet files
+///
+/// `ParquetVersion::type` governs which data types are allowed and how they
+/// are represented. For example, uint32_t data will be written differently
+/// depending on this value (as INT64 for PARQUET_1_0, as UINT32 for other
+/// versions).
+///
+/// However, some features - such as compression algorithms, encryption,
+/// or the improved "v2" data page format - must be enabled separately in
+/// ArrowWriterProperties.
struct ParquetVersion {
- enum type { PARQUET_1_0, PARQUET_2_0 };
+ enum type : int {
+ /// Enable only pre-2.0 Parquet format features when writing
+ ///
+ /// This setting is useful for maximum compatibility with legacy readers.
+ PARQUET_1_0,
Review comment:
I don't know into what detail we want to go here, but this is not
"exactly" correct that it only enables pre-2.0 features. We currently already
do write for example TIMESTAMP_MILLIS with `version="1.0"`, although strictly
speaking this was only introduced in Parquet format 2.2. I think it are only
the several integer logical types that we don't write with version 1.0 (as
those can be interpreted incorrectly if you don't understand the type
annotation, since they are physically stored as int32 or int64; which is what
matters most for compatibility with older readers).
So I think from a practical point of view, you can ignore the above nitpick
:)
##########
File path: python/pyarrow/_parquet.pxd
##########
@@ -133,7 +133,9 @@ cdef extern from "parquet/api/schema.h" namespace "parquet"
nogil:
enum ParquetVersion" parquet::ParquetVersion::type":
ParquetVersion_V1" parquet::ParquetVersion::PARQUET_1_0"
- ParquetVersion_V2" parquet::ParquetVersion::PARQUET_2_0"
+ ParquetVersion_V2_0" parquet::ParquetVersion::PARQUET_2_0"
+ ParquetVersion_V2_4" parquet::ParquetVersion::PARQUET_2_4"
+ ParquetVersion_V2_6" parquet::ParquetVersion::PARQUET_2_6"
Review comment:
From a user point of view, would it be useful to have this though? That
you can specify `version="latest"` or `version="2.x"` or something to always
get the latest version? (although that might not be good practice to do anyway)
##########
File path: python/pyarrow/parquet.py
##########
@@ -511,14 +511,14 @@ def _sanitize_table(table, new_schema, flavor):
Write timestamps to INT96 Parquet format. Defaults to False unless enabled
by flavor argument. This take priority over the coerce_timestamps option.
coerce_timestamps : str, default None
- Cast timestamps a particular resolution. The defaults depends on `version`.
- For ``version='1.0'`` (the default), nanoseconds will be cast to
- microseconds ('us'), and seconds to milliseconds ('ms') by default.
- For ``version='2.4'``, nanoseconds will be cast to microseconds.
- For ``version='2.6'``, the original resolution is always preserved.
- The casting might result in loss of data, in which case
- ``allow_truncated_timestamps=True`` can be used to suppress the raised
- exception.
+ Cast timestamps to a particular resolution. If omitted, defaults are chosen
+ depending on `version`. By default, for ``version='1.0'`` (the default)
+ and ``version='2.4'``, nanoseconds are cast to microseconds ('us'), while
+ for other `version` values, they are written natively without loss
+ of resolution. Seconds are always cast to milliseconds ('ms') by default,
+ as Parquet does not have any temporal type with seconds resolution.
+ If the casting results in loss of data, it will raise an exception
+ unless ``allow_truncated_timestamps=True`` is given.
Review comment:
Very clear now!
##########
File path: cpp/src/parquet/type_fwd.h
##########
@@ -19,8 +19,50 @@
namespace parquet {
+/// \brief Feature selection when writing Parquet files
+///
+/// `ParquetVersion::type` governs which data types are allowed and how they
+/// are represented. For example, uint32_t data will be written differently
+/// depending on this value (as INT64 for PARQUET_1_0, as UINT32 for other
+/// versions).
+///
+/// However, some features - such as compression algorithms, encryption,
+/// or the improved "v2" data page format - must be enabled separately in
+/// ArrowWriterProperties.
struct ParquetVersion {
- enum type { PARQUET_1_0, PARQUET_2_0 };
+ enum type : int {
+ /// Enable only pre-2.0 Parquet format features when writing
+ ///
+ /// This setting is useful for maximum compatibility with legacy readers.
+ PARQUET_1_0,
Review comment:
TIMESTAMP_MILLIS is actually only from 2.2, so "pre 2.2" doesn't help
that much ... (but, yeah, this is a mess :))
##########
File path: python/pyarrow/_parquet.pxd
##########
@@ -133,7 +133,9 @@ cdef extern from "parquet/api/schema.h" namespace "parquet"
nogil:
enum ParquetVersion" parquet::ParquetVersion::type":
ParquetVersion_V1" parquet::ParquetVersion::PARQUET_1_0"
- ParquetVersion_V2" parquet::ParquetVersion::PARQUET_2_0"
+ ParquetVersion_V2_0" parquet::ParquetVersion::PARQUET_2_0"
+ ParquetVersion_V2_4" parquet::ParquetVersion::PARQUET_2_4"
+ ParquetVersion_V2_6" parquet::ParquetVersion::PARQUET_2_6"
Review comment:
We can always add this later, in case there is demand for it / a good
use case.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]