Hi,

Since a new parquet sync is coming up this week, I would like to revive this topic for discussion.
In short, the point I tried to make was that supporting a specific parquet-format version is currently a "moving target", as we keep adding incompatible features without increasing the major version number. We also don't provide compatibility profiles that would allow writing data that can be read by implementations based on earlier versions.

In my opinion, this is very bad for language bindings and other implementations (like Impala), because they cannot claim parquet support in any release:

- They cannot claim that they support reading data stored according to a specific *major* version of parquet-format (e.g., parquet-format 2), because we keep adding incompatible features to parquet-format.
- If they claim that they support reading data stored according to a *very specific* version of parquet-format (e.g., parquet-format 2.3.2), it will not really help their users, because without compatibility profiles users will not be able to force other components to *write* data according to that very specific version.

Please let me know your thoughts.

Thanks,

Zoltan

On Mon, Jan 29, 2018 at 6:56 PM Zoltan Ivanfi <z...@cloudera.com> wrote:

> Hi,
>
> I have noticed that the recent addition of new compressions to
> parquet-format happened in a patch version (in Semantic Versioning
> terminology). I think this is a problem.
>
> Consider a theoretical library (or application) that implemented Parquet
> according to parquet-format 2.3.0. By implementing a read path for all
> features of parquet-format 2.3.0, this library should be able to
> confidently state that it supports reading parquet-format 2 files. This is
> the reason why minor and patch versions should not contain any breaking
> changes. For example, when new data pages were added, the major format
> version was bumped up from 1 to 2, so that claims of being able to read
> parquet-format 1 files remained valid even after introducing this breaking
> change.
> (This is not necessary for changes that are backwards-compatible,
> e.g., the addition of optional statistics fields.)
>
> The new compressions were added in parquet-format 2.3.2 and when a writer
> uses one of these new compressions, older readers won't be able to handle
> that. For this breaking change, the major version should have been
> incremented, so that earlier claims of being able to read parquet-format 2
> would have remained valid. Since this change is already released, I don't
> know whether we can do anything about it at this point.
>
> On the write path, libraries/applications can support multiple
> parquet-format versions at the same time. This is not mandatory, but when
> implemented (like in parquet-mr), it allows users to write data files
> targeted for data consumers that only support an earlier version. So, by
> using WriterVersion.PARQUET_1_0, the resulting files will be consumable by
> parquet-format 1 readers. Even if the writer itself supports new
> parquet-format 2 features as well, it will not use them when writing these
> files (again, unless said features are backwards-compatible). Implementing
> this, however, requires a careful comparison of parquet-format 1 and
> parquet-format 2. I think this is a problem in itself, which is made even
> worse by the lack of a "version boundary" for the new compressions.
>
> I propose the following:
>
> - Introduce a notion of "profiles" for writing data that put
>   constraints on the set of breaking features that writers can use. This
>   would correspond to the current WriterVersion.PARQUET_1_0/PARQUET_2_0
>   distinction, but it would be defined and specified in the parquet-format
>   documentation instead, so that implementors would not have to compare
>   different parquet-format versions to see what features were not available
>   in earlier versions.
> - Introduce a new WriterVersion.PARQUET_2_3_2 (or
>   WriterVersion.PARQUET_2_4) that allows using the new compressions and
>   prohibit their usage in PARQUET_2_0 at the same time.
> - Review the history of parquet-format, looking for other potential
>   breaking features that need to be prohibited in PARQUET_2_0.
> - Create a file metadata field for the profile used for writing the
>   file. This would mainly serve troubleshooting purposes: users (and support
>   personnel) should be able to check what profile configuration was in effect
>   when writing the files. Data consumers, however, should not refuse reading
>   data just because it was written with an unsupported profile, since the new
>   features may not have been actually used for the specific file (for
>   example, one can write a PARQUET_2_3_2 file without using the new
>   compressions, in which case older readers will be able to consume it).
>
> What do you think of this problem/proposal?
>
> Thanks,
>
> Zoltan
>
> On Thu, Jan 25, 2018 at 5:25 PM Zoltan Ivanfi <z...@cloudera.com> wrote:
>
>> Hi,
>>
>> Regarding whether there still remains a need for v2 pages after column
>> indices: According to the comment above the v2 pages, their main advantage
>> is that repetition and definition levels can be read without decompressing
>> the data. I don't know what the exact use case is for doing so (please let
>> me know if you do), but this is something that v1 pages do not allow.
>>
>> Regarding whether it is important to differentiate between v1 and v2
>> pages: Since Impala does not support v2 pages yet, this information seems
>> to be useful to me. But we can even go one step further: Even if we only
>> write v1 pages, one could still end up with a file not readable by Impala
>> by using non-compatible parquet-format 2 features, like RLE dictionary.
>> Today writers generally use v2 pages for files that take advantage of
>> format-2 features and v1 pages for files that only use format-1 features.
>> Consequently, in practice the data page version seems to be a good
>> indicator of which format version the file uses, although there should be a
>> more reliable way to identify format-2 Parquet files. I haven't found it
>> yet, though. Did I overlook something?
>>
>> Thanks,
>>
>> Zoltan
>>
>> On Wed, Jan 24, 2018 at 5:47 PM Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>>> I don't think there's much value in knowing what pages were written. I'm
>>> not even sure we need the v2 pages any more, since we've stated that
>>> pages should end on record boundaries if you're writing the page index
>>> structures. Maybe we should just add an optional record count to v1 and
>>> stop using v2 pages.
>>>
>>> rb
>>>
>>> On Wed, Jan 24, 2018 at 8:07 AM, Zoltan Ivanfi <z...@cloudera.com> wrote:
>>>
>>> > Hi,
>>> >
>>> > We were looking for information in parquet-tools's output that would
>>> > tell us whether a given Parquet file uses v1 or v2 pages but haven't
>>> > found any. We also checked parquet-cli and haven't found anything
>>> > specifically for this purpose there either, but we noticed in the
>>> > source code that the strings produced for v1 and v2 pages
>>> > <https://github.com/apache/parquet-mr/blob/master/parquet-cli/src/main/java/org/apache/parquet/cli/commands/ShowPagesCommand.java#L190>
>>> > are ever-so-slightly different: for v1 pages no row count value is
>>> > printed (since there is no such field).
>>> >
>>> > After this discovery we could tell Parquet files using v1 and v2 pages
>>> > apart by looking at the row counts, but I think there should be a
>>> > nicer way to do that. Should we add this functionality or did we
>>> > overlook an already existing one?
>>> >
>>> > Thanks,
>>> >
>>> > Zoltan
>>> >
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>