Hi,

Please let me know your opinions as well. So far all concerns were only
raised by me, which may indicate that other community members do not
consider this issue serious and in this case my suggestions may be
excessive and unjustified.

Just to clarify: The data structures for the encrypion feature are unlikely
to change. I solely suggest these measures to remedy the situation of
parquet.thrift having been released in parquet-cpp before officially
getting released in parquet-format.

Thanks,

Zoltan

On Wed, Sep 26, 2018 at 2:29 PM Zoltan Ivanfi <z...@cloudera.com> wrote:

> Hi,
>
> I think it's safer if we skip id 8 altogether and use id 9 for the new
> crypto structure. This way we don't have to worry about remaining backwards
> compatible with the accidentally released structure.
>
> Br,
>
> Zoltan
>
> On Wed, Sep 26, 2018 at 2:25 PM Gidon Gershinsky <gg5...@gmail.com> wrote:
>
>> Hi,
>>
>> I think we should use this id for its current purpose. This field had been
>> defined and merged months ago, and should be there is any scenario.
>> Except for the last week's change, the encryption format had been stable
>> for a while now. The timing of this change was unfortunate; but the change
>> was minor and done for a good reason.
>> I hope there won't be new disturbances from now on.
>>
>> Cheers, Gidon.
>>
>>
>> On Wed, Sep 26, 2018 at 3:10 PM Zoltan Ivanfi <z...@cloudera.com.invalid>
>> wrote:
>>
>> > Hi,
>> >
>> > It seems that I spoke too early. I just noticed that a new field was
>> added
>> > to the ColumnChunk struct in
>> >
>> >
>> https://github.com/apache/parquet-cpp/pull/463/files#diff-0589b447b73e51c88d9b2bdcb0957084R708
>> > Although the field is optional, we can't pretend it was never released,
>> > because parquet-cpp already expects that field for the id 8. Therefore,
>> if
>> > we were to add a different field with id 8 to the ColumnChunk struct,
>> the
>> > thrift parser of parquet-cpp 1.5.0 would probably throw an error when
>> > reading new files, because it would try to read the ColumnChunk struct
>> > including all of its fields (regardless of whether the code actually
>> uses
>> > them or not) and the field with id 8 would not match its Thrift schema.
>> >
>> > We still can avoid breaking changes in the ColumnChunk struct by either
>> > using id 8 for its current purpose or not using it at all (skip id 8 and
>> > use 9 for the next field we add).
>> >
>> > Br,
>> >
>> > Zoltan
>> >
>> > On Wed, Sep 26, 2018 at 1:41 PM Zoltan Ivanfi <z...@cloudera.com> wrote:
>> >
>> > > Hi,
>> > >
>> > > If the encryption code release in parquet-cpp is unused at this moment
>> > > then I think we are fine. It means that we are still free to decide
>> any
>> > way
>> > > about the data structures without the risk of incompatility issues. In
>> > this
>> > > case I would suggest to proceed as we planned at the Parquet sync.
>> > >
>> > > Thanks,
>> > >
>> > > Zoltan
>> > >
>> > > On Wed, Sep 26, 2018 at 7:49 AM Gidon Gershinsky <gg5...@gmail.com>
>> > wrote:
>> > >
>> > >> That's correct. A layer that runs the crypto classes to encrypt pages
>> > and
>> > >> structures is not merged yet. Even if the parquet.thrift is out,
>> there
>> > is
>> > >> virtually no chance encrypted files are created with it, and
>> certainly
>> > not
>> > >> in production.
>> > >>
>> > >> Given a (hopefully mild) headache this last-minute
>> feature/requirement
>> > has
>> > >> caused, let me add details on where it is coming from. I received the
>> > >> requirement from a company that got an internal pushback from teams
>> > >> working
>> > >> with unencrypted columns only, and unwilling to update Parquet libs.
>> > They
>> > >> need support for a transition period for the current readers.
>> > >>
>> > >> We have an option to say no, and stick to the original goal list.
>> But I
>> > >> think the requirement makes sense, even if coming very late - and
>> will
>> > >> significantly ease adoption of the Parquet encryption. Also, it turns
>> > out
>> > >> to be easy to implement, with minor changes in the code.
>> > >>
>> > >> In any case, its certainly a good idea to keep a single copy of
>> > >> parquet.thrift in the format repo.
>> > >>
>> > >> Cheers, Gidon
>> > >>
>> > >> On Wed, Sep 26, 2018 at 12:55 AM Wes McKinney <wesmck...@gmail.com>
>> > >> wrote:
>> > >>
>> > >> > AFAIK encryption is not fully supported in parquet-cpp 1.5.0 so
>> > >> > reading and writing files with encryption is not possible. Am I
>> wrong
>> > >> > about that?
>> > >> >
>> > >> > I'm wholly supportive of changing the project to use a released
>> > >> > version of the Parquet format, so let's do that ASAP.
>> > >> > On Tue, Sep 25, 2018 at 1:30 PM Gidon Gershinsky <gg5...@gmail.com
>> >
>> > >> wrote:
>> > >> > >
>> > >> > > Yep! (sent in parallel :)
>> > >> > >
>> > >> > > On Tue, Sep 25, 2018 at 8:19 PM Zoltan Ivanfi
>> > <z...@cloudera.com.invalid
>> > >> >
>> > >> > > wrote:
>> > >> > >
>> > >> > > > Hi,
>> > >> > > >
>> > >> > > > As a short update, I just checked the PR for PARQUET-1419 and
>> > >> although
>> > >> > in
>> > >> > > > its current form it is a breaking change, it can be easily
>> > >> rewritten to
>> > >> > > > become backwards-compatible so this part of the problem does
>> not
>> > >> apply
>> > >> > any
>> > >> > > > more.
>> > >> > > >
>> > >> > > > Br,
>> > >> > > >
>> > >> > > > Zoltan
>> > >> > > >
>> > >> > > > On Tue, Sep 25, 2018 at 7:10 PM Zoltan Ivanfi <z...@cloudera.com
>> >
>> > >> wrote:
>> > >> > > >
>> > >> > > > > Hi,
>> > >> > > > >
>> > >> > > > > On the Parquet sync we discussed that the practice of
>> > maintaining
>> > >> a
>> > >> > copy
>> > >> > > > > of parquet.thrift in parquet-cpp is dangerous and that we
>> must
>> > >> take
>> > >> > care
>> > >> > > > to
>> > >> > > > > not release parquet-format changes in parquet-cpp before we
>> > >> > officially
>> > >> > > > > release them in parquet-format. As I got back to my computer
>> and
>> > >> > started
>> > >> > > > to
>> > >> > > > > create a JIRA about this, I noticed that unfortunately this
>> has
>> > >> > already
>> > >> > > > > happened.
>> > >> > > > >
>> > >> > > > > The encryption-releated parquet.thrift changes have not only
>> > been
>> > >> > added
>> > >> > > > to
>> > >> > > > > only parquet-format, but to parquet-cpp as well, and these
>> > changes
>> > >> > got
>> > >> > > > > released in parquet-cpp 1.5.0. This is very unfortunate,
>> because
>> > >> > > > > PARQUET-1419 would change the encryption in a breaking way,
>> > which
>> > >> is
>> > >> > only
>> > >> > > > > acceptable as long as the original is not released.
>> > Additionally,
>> > >> it
>> > >> > has
>> > >> > > > > been discussed that a formal voting should take place before
>> > >> > > > incorporating
>> > >> > > > > the encryption features in the format.
>> > >> > > > >
>> > >> > > > > Now that parquet-cpp has already shipped these changes, we
>> must
>> > >> > choose
>> > >> > > > the
>> > >> > > > > lesser evil of the following two options:
>> > >> > > > >
>> > >> > > > >    - Release a parquet-cpp 1.6.0 with a breaking change and
>> risk
>> > >> that
>> > >> > > > >    encrypted data files already written with parquet-cpp
>> 1.5.0
>> > >> will
>> > >> > not
>> > >> > > > be
>> > >> > > > >    readable any more.
>> > >> > > > >    - Release the encryption in parquet-format as it is,
>> > >> regardless of
>> > >> > > > >    voting results and discard PARQUET-1419.
>> > >> > > > >
>> > >> > > > > Personally I have a hard time deciding which one I consider
>> > lesser
>> > >> > evil.
>> > >> > > > > What are your opinions?
>> > >> > > > >
>> > >> > > > > Thanks,
>> > >> > > > >
>> > >> > > > > Zoltan
>> > >> > > > >
>> > >> > > >
>> > >> >
>> > >>
>> > >
>> >
>>
>

Reply via email to