I'll try to spend an hour or two or so looking at the C++ encryption
patch this week. I'm going to be looking at:

* Public API changes -- it it's possible to make the encryption stuff
as unobtrusive to non-encrypted users, that would be preferred
* Stylistic problems: inconsistent naming style, etc.
* Code duplication

I notice at a glance that some very large methods have been written,
or that 50-100 lines or more of code have been added to methods. This
is sometimes a sign of a code factoring issue, and I find that
factoring code into helper methods makes it both easier to read and
easier to maintain.

Follow on patches will be decidedly required in order to address some
of these issues. In the meantime, CI is broken (the unit tests aren't
passing, and maybe some other issues) so nothing can be merged until
that is taken care of.

On Thu, May 30, 2019 at 7:42 AM Gidon Gershinsky <[email protected]> wrote:
>
> Thinking of this, it occurred to me that the words "interop" and "test"
> might be actually misleading in this context.
> What we have, is a C++ sample, similar to the current reader-writer.cc or
> the new encryption-reader-writer.cc (both located under the
> cpp/examples/parquet/low-level-api).
> The encryption-reader-writer.cc  writes one file using one encryption
> config, and reads it with one decryption config.
> Our sample is the same, but it writes N files with N different encryption
> configs, and reads all these files using M different decryption configs.
> All encryption/decryption configs are meaningful, we're adding detailed
> comments to each, so this sample becomes an "encryption HelloWorld" on
> steroids.
> A good place to see the code usage of API for different
> encryption/decryption modes.
>
> Now, since we have an identical sample in Java, we can test the
> interoperability - when files are produced by Java sample and read by C++
> sample, or vice versa.
>
> Given the first part, maybe we should place it together with other samples,
> under cpp/examples/parquet/low-level-api in the apache/arrow repo.
> This has an added benefit of getting built together with arrow/parquet,
> without dependency on other repos.
>
> Here goes the current version:
> https://github.com/newrevit13/arrow/blob/interop_tests/cpp/examples/parquet/low-level-api/encryption-reader-writer-all-crypto-options.cc
>
>
> Cheers, Gidon.
>
> On Thu, May 30, 2019 at 12:45 AM Wes McKinney <[email protected]> wrote:
>
> > If there are encryption integration testing utilities there's not much
> > reason not to put them in
> >
> > https://github.com/apache/arrow/tree/master/cpp/tools/parquet
> >
> > (or similar)
> >
> > Then there is no additional build system to maintain
> >
> > On Wed, May 29, 2019 at 4:41 PM Deepak Majeti <[email protected]>
> > wrote:
> > >
> > > Hi Gidon,
> > >
> > > We should definitely add encrypted parquet files written by parquet-mr in
> > > the parquet-testing repo data folder.
> > > I also like the idea of keeping the source files in a separate "src"
> > folder
> > > in that repository. I see no downside in doing so.
> > >
> > > On Tue, May 28, 2019 at 9:23 PM Gidon Gershinsky <[email protected]>
> > wrote:
> > >
> > > > Hi Wes,
> > > >
> > > > Yep, I meant private as in not under apache/arrow. Will be more careful
> > > > with that term.
> > > >
> > > > https://github.com/apache/parquet-testing  seems to currently contain
> > only
> > > > data (.parquet) files.
> > > > Our samples are C++ and Java code files, that write and read .parquet
> > files
> > > > in different encryption configurations.
> > > > We can add these encrypted .parquet files to the apache/parquet-testing
> > > > repo.
> > > > Let us know if we also should add the samples themselves there (by
> > creating
> > > > src folders in this repo), or add them to the apache/arrow/mr repos,
> > or to
> > > > keep them in our personal repos (the ones I've called private before).
> > > > These samples are based on the standard parquet samples (such as
> > > >
> > > >
> > https://github.com/apache/arrow/blob/master/cpp/examples/parquet/low-level-api/reader-writer.cc
> > > > ),
> > > > with a loop that creates different encryption configs, and
> > writes/reads the
> > > > same data with them.
> > > >
> > > > Cheers, Gidon.
> > > >
> > > >
> > > > On Tue, May 28, 2019 at 5:51 PM Wes McKinney <[email protected]>
> > wrote:
> > > >
> > > > > hi Gidon,
> > > > >
> > > > > Surely you don't mean "private" private -- this was always on GitHub
> > > > > right (otherwise this runs against the Apache Way)?
> > > > >
> > > > > I think we should try to have all of the testing code in the upstream
> > > > > repositories. If there are sample files you need to check in (and
> > they
> > > > > are not too large), please use the apache/parquet-testing repository
> > > > >
> > > > > - Wes
> > > > >
> > > > > On Tue, May 28, 2019 at 8:00 AM Gidon Gershinsky <[email protected]>
> > > > wrote:
> > > > > >
> > > > > > Hi Deepak,
> > > > > >
> > > > > > Sounds good. Here goes the current cpp interop sample (not final
> > yet).
> > > > We
> > > > > > have a sample with a similar code in Java.
> > > > > >
> > > > >
> > > >
> > https://github.com/newrevit13/arrow/blob/interop_tests/cpp/examples/parquet/low-level-api/encryption-interop-tests.cc
> > > > > >
> > > > > >
> > > > > > Please let me know if we should add this file to the cpp PR and its
> > > > > staging
> > > > > > branch (under parquet/low-level-api, or another folder?)
> > > > > > - or leave this file in our private branch.
> > > > > >
> > > > > > Cheers, Gidon.
> > > > > >
> > > > > > On Tue, May 28, 2019 at 7:25 AM Deepak Majeti <
> > [email protected]
> > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Gidon,
> > > > > > >
> > > > > > > I agree with your point 2). I merged the PR for OpenSSL build
> > > > support.
> > > > > We
> > > > > > > can do the same for PR-3520 after you push the changes this
> > week. We
> > > > > can
> > > > > > > then simply rebase PR-2555 and make the final review.
> > > > > > > For 1) I would like to take a look at the inter-op testing
> > framework
> > > > > first.
> > > > > > > Can you share the inter-op sample? Thanks.
> > > > > > >
> > > > > > >
> > > > > > > On Mon, May 27, 2019 at 7:21 PM Gidon Gershinsky <
> > [email protected]>
> > > > > wrote:
> > > > > > >
> > > > > > > > Hi Deepak,
> > > > > > > >
> > > > > > > > If it simplifies the code review, we can make this change.
> > > > Actually,
> > > > > > > we're
> > > > > > > > already creating a single branch that has the parquet-cpp with
> > > > > encryption
> > > > > > > > in one place, ready for build and sample run. This is a private
> > > > > branch -
> > > > > > > > but we can do the same for the pull request (2555), making it
> > the
> > > > > staging
> > > > > > > > branch. However, there are a number of technical details we
> > need to
> > > > > > > address
> > > > > > > > first:
> > > > > > > >
> > > > > > > > 1) The private branch we're creating contains not only the cpp
> > > > > encryption
> > > > > > > > sample, but also an additional interop sample that checks
> > > > > > > interoperability
> > > > > > > > with parquet-mr encryption.
> > > > > > > > Do we want to add this interop sample to the staging branch?
> > Should
> > > > > this
> > > > > > > > sample be merged in arrow/parquet-cpp for future interop
> > testing
> > > > with
> > > > > > > Java
> > > > > > > > encryption? Or this sample can stay in the staging branch for
> > the
> > > > > > > duration
> > > > > > > > of the review process, and be removed just before the staging
> > PR is
> > > > > > > merged
> > > > > > > > in arrow/parquet-cpp? Or we dont add this interop sample to the
> > > > > staging
> > > > > > > > branch at all, and keep our private branch for interop
> > testing? Let
> > > > > us
> > > > > > > know
> > > > > > > > your preference.
> > > > > > > >
> > > > > > > > 2) PR 3520 has been reviewed. We'll make the requested changes
> > this
> > > > > week,
> > > > > > > > and update the PR. It adds 2 new files, and doesnt touch the
> > > > existing
> > > > > > > ones
> > > > > > > > - if you guys will be ok with the updated version, you can
> > merge
> > > > it,
> > > > > and
> > > > > > > > then the staging PR (2555) will automatically pick the 3520
> > patch
> > > > > upon
> > > > > > > > rebasing to the then-current master state. But, if further
> > changes
> > > > > will
> > > > > > > be
> > > > > > > > required in the #3520 files, we'll add them to the staging PR.
> > > > > > > >
> > > > > > > > Cheers, Gidon.
> > > > > > > >
> > > > > > > > On Fri, May 24, 2019 at 9:02 PM Deepak Majeti <
> > > > > [email protected]>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Gidon and Tham,
> > > > > > > > >
> > > > > > > > > I opened a PR for OpenSSL build support here:
> > > > > > > > > https://github.com/apache/arrow/pull/4384/files
> > > > > > > > > However, your PRs need to be rebased/merged with master.
> > > > > > > > > Why don't we keep one staging PR for C++? Probably make
> > > > > > > > > https://github.com/apache/arrow/pull/2555 the staging PR and
> > > > > > > cherry-pick
> > > > > > > > > https://github.com/apache/arrow/pull/3520/files and the
> > above PR
> > > > > onto
> > > > > > > > it.
> > > > > > > > > The later PRs are tiny anyway. This simplifies code review,
> > > > syncing
> > > > > > > with
> > > > > > > > > the master, as well as building the encryption work while
> > > > > reviewing.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Wed, May 22, 2019 at 12:43 PM Gidon Gershinsky <
> > > > > [email protected]>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Thanks guys. We'll create a doc with a description of the
> > > > current
> > > > > > > > interop
> > > > > > > > > > tests, including details on how to build/run them.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Wed, May 22, 2019 at 7:22 PM Deepak Majeti <
> > > > > > > [email protected]
> > > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > I will make the first attempt to review the C++ code. I
> > > > > reviewed
> > > > > > > part
> > > > > > > > > of
> > > > > > > > > > it
> > > > > > > > > > > a while ago.
> > > > > > > > > > > Java-C++ inter-operation testing is my main concern. Can
> > you
> > > > > share
> > > > > > > > some
> > > > > > > > > > > details on what has been tested?
> > > > > > > > > > > I need to update myself with the key management part and
> > see
> > > > > if we
> > > > > > > > can
> > > > > > > > > > add
> > > > > > > > > > > some test parquet files written in Java along with the
> > keys
> > > > to
> > > > > the
> > > > > > > > data
> > > > > > > > > > > folder.
> > > > > > > > > > >
> > > > > > > > > > > On Wed, May 22, 2019 at 10:18 AM Wes McKinney <
> > > > > [email protected]
> > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > hi folks,
> > > > > > > > > > > >
> > > > > > > > > > > > I'm extraordinarily busy but it seems likely that the
> > code
> > > > > review
> > > > > > > > for
> > > > > > > > > > > > Parquet C++ is largely going to fall on my shoulders.
> > Since
> > > > > this
> > > > > > > > is a
> > > > > > > > > > > > large patch with many touch-points in the C++
> > codebase, and
> > > > > there
> > > > > > > > > > > > hasn't been much feedback over the course of its
> > > > > development, I
> > > > > > > > would
> > > > > > > > > > > > like to leave thoughtful code review so that we can
> > avoid
> > > > > issues
> > > > > > > > that
> > > > > > > > > > > > may lead to burdensome maintenance issues.
> > > > > > > > > > > >
> > > > > > > > > > > > On the OpenSSL issue -- there are many CMake projects
> > in
> > > > the
> > > > > wild
> > > > > > > > > that
> > > > > > > > > > > > use OpenSSL and there are many examples of how to
> > > > incorporate
> > > > > > > other
> > > > > > > > > > > > dependencies into the Arrow build system. I don't have
> > time
> > > > > to
> > > > > > > work
> > > > > > > > > on
> > > > > > > > > > > > this right now so someone else will have to sort out
> > the
> > > > > > > toolchain
> > > > > > > > > > > > issue
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, May 20, 2019 at 8:48 AM Gidon Gershinsky <
> > > > > > > [email protected]
> > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > <re-sending from another account>
> > > > > > > > > > > > >
> > > > > > > > > > > > > Tham, thank you for this! and for volunteering early
> > for
> > > > > the
> > > > > > > C++
> > > > > > > > > > > version
> > > > > > > > > > > > > work, driving it forward and creating the bulk of the
> > > > > > > parquet-cpp
> > > > > > > > > > > > > encryption code along the way.
> > > > > > > > > > > > >
> > > > > > > > > > > > > @All - this announcement means that two
> > implementations
> > > > of
> > > > > > > > Parquet
> > > > > > > > > > > > > encryption, fully conforming to the formal
> > specification,
> > > > > are
> > > > > > > > > > available
> > > > > > > > > > > > > today. Thanks to Revital for contributing to C++
> > version
> > > > > > > > compliance
> > > > > > > > > > > with
> > > > > > > > > > > > > the encryption spec, and for running a set of basic
> > > > > Java-C++
> > > > > > > > > > encryption
> > > > > > > > > > > > > interoperability tests. We have tested plaintext and
> > > > > encrypted
> > > > > > > > > footer
> > > > > > > > > > > > > modes, GCM and GCM_CTR algorithms, new and legacy
> > > > readers.
> > > > > > > Files
> > > > > > > > > > > written
> > > > > > > > > > > > > with parquet-cpp are successfully parsed by
> > parquet-mr,
> > > > and
> > > > > > > vice
> > > > > > > > > > versa.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Let me also thank Junjie, Nandor, Anna and Xinli for
> > > > their
> > > > > > > > support
> > > > > > > > > > and
> > > > > > > > > > > > vote
> > > > > > > > > > > > > for the encryption specification - along with the PMC
> > > > > folks.
> > > > > > > > > > > > >
> > > > > > > > > > > > > All parquet-format pull requests are merged by now
> > into
> > > > the
> > > > > > > > > > encryption
> > > > > > > > > > > > > branch,
> > > > > > > > > > > > >
> > https://github.com/apache/parquet-format/tree/encryption
> > > > > > > > > > > > >
> > > > > > > > > > > > > The community is welcome to review the parquet-mr
> > pull
> > > > > > > requests,
> > > > > > > > in
> > > > > > > > > > the
> > > > > > > > > > > > > following order:
> > > > > > > > > > > > > https://github.com/apache/parquet-mr/pull/613
> > > > > > > > > > > > > https://github.com/apache/parquet-mr/pull/614
> > > > > > > > > > > > > https://github.com/apache/parquet-mr/pull/643
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Currently, an end-to-end implementation of Java (mr)
> > > > > Parquet
> > > > > > > > > > encryption
> > > > > > > > > > > > is
> > > > > > > > > > > > > collected in this branch:
> > > > > > > > > > > > > https://github.com/ggershinsky/parquet-mr/tree/encr
> > > > > > > > > > > > > Thanks to Xinli for working with this branch code,
> > and
> > > > > > > > contributing
> > > > > > > > > > to
> > > > > > > > > > > it
> > > > > > > > > > > > > based on his field experience. Everybody is welcome
> > to do
> > > > > the
> > > > > > > > same.
> > > > > > > > > > > > > @All - it would be helpful to review & merge the
> > above
> > > > PRs
> > > > > in
> > > > > > > > > > > > > apache/parquet-mr/encryption, so that folks can work
> > with
> > > > > it
> > > > > > > > > instead
> > > > > > > > > > of
> > > > > > > > > > > > my
> > > > > > > > > > > > > private branch..
> > > > > > > > > > > > >
> > > > > > > > > > > > > And I certainly second Tham's call to review & merge
> > the
> > > > > > > > > parquet-cpp
> > > > > > > > > > > pull
> > > > > > > > > > > > > requests. By now, we have a number of companies
> > starting
> > > > to
> > > > > > > > utilize
> > > > > > > > > > > > Parquet
> > > > > > > > > > > > > encryption (both C++ and Java), including IBM.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Cheers, Gidon.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Mon, May 20, 2019 at 1:40 PM Tham Ha <
> > [email protected]
> > > > >
> > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi community,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > After a long time of development, I'm honor to
> > announce
> > > > > that
> > > > > > > we
> > > > > > > > > > have
> > > > > > > > > > > > just
> > > > > > > > > > > > > > completed C++ parquet encryption module which
> > > > implements
> > > > > > > > > encryption
> > > > > > > > > > > in
> > > > > > > > > > > > low
> > > > > > > > > > > > > > level api and with examples included.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > To have this feature completed, I would like to
> > thank
> > > > > Gidon
> > > > > > > and
> > > > > > > > > > > > Revital for
> > > > > > > > > > > > > > their contribution.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Gidon had a key role in encryption design and in
> > > > writing
> > > > > Java
> > > > > > > > > > version
> > > > > > > > > > > > code
> > > > > > > > > > > > > > on which we based on to write C++ version. He also
> > > > wrote
> > > > > > > crypto
> > > > > > > > > > > > package in
> > > > > > > > > > > > > > C++ version.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Revital and me has been joining together in
> > writing C++
> > > > > > > > version.
> > > > > > > > > > > > Revital
> > > > > > > > > > > > > > was responsible for AAD calculations, API updating
> > (to
> > > > > be the
> > > > > > > > > same
> > > > > > > > > > > with
> > > > > > > > > > > > > > Java version) and Java-C++ inter-operation
> > testing. I
> > > > was
> > > > > > > > writing
> > > > > > > > > > the
> > > > > > > > > > > > first
> > > > > > > > > > > > > > draft (properties, metadata, writer, reader) and
> > > > keeping
> > > > > them
> > > > > > > > > > updated
> > > > > > > > > > > > when
> > > > > > > > > > > > > > crypto package change.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > We have had a great time to cooperate. Thank Gidon
> > and
> > > > > > > Revital
> > > > > > > > > for
> > > > > > > > > > > all
> > > > > > > > > > > > > > guide and experience I have received from them,
> > too.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Here are the links of pull requests:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1) encryption module (properties, metadata, writer,
> > > > > reader):
> > > > > > > > > > > > > > https://github.com/apache/arrow/pull/2555.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 2) some merged pull requests for new thrift
> > structure
> > > > and
> > > > > > > > crypto
> > > > > > > > > > > > algorithm,
> > > > > > > > > > > > > > and one still open:
> > > > > > > https://github.com/apache/arrow/pull/3520
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > However, in order to make (1) buildable with
> > current
> > > > > build
> > > > > > > > > scripts,
> > > > > > > > > > > we
> > > > > > > > > > > > need
> > > > > > > > > > > > > > “adding openssl in C++ build toolchain” which is
> > > > > mentioned in
> > > > > > > > > this
> > > > > > > > > > > > jira:
> > > > > > > > > > > > > > https://issues.apache.org/jira/browse/ARROW-4302.
> > I
> > > > > will be
> > > > > > > > > > grateful
> > > > > > > > > > > > if
> > > > > > > > > > > > > > someone could help fullfill this work.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > About current pull requests, they has been
> > currently
> > > > > using in
> > > > > > > > our
> > > > > > > > > > > > > > development phase at Emotiv (
> > https://www.emotiv.com/).
> > > > > We
> > > > > > > love
> > > > > > > > > > using
> > > > > > > > > > > > > > parquet files to store EEG data. We are going to
> > > > release
> > > > > a
> > > > > > > > > product
> > > > > > > > > > > with
> > > > > > > > > > > > > > encrypted parquet files soon and look forward to
> > the
> > > > > official
> > > > > > > > > > release
> > > > > > > > > > > > of
> > > > > > > > > > > > > > parquet encryption feature. So it will be many
> > thank
> > > > and
> > > > > > > great
> > > > > > > > > > honor
> > > > > > > > > > > to
> > > > > > > > > > > > > > have you review and merge them (if qualified).
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thank you very much!
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Tham
> > > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > regards,
> > > > > > > > > > > Deepak Majeti
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > regards,
> > > > > > > > > Deepak Majeti
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > regards,
> > > > > > > Deepak Majeti
> > > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > regards,
> > > Deepak Majeti
> >

Reply via email to