trying to project columns without authorization can be very costly, for two reasons:
- unnecessary per-column/file calls to the (remote) KMS service, plus the cost of per-call authorization checks
- red-flagging unauthorized calls and triggering "breach attempt" alerts

IMO, the best way to handle this is to have a layer on top of Parquet that gets the list of authorized columns for the reader (e.g. from a policy engine) and allows projecting only those columns, returning nulls for the others.

Cheers, Gidon

On Thu, Oct 27, 2022 at 1:01 AM nicolas paris <[email protected]> wrote:

> hello,
>
> as mentioned in several places [1], from a data analyst point of view,
> having null values for encrypted columns when one has no key to decrypt
> is better than getting exceptions, and eases data exploration by
> allowing select * instead of writing out each allowed column.
>
> I have been digging through the crypto source code to find an easy way
> to catch crypto exceptions and turn values to null from the
> DecryptionPropertiesFactory that can be passed to the query engine
> through Hadoop configs.
>
> I might be missing something, but I haven't found a way to tell the
> ParquetReader to put nulls and go ahead reading un-encrypted columns
> when something goes wrong with the KMS.
>
> Is such behavior available, or are you willing to add such a feature at
> the Parquet level in the future?
>
> Thanks
>
>
> [1]
> https://www.uber.com/en-FR/blog/one-stone-three-birds-finer-grained-encryption-apache-parquet/
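
PS: to make the suggestion above more concrete, here is a rough sketch of what such a layer might look like. It is plain Python with no Parquet dependency. The names project_authorized and read_columns, and the way the authorized set is obtained, are all hypothetical and not part of any Parquet API. The point is only that unauthorized columns are never requested from the reader (so no KMS call is made for them) and are filled with nulls afterwards.

```python
from typing import Any, Callable, Dict, List, Sequence, Set


def project_authorized(
    schema_columns: Sequence[str],
    authorized: Set[str],
    read_columns: Callable[[List[str]], List[Dict[str, Any]]],
) -> List[Dict[str, Any]]:
    """Read only authorized columns and null-fill the rest.

    schema_columns: all column names in the file schema, in order.
    authorized:     columns the caller may read (assumed to come from
                    a policy engine; how it is fetched is out of scope).
    read_columns:   hypothetical callable wrapping the real reader; it
                    takes a projection list and returns rows as dicts.
    """
    # Project only what the policy allows, so the underlying reader
    # never asks the KMS for keys of unauthorized columns.
    allowed = [c for c in schema_columns if c in authorized]
    if not allowed:
        # Nothing readable: in this sketch the row count is unknown
        # without touching a column, so return no rows.
        return []

    rows = read_columns(allowed)

    # Re-expand each row to the full schema, with None standing in
    # for every column the caller is not authorized to read.
    return [
        {c: (row.get(c) if c in authorized else None) for c in schema_columns}
        for row in rows
    ]
```

With this shape, a select * from the analyst's side becomes a call with the full schema, and the exception path in the crypto code is never reached for the columns the policy excludes.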
