Hi Chao,

Some concerns came up about the 'mask' feature prior to this thread, and they were already addressed by Xinli. See PARQUET-1792 for details. I don't have anything else to add here and am happy to review the code changes related to these features.
Cheers,
Gabor

On Thu, Feb 20, 2020 at 3:58 AM Chao Sun <[email protected]> wrote:

> Bumping it up. Would love to get some feedback from the community.
>
> Best,
> Chao
>
> On Sun, Feb 16, 2020 at 7:20 PM Xinli shang <[email protected]> wrote:
>
> > Hi all,
> >
> > I am developing tools to prune or mask some Parquet file columns for
> > cost-saving or security & compliance purposes. I want to collect your
> > thoughts on usefulness or concerns. Please reply to this email or comment
> > on the tickets (PARQUET-1791
> > <https://issues.apache.org/jira/browse/PARQUET-1791>, PARQUET-1792
> > <https://issues.apache.org/jira/browse/PARQUET-1792>).
> >
> > These tools are going to be delivered in the parquet-tools/parquet-cli
> > project and will be used as offline operations on existing files. In the
> > prune case, a new Parquet file will be generated by cloning all the
> > columns except the ones to be pruned. There won't be any decoding/encoding
> > or decompression/compression on the surviving columns, so the throughput
> > is much higher than doing it in query engines. For the mask case, we are
> > going to do the same for the columns that don't need to be masked; the
> > columns to be masked will be replaced with hash values, nulls, or
> > user-defined translations.
> >
> > Why do we need these tools?
> >
> > 1. Analytics tables usually have a lot of columns, but some columns are
> >    never used or stop being used after some time. Removing those columns
> >    can save storage costs; for some large-scale tables, it could save
> >    several million dollars.
> >
> > 2. In some cases (not all, though), the masked value of a sensitive
> >    column, for example the hash of the raw data, could be sufficient for
> >    analytical purposes. So we can replace the raw data with a hash, null,
> >    etc.
> >
> > 3. The alternative would be to do this in query engines, for example by
> >    selecting only the needed columns and saving them into a new table.
> >    But doing it on the Parquet files directly avoids decoding/encoding
> >    and decompression/compression, achieving much higher throughput
> >    (10x+ in my testing). For large-scale tables, high throughput is the
> >    key to success.
> >
> > Please note that this effort is not meant to replace the data masking
> > (obfuscation) effort (PARQUET-1376), which is independent of this and
> > should move forward.
> >
> > Thanks for spending time reading! Any comments are welcome!
> >
> > --
> > Xinli Shang
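For readers skimming the thread, the masking translations the quoted proposal describes (hash, null, or a user-defined function) can be sketched on a toy in-memory columnar table. This is only an illustration of the idea, not the actual parquet-tools/parquet-cli implementation (which operates on Parquet column chunks without re-encoding the untouched columns); all function and parameter names below are hypothetical:

```python
import hashlib

def mask_column(table, column, mode="hash", translate=None):
    """Return a copy of `table` (dict: column name -> list of values)
    with one column masked.

    mode="hash":   replace each value with its SHA-256 hex digest
    mode="null":   replace each value with None
    mode="custom": apply the user-defined `translate` function
    """
    masked = dict(table)  # shallow copy; untouched columns are shared as-is
    if mode == "hash":
        masked[column] = [
            hashlib.sha256(str(v).encode("utf-8")).hexdigest()
            for v in table[column]
        ]
    elif mode == "null":
        masked[column] = [None] * len(table[column])
    elif mode == "custom":
        masked[column] = [translate(v) for v in table[column]]
    else:
        raise ValueError(f"unknown mask mode: {mode}")
    return masked

def prune_columns(table, columns_to_drop):
    """Return a copy of `table` without the pruned columns."""
    return {name: col for name, col in table.items()
            if name not in columns_to_drop}

table = {"user_id": [1, 2], "ssn": ["111-22-3333", "444-55-6666"]}
masked = mask_column(table, "ssn", mode="hash")   # ssn replaced by digests
pruned = prune_columns(table, {"ssn"})            # ssn dropped entirely
```

Note how both operations leave the surviving columns untouched; the real tools exploit the same property at the file level by copying the surviving column chunks byte-for-byte, which is where the throughput advantage over a query-engine rewrite comes from.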
