Bumping this up. Would love to get some feedback from the community.

Best,
Chao
On Sun, Feb 16, 2020 at 7:20 PM Xinli Shang <[email protected]> wrote:

> Hi all,
>
> I am developing tools to prune or mask some Parquet file columns for
> cost-saving or security & compliance purposes. I want to collect your
> thoughts on their usefulness and any concerns. Please reply to this email
> or comment on the tickets (PARQUET-1791
> <https://issues.apache.org/jira/browse/PARQUET-1791>, PARQUET-1792
> <https://issues.apache.org/jira/browse/PARQUET-1792>).
>
> These tools are going to be delivered in the Parquet-tools/Parquet-cli
> project and will be used as offline operations on existing files. In the
> prune case, a new Parquet file will be generated by cloning all the
> columns except for the ones to be pruned. There won't be any
> decoding/encoding or decompression/compression of the surviving columns,
> so the throughput is much higher than doing it in query engines. In the
> mask case, we are going to do the same for the columns that don't need to
> be masked. The columns to be masked will be replaced with hash values,
> nulls or user-defined translations.
>
> Why do we need these tools?
>
> 1. Analytics tables usually have a lot of columns, but some columns are
>    never used or are no longer used after some time. Removing those
>    columns can save storage costs. For some large-scale tables, it could
>    save several million dollars.
>
> 2. In some cases (not all, though), the masked value of a sensitive
>    column, for example the hash of the raw data, could be sufficient for
>    analytical purposes. So we can replace the raw data with a hash, null,
>    etc.
>
> 3. The alternative is to do this in query engines, for example by
>    selecting only the needed columns and saving them into a new table.
>    But doing it on the Parquet file directly avoids decoding/encoding and
>    decompression/compression and achieves higher throughput (10x or more
>    in my testing). For large-scale tables, high throughput is the key to
>    success.
>
> Please note that this effort is not meant to replace the data masking
> (obfuscation) effort (PARQUET-1376), which should be independent of this
> and move forward.
>
> Thanks for spending time reading! Any comments are welcome!
>
> --
> Xinli Shang
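To make the prune and mask semantics in the quoted proposal concrete, here is a minimal Python sketch. It is only an illustration under assumptions: it works on plain in-memory lists rather than Parquet column chunks, and the names `prune_and_mask`, `mask_hash` and `mask_null` are hypothetical, not part of Parquet-tools or Parquet-cli. The proposed tools would copy the surviving columns at the file level without decoding them; here that is only mimicked by passing the untouched lists through unchanged.

```python
import hashlib

def mask_hash(value):
    """Replace a value with the SHA-256 hex digest of its string form."""
    if value is None:
        return None  # keep nulls as nulls
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()

def mask_null(value):
    """Replace any value with null."""
    return None

def prune_and_mask(table, drop=(), masks=None):
    """Return a new column dict.

    table: dict mapping column name -> list of values (hypothetical layout)
    drop:  column names to prune
    masks: dict mapping column name -> function applied to each value
    Columns that are neither dropped nor masked are passed through as-is.
    """
    masks = masks or {}
    result = {}
    for name, values in table.items():
        if name in drop:
            continue  # pruned column: not written to the output
        if name in masks:
            result[name] = [masks[name](v) for v in values]
        else:
            result[name] = values  # surviving column: copied untouched
    return result

table = {
    "id": [1, 2],
    "email": ["a@x.org", None],
    "ssn": ["123", "456"],
    "unused": ["x", "y"],
}
masked = prune_and_mask(
    table,
    drop={"unused"},
    masks={"email": mask_hash, "ssn": mask_null},
)
```

A user-defined translation would simply be another function passed in `masks`, for example one that keeps only the domain part of an email address.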
