Hi all, I am developing tools to prune or mask certain Parquet file columns for cost-saving or security & compliance purposes. I would like to collect your thoughts on their usefulness and any concerns. Please reply to this email or comment on the tickets (PARQUET-1791 <https://issues.apache.org/jira/browse/PARQUET-1791>, PARQUET-1792 <https://issues.apache.org/jira/browse/PARQUET-1792>).
These tools will be delivered in the parquet-tools/parquet-cli projects and used as offline operations on existing files. In the prune case, a new Parquet file is generated by cloning all the columns except the ones to be pruned. There is no decoding/encoding or decompression/compression of the surviving columns, so the throughput is much higher than doing the same work in a query engine. In the masking case, we do the same for the columns that don't need to be masked; the columns to be masked are replaced with hash values, nulls, or user-defined translations.

Why do we need these tools?

1. Analytics tables usually have many columns, but some columns are never used or stop being used after some time. Removing those columns saves storage cost; for some large-scale tables, it could save several million dollars.

2. In some cases (though not all), a masked value for a sensitive column, for example the hash of the raw data, can be sufficient for analytical purposes, so we can replace the raw data with a hash, null, etc.

3. The alternative is to do this in a query engine, for example, select only the needed columns and save them into a new table. But operating on the Parquet files directly avoids decoding/encoding and decompression/compression and achieves much higher throughput (10x+ in my testing). For large-scale tables, high throughput is the key to success.

Please note that this effort is not meant to replace the data masking (obfuscation) effort (PARQUET-1376), which is independent of this and should move forward on its own.

Thanks for spending time reading! Any comments are welcome!

-- Xinli Shang
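P.S. To make the hash-masking idea concrete, here is a minimal Python sketch (not the actual tool code, which lives in parquet-tools/Java; the function name and optional salt are hypothetical). It shows why a hash can still be useful analytically: equal raw values map to equal digests, so joins and group-bys keep working while the raw data stays hidden, and nulls stay null.

```python
import hashlib

def mask_value(value, salt=b""):
    """Replace a sensitive value with a salted SHA-256 hex digest.

    Hashing preserves equality (same input -> same digest), so
    aggregations and joins on the masked column still work, while
    the raw value is no longer readable. Nulls are left as nulls.
    """
    if value is None:
        return None
    return hashlib.sha256(salt + str(value).encode("utf-8")).hexdigest()

# A column of raw values and its masked counterpart.
emails = ["alice@example.com", "bob@example.com", "alice@example.com", None]
masked = [mask_value(v) for v in emails]

# Repeated raw values still collide to the same digest, so
# "count distinct users" style queries give the same answer.
assert masked[0] == masked[2]
assert masked[3] is None
```

Note that an unsalted hash of low-entropy data (like email addresses) can be reversed by brute force, which is one reason the actual tool also offers nulls and user-defined translations as masking options.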
