[jira] [Updated] (PARQUET-1178) Parquet modular encryption
[ https://issues.apache.org/jira/browse/PARQUET-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky updated PARQUET-1178: -- Fix Version/s: (was: format-2.7.0) > Parquet modular encryption > -- > > Key: PARQUET-1178 > URL: https://issues.apache.org/jira/browse/PARQUET-1178 > Project: Parquet > Issue Type: New Feature > Reporter: Gidon Gershinsky > Assignee: Gidon Gershinsky > Priority: Major > > A mechanism for modular encryption and decryption of Parquet files. Allows data to be kept fully encrypted in storage while enabling efficient analytics on the data, via reader-side extraction / authentication / decryption of the data subsets required by columnar projection and predicate push-down. > Enables fine-grained access control to column data by encrypting different columns with different keys. > Supports a number of encryption algorithms, to account for different security and performance requirements. -- This message was sent by Atlassian Jira (v8.3.4#803005)
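The per-column-key idea in the description can be illustrated with a stdlib-only Java sketch. This is NOT the actual Parquet modular encryption format (which defines its own module layout, AAD, and footer handling); the class and variable names below are invented for illustration. The point shown: each column chunk is sealed with AES-GCM under its own key, so a reader holding only some keys can decrypt only those columns, and decryption with the wrong key fails authentication instead of yielding garbage.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Arrays;

public class ColumnCryptoSketch {
    private static final int NONCE_LEN = 12;   // 96-bit nonce, standard for GCM
    private static final int TAG_BITS = 128;   // authentication tag length

    // Encrypt one column chunk under its own key; the nonce is prepended to the output.
    static byte[] encrypt(SecretKey key, byte[] plaintext) throws Exception {
        byte[] nonce = new byte[NONCE_LEN];
        new SecureRandom().nextBytes(nonce);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, nonce));
        byte[] ct = cipher.doFinal(plaintext);
        byte[] out = new byte[NONCE_LEN + ct.length];
        System.arraycopy(nonce, 0, out, 0, NONCE_LEN);
        System.arraycopy(ct, 0, out, NONCE_LEN, ct.length);
        return out;
    }

    // Decrypt a column chunk; GCM verifies integrity as part of decryption.
    static byte[] decrypt(SecretKey key, byte[] blob) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, key,
                new GCMParameterSpec(TAG_BITS, Arrays.copyOfRange(blob, 0, NONCE_LEN)));
        return cipher.doFinal(Arrays.copyOfRange(blob, NONCE_LEN, blob.length));
    }

    public static void main(String[] args) throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        SecretKey colKeyA = kg.generateKey();  // key for one sensitive column
        SecretKey colKeyB = kg.generateKey();  // a different key for another column

        byte[] chunk = "ssn=123-45-6789".getBytes(StandardCharsets.UTF_8);
        byte[] sealed = encrypt(colKeyA, chunk);
        System.out.println(new String(decrypt(colKeyA, sealed), StandardCharsets.UTF_8));
        // Decrypting with the wrong column key fails the GCM tag check:
        try {
            decrypt(colKeyB, sealed);
            System.out.println("unexpected");
        } catch (Exception e) {
            System.out.println("wrong key rejected");
        }
    }
}
```

The authenticated (rather than merely encrypted) mode is what makes the reader-side "extraction / authentication / decryption" phrasing in the description meaningful: tampering or key mix-ups are detected, not silently decoded.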
[jira] [Commented] (PARQUET-1178) Parquet modular encryption
[ https://issues.apache.org/jira/browse/PARQUET-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034916#comment-17034916 ] Jason Brugger commented on PARQUET-1178: What's the best way to get started with this on a Databricks cluster? If I install _format-2.7.0_ as a new library, how would I reference this data source in lieu of the cluster's default parquet library?
[jira] [Updated] (PARQUET-1770) [C++][CI] Add fuzz target for reading Parquet files
[ https://issues.apache.org/jira/browse/PARQUET-1770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated PARQUET-1770: Labels: pull-request-available (was: ) > [C++][CI] Add fuzz target for reading Parquet files > --- > > Key: PARQUET-1770 > URL: https://issues.apache.org/jira/browse/PARQUET-1770 > Project: Parquet > Issue Type: Task > Components: parquet-cpp >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > > Now that Arrow has been accepted on OSS-Fuzz, we should check for crashes and > potential vulnerabilities when reading Parquet files. > The Parquet fuzz target should use similar conventions as the IPC fuzz > targets in {{cpp/src/arrow/ipc/}}. An executable to generate a seed corpus > should be added as well, to make fuzzing more efficient. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (PARQUET-1770) [C++][CI] Add fuzz target for reading Parquet files
[ https://issues.apache.org/jira/browse/PARQUET-1770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned PARQUET-1770: --- Assignee: Antoine Pitrou
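For readers unfamiliar with the setup: an actual OSS-Fuzz target for Arrow/Parquet is a C++ `LLVMFuzzerTestOneInput` function following the conventions in {{cpp/src/arrow/ipc/}}. The contract such a target enforces can be sketched in plain Java: for ANY input bytes, the parser must either succeed or fail with a well-defined error, never crash. Everything below (the toy length-prefixed format, the names `fuzzOne`, the mutation loop) is invented for illustration, not the proposed fuzz target.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.util.Random;

public class FuzzSketch {
    // The "target": parse a toy length-prefixed record format. The contract is
    // that for ANY input it either parses or throws IOException -- never an
    // unchecked exception, never a crash.
    static void fuzzOne(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        while (in.available() > 0) {
            int len = in.readInt();              // EOFException (an IOException) on truncation
            if (len < 0 || len > in.available()) throw new IOException("bad length");
            in.skipBytes(len);
        }
    }

    public static void main(String[] args) throws IOException {
        Random rnd = new Random(42);             // fixed seed => reproducible run
        byte[] seed = {0, 0, 0, 2, 7, 7};        // one valid "seed corpus" entry
        int rejected = 0;
        for (int i = 0; i < 10_000; i++) {
            byte[] input = seed.clone();
            input[rnd.nextInt(input.length)] ^= (byte) rnd.nextInt(256); // mutate one byte
            try {
                fuzzOne(input);
            } catch (IOException expected) {
                rejected++;                      // clean rejection is fine
            }
            // any other exception would propagate and fail the whole run
        }
        System.out.println("survived 10000 mutated inputs, rejected " + rejected);
    }
}
```

Starting mutations from a valid seed corpus entry, as the ticket suggests, is what lets the fuzzer reach deep parsing code instead of bouncing off the first header check.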
[jira] [Comment Edited] (PARQUET-1792) Add 'mask' command to parquet-tools/parquet-cli
[ https://issues.apache.org/jira/browse/PARQUET-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034588#comment-17034588 ] Xinli Shang edited comment on PARQUET-1792 at 2/11/20 4:36 PM: --- [~gershinsky], this is just a simple offline tool that replaces raw columns with masked values. It is different from the data obfuscation feature we discussed earlier: users have to run this tool explicitly, and they know what the data will look like after translation; there is no chance of it happening accidentally, implicitly, or by default. The tool can provide different ways of translating raw data into masked values, and can let users define their own if they have security concerns. We just provide the tool to make their work easier. In addition, ORC has already released such masking mechanisms. As mentioned earlier, I can send an email to the dev list to see whether there is a need for this tool. Again, this proposal is independent of the data obfuscation work that we are doing jointly.
> Add 'mask' command to parquet-tools/parquet-cli > --- > > Key: PARQUET-1792 > URL: https://issues.apache.org/jira/browse/PARQUET-1792 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.12.0 >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > Fix For: 1.12.0 > > > Some personal data columns need to be masked instead of being > pruned(Parquet-1791). We need a tool to replace the raw data columns with > masked value. The masked value could be hash, null, redact etc. For the > unchanged columns, they should be moved as a whole like 'merge', 'prune' > command in Parquet-tools. > > Implementing this feature in file format is 10X faster than doing it by > rewriting the table data in the query engine. -- This message was sent by Atlassian Jira (v8.3.4#803005)
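The three mask kinds named in the description (hash, null, redact) can be sketched with a stdlib-only Java example. The method names below are hypothetical, not parquet-cli's actual API; they only show the trade-offs: a hash mask preserves joinability, a redact mask preserves only length, a null mask drops the value entirely.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class MaskSketch {
    // "hash" mask: deterministic SHA-256 digest -- hides the value but
    // equal inputs still map to equal outputs, so joins keep working.
    static String hashMask(String value) throws Exception {
        byte[] d = MessageDigest.getInstance("SHA-256")
                .digest(value.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : d) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    // "redact" mask: replace every character with a fixed symbol, keeping only length.
    static String redactMask(String value) {
        return "X".repeat(value.length());
    }

    // "null" mask: drop the value entirely.
    static String nullMask(String value) {
        return null;
    }

    public static void main(String[] args) throws Exception {
        String ssn = "123-45-6789";
        System.out.println(redactMask(ssn));                      // same length, no content
        System.out.println(nullMask(ssn));                        // value gone
        System.out.println(hashMask(ssn).length());               // 64 hex chars
        System.out.println(hashMask(ssn).equals(hashMask(ssn)));  // deterministic
    }
}
```

Whichever mask is chosen, the rewritten column also needs fresh page/column statistics, since the old min/max describe the raw values, not the masked ones.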
[jira] [Commented] (PARQUET-1792) Add 'mask' command to parquet-tools/parquet-cli
[ https://issues.apache.org/jira/browse/PARQUET-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034588#comment-17034588 ] Xinli Shang commented on PARQUET-1792: -- [~gershinsky], this is just a simple offline tool that replaces raw columns with masked values. It is different from the data obfuscation feature we discussed earlier: users have to run this tool explicitly, and they know what the data will look like after translation; there is no chance of it happening accidentally, implicitly, or by default. The tool can provide different ways of translating raw data into masked values, and can let users define their own if they have security concerns. We just provide the tool to make their work easier. In addition, ORC has already released such masking mechanisms. As mentioned earlier, I can send an email to the dev list to see whether there is a need for this tool. Again, this proposal is independent of the data obfuscation work that we are doing jointly.
[jira] [Commented] (PARQUET-1792) Add 'mask' command to parquet-tools/parquet-cli
[ https://issues.apache.org/jira/browse/PARQUET-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034553#comment-17034553 ] Xinli Shang commented on PARQUET-1792: -- [~gszadovszky] the tool can be run in parallel in a cluster; for example, we can easily write a Spark application to do it. Even for 'prune' we still need a Spark application to parallelize the work, otherwise the time to finish is significant, although it is already faster than doing it in query engines. Regarding reading the original values and generating the hashes/statistics, we only need to do that for the columns being masked. In many cases we see that only very few columns need masking; all other columns are moved as a whole, as in the 'merge' or 'prune' commands, which is a big saving. Yes, this operation would be slower than the 'prune' command, but it can still save a great deal compared with doing it via a query engine.
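The "one file per parallel task" pattern described in the comment above can be sketched with a plain Java parallel stream standing in for the Spark job. The file names and the `maskFile` helper are made up for illustration; a real driver would open each Parquet file, copy the untouched columns wholesale, and rewrite only the masked ones.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelMaskDriver {
    // Stand-in for rewriting one Parquet file with masked columns.
    static void maskFile(String path, AtomicInteger done) {
        // ... open `path`, copy unchanged column chunks, mask the sensitive ones ...
        done.incrementAndGet();
    }

    public static void main(String[] args) {
        List<String> files = List.of("part-0000", "part-0001", "part-0002", "part-0003");
        AtomicInteger done = new AtomicInteger();
        // Each file is independent, so the work parallelizes trivially --
        // the same shape a Spark job over the file list would have.
        files.parallelStream().forEach(f -> maskFile(f, done));
        System.out.println("masked " + done.get() + " files");
    }
}
```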
[jira] [Comment Edited] (PARQUET-1792) Add 'mask' command to parquet-tools/parquet-cli
[ https://issues.apache.org/jira/browse/PARQUET-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034553#comment-17034553 ] Xinli Shang edited comment on PARQUET-1792 at 2/11/20 3:47 PM: --- [~gszadovszky] the tool can be run in parallel in a cluster; for example, we can easily write a Spark application to do it. Even for 'prune' we still need a Spark application to parallelize the work, otherwise the time to finish is significant, although it is already faster than doing it in query engines. Regarding reading the original values and generating the hashes/statistics, we only need to do that for the columns being masked. In many cases we see that only very few columns need masking; all other columns are moved as a whole, as in the 'merge' or 'prune' commands, which is a big saving. Yes, this operation would be slower than the 'prune' command, but it can still save a great deal compared with doing it via a query engine.
Re: [DISCUSS] merge bloom filter branch to master
I think the best way is to create a PR. It would be more transparent. On Tue, Feb 11, 2020 at 2:11 PM Junjie Chen wrote: > Thanks for the patch @Walid. > > @Gabor, do I need to create a PR for merging, or will a committer help to > merge? > > > On Sun, Jan 12, 2020 at 7:39 PM Gara Walid wrote: > > > Hi Junjie, > > > > Thank you Junjie and the community for all your efforts on the bloom > filter > > topic. > > I've been testing the bloom filter branch with Map Reduce examples and it > > sounds good. > > I'll add an example to the parquet-hadoop package. > > > > Cheers, > > Walid > > > > On Fri, Jan 10, 2020 at 09:22, Junjie Chen > > wrote: > > > Hi Community > > > > > > The bloom filter branch now contains basic functional logic that > > includes > > > read/write and filtering. Though the feature still needs polishing and > > > improving, I'd suggest merging back to master first so that more > people > > > could use it and provide feedback. What do you think? > > > > > > > > -- > Best Regards >
Re: [DISCUSS] merge bloom filter branch to master
Thanks for the patch @Walid. @Gabor, do I need to create a PR for merging, or will a committer help to merge? On Sun, Jan 12, 2020 at 7:39 PM Gara Walid wrote: > Hi Junjie, > > Thank you Junjie and the community for all your efforts on the bloom filter > topic. > I've been testing the bloom filter branch with Map Reduce examples and it > sounds good. > I'll add an example to the parquet-hadoop package. > > Cheers, > Walid > > On Fri, Jan 10, 2020 at 09:22, Junjie Chen > wrote: > > > Hi Community > > > > The bloom filter branch now contains basic functional logic that > includes > > read/write and filtering. Though the feature still needs polishing and > > improving, I'd suggest merging back to master first so that more people > > could use it and provide feedback. What do you think? > > > -- Best Regards
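For background on the thread: the filtering contract a bloom filter gives Parquet readers — no false negatives, occasional false positives — can be shown with a generic stdlib-only sketch. Note that the Parquet format's actual implementation is a split-block bloom filter keyed with xxHash; the class below is only the generic idea, with invented hashing.

```java
import java.util.BitSet;

public class BloomSketch {
    private final BitSet bits;
    private final int m, k;   // m bits, k probe positions per key

    BloomSketch(int m, int k) { bits = new BitSet(m); this.m = m; this.k = k; }

    // Second, independent hash (FNV-1a) for double hashing.
    private static int fnv(String s) {
        int h = 0x811c9dc5;
        for (int i = 0; i < s.length(); i++) { h ^= s.charAt(i); h *= 0x01000193; }
        return h;
    }

    // Kirsch-Mitzenmacher double hashing: index_i = h1 + i * h2 (mod m).
    private int index(String s, int i) {
        return Math.floorMod(s.hashCode() + i * fnv(s), m);
    }

    void add(String s) { for (int i = 0; i < k; i++) bits.set(index(s, i)); }

    boolean mightContain(String s) {
        for (int i = 0; i < k; i++) if (!bits.get(index(s, i))) return false;
        return true;   // "maybe": false positives possible, false negatives are not
    }

    public static void main(String[] args) {
        BloomSketch f = new BloomSketch(1 << 16, 4);
        f.add("alice");
        f.add("bob");
        System.out.println(f.mightContain("alice")); // always true once added
        System.out.println(f.mightContain("bob"));
        // An absent key is *probably* reported absent; a rare false positive
        // only costs an unnecessary page read, never a wrong query result.
    }
}
```

That asymmetry is exactly why the feature is safe to merge early: a filter that says "maybe" too often degrades only performance, not correctness.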
[jira] [Commented] (PARQUET-1792) Add 'mask' command to parquet-tools/parquet-cli
[ https://issues.apache.org/jira/browse/PARQUET-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034416#comment-17034416 ] Gidon Gershinsky commented on PARQUET-1792: --- There is also a security aspect. While "prune" cleanly removes a sensitive column and is therefore safe, "mask"/"redact" replaces one version of the column data with another, and can easily leak sensitive information if not done properly. I believe that at this stage it is best done above Parquet, by the users, who can simply add columns with the masked data. It can also be faster than Parquet tools if run on a multi-threaded engine, as Gabor mentioned. We're working on a system that would make it possible to analyze Parquet files with masked/redacted columns and detect information leakage. That would also allow masking to be performed inside the Parquet libraries, making it fast and multi-threaded, but this project will take a while to complete. It's not urgent, though, since masking/redaction can easily be implemented by users today, above Parquet.
[jira] [Created] (PARQUET-1793) Support writing INT96 timestamp from avro
Tamas Palfy created PARQUET-1793: Summary: Support writing INT96 timestamp from avro Key: PARQUET-1793 URL: https://issues.apache.org/jira/browse/PARQUET-1793 Project: Parquet Issue Type: Improvement Reporter: Tamas Palfy Add support for writing Avro LONG/timestamp-millis data in INT96 (or in the current INT64) format in Parquet. Add a config flag to select the required timestamp output format (INT96 or INT64).
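For reference, the INT96 timestamp layout this ticket refers to is the legacy 12-byte Impala/Hive convention: 8 little-endian bytes of nanoseconds-of-day followed by 4 little-endian bytes of Julian day number. A stdlib-only sketch of the conversion the proposed flag would toggle (the helper names are made up; real writers plug this into the column writer):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Int96Sketch {
    private static final long JULIAN_DAY_OF_EPOCH = 2440588L; // Julian day of 1970-01-01
    private static final long MILLIS_PER_DAY = 86_400_000L;

    // Encode epoch milliseconds (Avro timestamp-millis) as the 12-byte INT96
    // layout: 8 LE bytes nanos-of-day, then 4 LE bytes Julian day.
    static byte[] toInt96(long epochMillis) {
        long julianDay = Math.floorDiv(epochMillis, MILLIS_PER_DAY) + JULIAN_DAY_OF_EPOCH;
        long nanosOfDay = Math.floorMod(epochMillis, MILLIS_PER_DAY) * 1_000_000L;
        return ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN)
                .putLong(nanosOfDay).putInt((int) julianDay).array();
    }

    static long fromInt96(byte[] int96) {
        ByteBuffer buf = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
        long nanosOfDay = buf.getLong();
        long julianDay = Integer.toUnsignedLong(buf.getInt());
        return (julianDay - JULIAN_DAY_OF_EPOCH) * MILLIS_PER_DAY + nanosOfDay / 1_000_000L;
    }

    public static void main(String[] args) {
        long ts = 1_581_400_000_000L;                  // an arbitrary epoch-millis value
        System.out.println(fromInt96(toInt96(ts)) == ts);  // millis round-trip losslessly
        // The epoch itself lands on Julian day 2440588 with zero nanos-of-day:
        byte[] epoch = toInt96(0L);
        System.out.println(ByteBuffer.wrap(epoch).order(ByteOrder.LITTLE_ENDIAN).getLong());
    }
}
```

INT96 is deprecated in the Parquet format, but some engines (notably older Spark/Impala deployments) still expect it, which is the interoperability motivation behind the config flag.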
Re: Using JsonRecordFormatter
Hi Gabor,

Thanks a lot for the quick reply. I've been looking into AvroParquetReader this morning and it looks like a much better fit for my problem:

    ParquetReader pReader = AvroParquetReader.builder(localInputFile).build();
    for (GenericData.Record value = (GenericData.Record) pReader.read();
         value != null;
         value = (GenericData.Record) pReader.read()) {
        ...
    }

Seems like it gets me data in the format I want.

Regards,
Ben

On 2/11/20, Gabor Szadovszky wrote: > Hi Ben, > > SimpleRecord is pretty old and did not upgraded to a newer concept. You > need to extend ParquetReader for SimpleRecord. See AvroParquetReader for > details. > After you have to specified SimpleRecordReader and the related Builder you > can add the methods you need. > I also wanted to highlight that parquet-tools is not really for using it > from the code but from the command line. There are not proper unit tests > for that code and there are no guarantees for backward code compatibility > between the releases. I would not recommend using these code parts for > production. > > Cheers, > Gabor > > On Mon, Feb 10, 2020 at 9:55 PM Ben Watson wrote: > >> Hello, >> >> I'm wanting to read Parquet records into JSON with Java, and it seems >> that >> JsonRecordFormatter is the way to do it (like in >> >> https://github.com/apache/parquet-mr/blob/master/parquet-tools/src/main/java/org/apache/parquet/tools/command/CatCommand.java#L84 >> ). >> >> Unlike the above example, I want to avoid passing a Hadoop Path object, >> and >> instead I want to use the ParquetReader.read(InputFile).build(); builder. >> However this returns a ParquetReader, and not the >> ParquetReader that I need for JsonRecordFormatter. It looks >> like I need to insert a new SimpleReadSupport() somewhere, but I can't >> find >> any method in that builder that accepts it. >> >> I've tried looking for other usages online etc but haven't had any luck. >> Any pointers greatly appreciated. >> >> Regards, >> >> Ben >>
Re: Using JsonRecordFormatter
Hi Ben,

SimpleRecord is pretty old and was not upgraded to the newer concepts. You need to extend ParquetReader for SimpleRecord; see AvroParquetReader for details. After you have specified a SimpleRecordReader and the related Builder, you can add the methods you need.
I also want to highlight that parquet-tools is not really meant to be used from code but from the command line. There are no proper unit tests for that code and there are no guarantees of backward code compatibility between releases. I would not recommend using these code parts in production.

Cheers,
Gabor

On Mon, Feb 10, 2020 at 9:55 PM Ben Watson wrote: > Hello, > > I'm wanting to read Parquet records into JSON with Java, and it seems that > JsonRecordFormatter is the way to do it (like in > > https://github.com/apache/parquet-mr/blob/master/parquet-tools/src/main/java/org/apache/parquet/tools/command/CatCommand.java#L84 > ). > > Unlike the above example, I want to avoid passing a Hadoop Path object, and > instead I want to use the ParquetReader.read(InputFile).build(); builder. > However this returns a ParquetReader, and not the > ParquetReader that I need for JsonRecordFormatter. It looks > like I need to insert a new SimpleReadSupport() somewhere, but I can't find > any method in that builder that accepts it. > > I've tried looking for other usages online etc but haven't had any luck. > Any pointers greatly appreciated. > > Regards, > > Ben >
[jira] [Commented] (PARQUET-1792) Add 'mask' command to parquet-tools/parquet-cli
[ https://issues.apache.org/jira/browse/PARQUET-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034294#comment-17034294 ] Gabor Szadovszky commented on PARQUET-1792: --- If you are talking about one file at a time, you might be right that it is 10x faster than doing it in a query engine. But the tool runs on one node while the query engine uses several at the same time, so I am not sure about the 10x performance. Pruning the file makes sense to me at the library level because it can be done efficiently (no need to unpack/decode the pages or the entire column chunks). Masking the values, on the other hand, requires reading the actual values and generating the hashes, and you also need to regenerate the related statistics. Therefore, I am not sure this masking feature is properly suited to parquet-mr.
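Gabor's point that masking forces a full decode — unlike prune, which can copy untouched pages verbatim — comes down to statistics: the old min/max describe the raw values and say nothing about the masked column chunk, so the stats must be rebuilt from the rewritten values. A small stdlib-only sketch (the `hashMask` helper and the column literal are illustrative only):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.List;

public class MaskedStatsSketch {
    // Illustrative hash mask (see the 'mask' command discussion above).
    static String hashMask(String v) throws Exception {
        byte[] d = MessageDigest.getInstance("SHA-256")
                .digest(v.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : d) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        List<String> column = List.of("carol", "alice", "bob");
        String min = null, max = null;
        for (String v : column) {
            // Every value must be decoded, transformed, and re-encoded ...
            String masked = hashMask(v);
            // ... and the chunk statistics recomputed over the MASKED values:
            if (min == null || masked.compareTo(min) < 0) min = masked;
            if (max == null || masked.compareTo(max) > 0) max = masked;
        }
        // The raw min/max ("alice"/"carol") are useless for the masked chunk.
        System.out.println(min.equals(max));  // distinct digests => distinct bounds
        System.out.println(min.length());     // bounds are now 64-char hex digests
    }
}
```

This per-value work is exactly what makes mask inherently slower than prune, whichever layer implements it.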