Thanks everyone for contributing. I agree with Micah's points and just want to add a small inline thought.
> One thing we could do that might move the burden on to implementations
> rather than some central CI job (which is a substantial effort, I agree,
> having worked with the Arrow one)

Good idea. Perhaps we can also add a section in
https://github.com/apache/parquet-site/pull/34/files where implementation
maintainers can self-report their integration with other implementations
(plus any known limitations). That way there could be a central status
report, plus links to specific implementations' compatibility matrices, CI
reports, or however the maintainers choose to provide the information.

--
Best Regards,
Muhammad Haseeb

From: Micah Kornfield <emkornfi...@gmail.com>
Date: Thursday, May 30, 2024 at 12:45 AM
To: dev@parquet.apache.org <dev@parquet.apache.org>
Subject: Re: [DISCUSS] Integration testing

Thanks everyone for chiming in, some thoughts inline.

> One thing we could do that might move the burden on to implementations
> rather than some central CI job (which is a substantial effort, I agree,
> having worked with the Arrow one)

I think this is a great idea. The main downside is that it potentially runs
into double breakages, but hopefully we can resolve those quickly.

> This misses plenty of potential nuance, but it would likely cover most of
> the basic "can this implementation read files" type questions

Yes, I agree; I think something is better than nothing.

> Testing Parquet interoperability could easily get into a combinatorial
> explosion of optional features, encodings, etc.

I agree, and I think we can address this incrementally. First, simply
having a matrix of data types x encodings for that data would put us in a
much better place than we are now. Ideally, with data, we could also cover
checks on statistics. I think we could maybe have a separate set of tests
for "footer" understanding.
For footers, there might need to be both more "lint"-style checks
indicating missing or extraneous metadata, in addition to pure
compatibility checks.

On Tue, May 28, 2024 at 6:36 AM Andrew Lamb <andrewlam...@gmail.com> wrote:

> One thing we could do that might move the burden on to implementations
> rather than some central CI job (which is a substantial effort, I agree,
> having worked with the Arrow one)
>
> Perhaps we could start simply with "reader compatibility" with the
> existing files in parquet-testing[1]:
>
> 1. Define a JSON file format with expected results
> 2. Document how readers should generate that expected JSON file
>
> Then, to determine compatibility with each "feature", an implementation
> would show it could read the file and create the expected JSON output.
>
> This misses plenty of potential nuance, but it would likely cover most of
> the basic "can this implementation read files" type questions.
>
> Andrew
>
> [1] https://github.com/apache/parquet-testing
>
> On Tue, May 28, 2024 at 8:01 AM Antoine Pitrou <anto...@python.org> wrote:
>
> > Hello,
> >
> > On Mon, 27 May 2024 22:46:45 -0700
> > Micah Kornfield <emkornfi...@gmail.com> wrote:
> >
> > > 2. Is anybody interested in looking more deeply into developing
> > > integration tests between the different Parquet implementations and
> > > major down-stream consumers of Parquet? I believe Apache Arrow has a
> > > pretty good model [3][4] in a lot of respects with cross-language
> > > integration tests, and nightly (via crossbow) integration tests with
> > > other consumers, but there are a wide variety of things that would
> > > improve the current state. One other possible concern is the amount of
> > > CI resources this might consume, and if we will need contributions to
> > > fund it.
> >
> > Caveat: Arrow has a lot fewer parameters to test for. The variability is
> > mostly one-dimensional and falls under the data type rubric. As a
> > matter of fact, other Arrow features such as compression or delta
> > dictionaries are less well-tested.
> >
> > Testing Parquet interoperability could easily get into a combinatorial
> > explosion of optional features, encodings, etc.
> >
> > I'm not saying that it shouldn't be done, but it may require a different
> > approach than Arrow's approach of building and testing all
> > implementations against each other in a single CI job.
> >
> > Regards
> >
> > Antoine.
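To make the "expected JSON results" idea from Andrew's proposal a bit more concrete, here is a rough sketch in Python. All field names and the file layout are hypothetical, not an agreed format; the actual schema would need to be discussed on the list. The idea is that each implementation reads a file from parquet-testing, dumps its view of the data in this JSON shape, and a small checker diffs it against the committed expected file:

```python
import json

# Hypothetical expected-results file for one parquet-testing input.
# Field names here are illustrative only.
expected_text = """
{
  "file": "alltypes_plain.parquet",
  "num_rows": 2,
  "columns": [
    {"name": "id", "type": "INT32", "values": [4, 5]},
    {"name": "bool_col", "type": "BOOLEAN", "values": [true, false]}
  ]
}
"""

def check_reader_output(expected: dict, actual: dict) -> list:
    """Return a list of mismatch descriptions (empty means compatible)."""
    problems = []
    if expected["num_rows"] != actual.get("num_rows"):
        problems.append("row count mismatch")
    actual_cols = {c["name"]: c for c in actual.get("columns", [])}
    for col in expected["columns"]:
        got = actual_cols.get(col["name"])
        if got is None:
            problems.append(f"missing column {col['name']}")
        elif got["values"] != col["values"]:
            problems.append(f"value mismatch in {col['name']}")
    return problems

expected = json.loads(expected_text)
# An implementation under test would produce `actual` by reading the
# parquet file and serializing the same JSON shape; here we pretend the
# reader round-tripped cleanly.
actual = json.loads(expected_text)
print(check_reader_output(expected, actual))  # → []
```

A nice property of this shape is that the checker itself is trivial to port, so each implementation can run the comparison in its own CI rather than a central job.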