Re: Column index testing break down

2019-03-07 Thread Wes McKinney
Hi Tim,

On Thu, Mar 7, 2019 at 11:52 AM Tim Armstrong wrote:
>
> I think you and I have different priors on this, Wes. It's definitely not
> clear-cut. I think it's an interesting point to discuss and it's
> unfortunate that you feel that way.
>
> The current state of things is partly due to path dependence, but there
> are some parts of the Impala runtime that it's important for us to
> integrate with, e.g. I/O, codegen, memory management, error handling, query
> cancellation, etc. It's hard to integrate with those if an external
> library is driving the Parquet scanning.
>

I am confident that these issues could be resolved to mutual
satisfaction if there were the desire to do so. It would probably
require some refactoring to externalize some internal interfaces (like
I/O and memory management) in a form suitable to plug into the Impala
runtime, but this would be a good exercise and would result in a better
piece of software.
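
To make the shape of those seams concrete, here is a minimal sketch of
engine-supplied I/O and memory hooks; the names are hypothetical, not
parquet-cpp's actual API:

    // Hypothetical illustration only -- not parquet-cpp's real interfaces.
    // A host engine (e.g. Impala) supplies its own I/O and memory hooks.
    #include <cstddef>
    #include <cstdint>

    class RandomAccessSource {           // engine-provided I/O
     public:
      virtual ~RandomAccessSource() = default;
      virtual int64_t Size() const = 0;
      // Read nbytes at position into out; returns bytes actually read.
      virtual int64_t ReadAt(int64_t position, int64_t nbytes,
                             uint8_t* out) = 0;
    };

    class MemoryAllocator {              // engine-provided memory management
     public:
      virtual ~MemoryAllocator() = default;
      virtual uint8_t* Allocate(size_t size) = 0;  // may enforce a budget
      virtual void Free(uint8_t* buffer, size_t size) = 0;
    };

    struct ReaderProperties {
      MemoryAllocator* allocator = nullptr;  // nullptr -> library default
      int64_t read_buffer_size = 1 << 20;
    };
    // A file reader constructed from these hooks never touches the engine's
    // runtime directly, yet all I/O and allocation flow through the engine.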

On the decoding hot path, I do not see why it would be impossible to
expose the intermediate encoded column chunks to Impala's LLVM IR in
sufficient form to populate RowBatch tuples with low overhead.
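
One plausible shape for that boundary, again with hypothetical names: the
library stops at the page level and hands still-encoded buffers to the
engine, whose generated code performs the final decode into its own row
representation:

    // Hypothetical sketch: the library yields raw, still-encoded pages and
    // the engine's codegen'd loop decodes them straight into its tuples.
    #include <cstdint>

    enum class Encoding { PLAIN, DICTIONARY, RLE };

    struct EncodedPage {
      Encoding encoding;
      const uint8_t* data;   // encoded bytes, owned by the reader
      int64_t size;
      int64_t num_values;
    };

    class ColumnChunkCursor {
     public:
      virtual ~ColumnChunkCursor() = default;
      // Returns false when the chunk is exhausted. No decoding happens
      // here, so the (possibly LLVM-IR-generated) caller owns the hot loop.
      virtual bool NextPage(EncodedPage* page) = 0;
    };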

I have written more of parquet-cpp than anyone else, and it gave me no
satisfaction to have to copy and paste so much code out of the Impala
codebase to get the project off the ground. Partly as a result of this
duplicated effort, there are still parts of parquet-cpp that are
unfinished, like nested data conversion to and from the Arrow columnar
format.

> I agree that if parquet-cpp were completely mutable and we could make
> whatever tweaks and add whatever hooks or interfaces we need to get the
> behaviour we want, then we could get most of the advantages of tight
> coupling. But that might end up just being tight coupling with a library
> boundary in the middle, which is worse in many ways. I'm also not sure
> that parquet-cpp is that mutable - at some point I think there would be
> resistance to adding hooks that nobody else wants or to refactoring
> things in a way that worked for us. We may also prefer different
> trade-offs between complexity and performance than the rest of the
> community.
>

The "original sin" here was the failure to develop a C++ library in
Apache Parquet that catered to the needs of Impala from the get-go. I
realize this decision predates both of our involvements in either
project, and so I feel that we've had to make lemons out of lemonade.
However it's at times like these when we are having to duplicate
efforts that the community is harmed by what I believe to be a mistake
of the past.

It would have been more effort to build a general-purpose library from
day one, but it would also have made Apache Parquet more successful by
becoming more widely adopted earlier in the native-code community. I
suspect that would also have had indirect benefits for Apache Impala by
improving the quality of the "funnel" of structured data into places
where Impala can query it.

- Wes


Re: Column index testing break down

2019-03-07 Thread Tim Armstrong
I think you and I have different priors on this, Wes. It's definitely not
clear-cut. I think it's an interesting point to discuss and it's
unfortunate that you feel that way.

The current state of things is partly due to path dependence, but there
are some parts of the Impala runtime that it's important for us to
integrate with, e.g. I/O, codegen, memory management, error handling, query
cancellation, etc. It's hard to integrate with those if an external
library is driving the Parquet scanning.

I agree that if parquet-cpp were completely mutable and we could make
whatever tweaks and add whatever hooks or interfaces we need to get the
behaviour we want, then we could get most of the advantages of tight
coupling. But that might end up just being tight coupling with a library
boundary in the middle, which is worse in many ways. I'm also not sure
that parquet-cpp is that mutable - at some point I think there would be
resistance to adding hooks that nobody else wants or to refactoring
things in a way that worked for us. We may also prefer different
trade-offs between complexity and performance than the rest of the
community.

There's also the question of whether the hoop-jumping required to do
optimisations across library and project boundaries is lower effort than
implementing things in the simplest way possible in one place but paying
the cost of duplication. I can see why someone might come down on the other
side on this point.

For most open source libraries we use, we'd accept the trade-off of less
control for the advantages of having a community around the library. But
for stuff that's performance-critical, we're generally willing to invest
more time into maintaining it.

Interestingly, we have a bit of a natural experiment going with the ORC
scanner implementation, which took the opposite approach. There's a huge
gap in performance, and its memory management is pretty bad (the Parquet
implementation can do scans with a fixed memory budget, etc.). Some of
that is just an effort thing, but once you dig into it there are some
deeper issues that require the kind of hooks I was talking about. I did
a presentation recently that enumerated some of the challenges, so I
might as well share it:
https://docs.google.com/presentation/d/1o-vAm933Q1WvApzSnYK0bpIhmJ0w1-9xdaKF2wu-7FQ/edit?usp=sharing
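
Since "scans with a fixed memory budget" is doing a lot of work in that
sentence, here is a rough, hedged sketch of the reservation pattern
involved; the names are hypothetical, and this is not Impala's actual
code:

    // Hypothetical sketch of budgeted scanning, not Impala's actual API.
    #include <atomic>
    #include <cstdint>

    class MemoryBudget {
     public:
      explicit MemoryBudget(int64_t limit) : limit_(limit), used_(0) {}

      // Try to reserve 'bytes'; fail instead of exceeding the limit.
      bool TryReserve(int64_t bytes) {
        int64_t cur = used_.load();
        while (cur + bytes <= limit_) {
          if (used_.compare_exchange_weak(cur, cur + bytes)) return true;
        }
        return false;
      }
      void Release(int64_t bytes) { used_.fetch_sub(bytes); }

     private:
      const int64_t limit_;
      std::atomic<int64_t> used_;
    };

    // A scanner asks before buffering another column chunk; on failure it
    // can spill, shrink its readahead, or drain already-buffered pages
    // first, so the scan degrades gracefully instead of overshooting.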

Anyway, sorry for the diatribe, but I do find it an interesting discussion
and am quite opinionated about it.

- Tim


Re: Column index testing break down

2019-03-07 Thread Wes McKinney
It makes me very sad that Impala has this bespoke Parquet
implementation. I never really understood the benefit of doing the work
there rather than in Apache Parquet, and I never found the argument
"why" I've heard over the years (that the implementation needed to be
tightly coupled to the Impala runtime) persuasive. At this point it is
probably too late to do anything about it.

Thanks



Re: Column index testing break down

2019-03-07 Thread Anna Szonyi
Hi Wes,

Zoltan has created a C++ implementation for Impala. We would be happy
to contribute it to parquet-cpp when we have time; alternatively, if
someone is keen on getting it in sooner and wants to take it over, we
would be happy to review it.
Feel free to check it out and chime in on the review of the Impala
implementation: https://gerrit.cloudera.org/#/c/12065/.

Best,
Anna



Re: Column index testing break down

2019-03-06 Thread Wes McKinney
Is there anyone who might be able to take on the project of
implementing this in C++? We have an increasing number of C++ Parquet
users these days.


Re: Column index testing break down

2019-03-05 Thread Anna Szonyi
Hi dev@ community,

This week I would like to ask for some feedback on the testing we've
been sending out. We've been sharing the most important test cases we've
created for the write path of the Parquet column index feature; now we
would like to hear from you!

Is there anything else you feel is missing or would like to get clarity on?

Thanks,
Anna


Re: Column index testing break down

2019-02-25 Thread Anna Szonyi
Hi dev@,

After a week off, this week we have an excerpt from our internal data
interoperability testing, which tests compatibility between Hive, Spark,
and Impala over Avro and Parquet. This test case is tailor-made to test
specific layouts so that files written using parquet-mr can be read by
any of the above-mentioned components. We have also checked fault
injection cases.

The test suite is currently private; however, we have made the test
classes corresponding to the following document public:
https://docs.google.com/document/d/1mHYQGXE4oM1zgg83MMc4ho1gmoJMeZcq9MWG99WgL3A

Please find the test cases and their results here:
https://github.com/zivanfi/column-indexes-data-interop-tests-excerpts

Best,
Anna




Re: Column index testing break down

2019-02-11 Thread Anna Szonyi
Hi dev@,

Last week we had a twofer: an e2e tool and an integration test
validating the contract of the column indexes (whether all values are
between the min and max and, if the boundary order is set, whether it is
correct). There are some takeaways and corrections to be made to the
former (like the max->min typo) - thanks for the feedback on that!

The next installment is also an integration test, one that exercises the
filtering logic on files, covering simple and special cases (user-defined
functions, complex filtering, no filtering, etc.).

https://github.com/apache/parquet-mr/blob/e7db9e20f52c925a207ea62d6dda6dc4e870294e/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestColumnIndexFiltering.java
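
For readers who have not followed the feature, the filtering under test
reduces to page-level pruning on the index's min/max values. A simplified
sketch of the idea (hypothetical types, and in C++ rather than the Java
under test):

    // Simplified sketch of column-index page pruning (hypothetical types).
    #include <cstdint>
    #include <vector>

    struct PageStats {
      int64_t min;
      int64_t max;
      bool null_page;   // page contains only nulls
    };

    // For a predicate like "value == v", keep only pages whose [min, max]
    // range can contain v; all other pages are skipped without being read.
    std::vector<size_t> SelectPages(const std::vector<PageStats>& index,
                                    int64_t v) {
      std::vector<size_t> keep;
      for (size_t i = 0; i < index.size(); ++i) {
        const PageStats& p = index[i];
        if (!p.null_page && p.min <= v && v <= p.max) keep.push_back(i);
      }
      return keep;
    }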

Please let me know if you have any questions/comments.

Best,
Anna


Re: Column index testing break down

2019-02-06 Thread Anna Szonyi
Hi Ryan,

Thanks for the quick feedback! We also have an integration test version
of the contract validation, which uses in-memory data and validates the
min/max/boundary order, as well as the correctness of the filtering -
this would have been the next installment, so *spoiler alert* :)

https://github.com/apache/parquet-mr/blob/e7db9e20f52c925a207ea62d6dda6dc4e870294e/parquet-column/src/test/java/org/apache/parquet/internal/column/columnindex/TestColumnIndexBuilder.java

The test cases cover both the contract validation and the filtering for
all types - a more extensive test suite than the tool.

For the tool/e2e case, most of the comments were about improvements and
extensions rather than missing functionality - please correct me if I'm
wrong. While I agree that improvements can and will be made, are any of
these blockers?

Thanks,
Anna

On Tue, Feb 5, 2019 at 7:17 PM Ryan Blue wrote:

> Thanks, Anna and Zoltan.
>
> Overall, I think the test does validate most of the contract, but needs to
> be improved. I found a few problems and commented on the commit. I would
> also like to see this as a real integration test in Parquet that is run to
> test the build. Why make this an external test?
>
> rb
>
> On Mon, Feb 4, 2019 at 8:42 AM Anna Szonyi wrote:
>
> > Hi dev@,
> >
> > As part of acquiring the final sign-offs for the Parquet release, we had
> > another discussion with Ryan about the testing of column indexes.
> > He asked us to break down our testing into "fun-size" test bites and send
> > them out incrementally to the community, so everyone would have time to
> > digest and focus on the testing case by case.
> >
> > We also agreed that we will be focusing on the write path, since that is
> > the critical part that we have to support for the long haul.
> >
> > I will be sending out the related tests; please feel free to ask
> > clarifying questions, and if you have a particular test case you want to
> > see next, please let me know.
> >
> > As a first "taste", Zoltan created the following tool to *validate the
> > contract* of the indexes against randomly generated data:
> > - The min value stored in the index must be less than or equal to all
> >   values in the page.
> > - The max value stored in the index must be greater than or equal to all
> >   values in the page.
> > - The null count stored in the index must be equal to the actual number
> >   of nulls.
> > - Only pages consisting entirely of NULLs can be marked as a null page
> >   in the index.
> > - Under ASCENDING boundary order, the min value of a page must be
> >   greater than or equal to the min value of the previous page.
> > - Under ASCENDING boundary order, the max value of a page must be
> >   greater than or equal to the max value of the previous page.
> > - Under DESCENDING boundary order, the min value of a page must be
> >   less than or equal to the min value of the previous page.
> > - Under DESCENDING boundary order, the max value of a page must be
> >   less than or equal to the max value of the previous page.
> >
> >
> >
> https://github.com/zivanfi/parquet-mr/commit/0e74b4207daa0b53cf4bd0774d2632c388845cb9
> > The code itself is unpolished but fully functional and complete.
> > Once the release is signed off, we plan to refactor this and offer it
> > as part of parquet-tools or parquet-cli; however, it is perfectly fine
> > for validation as-is.
> >
> > Best,
> > Anna
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
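
For reference, the invariants listed in the quoted message above can be
checked mechanically. A compact, hedged sketch of those checks follows;
the types and names are hypothetical, and it is in C++ rather than the
Java of the actual tool:

    // Hedged sketch of the column-index contract checks quoted above.
    #include <cstdint>
    #include <vector>

    enum class BoundaryOrder { UNORDERED, ASCENDING, DESCENDING };

    struct PageEntry {       // what the index stores for one page
      bool null_page;
      int64_t min, max;      // meaningless when null_page is true
      int64_t null_count;
    };

    struct PageActual {      // recomputed by scanning the page's data
      bool all_null;
      int64_t min, max;
      int64_t null_count;
    };

    bool ValidateContract(const std::vector<PageEntry>& index,
                          const std::vector<PageActual>& actual,
                          BoundaryOrder order) {
      const PageEntry* prev = nullptr;
      for (size_t i = 0; i < index.size(); ++i) {
        const PageEntry& e = index[i];
        if (e.null_count != actual[i].null_count) return false;
        // Only pages consisting entirely of NULLs may be marked null pages.
        if (e.null_page && !actual[i].all_null) return false;
        if (e.null_page) continue;
        if (e.min > actual[i].min) return false;  // min <= all values
        if (e.max < actual[i].max) return false;  // max >= all values
        if (prev != nullptr) {                    // boundary-order checks
          if (order == BoundaryOrder::ASCENDING &&
              (e.min < prev->min || e.max < prev->max)) return false;
          if (order == BoundaryOrder::DESCENDING &&
              (e.min > prev->min || e.max > prev->max)) return false;
        }
        prev = &e;
      }
      return true;
    }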