Re: Batch Sizing for Parquet Flat Reader

2018-03-25 Thread salim achouche
I have updated the document with more design details.

On Thu, Feb 8, 2018 at 5:42 PM, salim achouche  wrote:

> The following document describes
> a proposal for enforcing batch sizing constraints (count and memory) within
> the Parquet Reader (Flat Schema). Please feel free to take a look and
> provide feedback.
>
> Thanks!
>
> Regards,
> Salim
>


Re: Batch Sizing for Parquet Flat Reader

2018-03-04 Thread Aman Sinha
Hi Paul,
thanks for your comments.  I have added my thoughts in the DRILL-6147 JIRA
as well.  Regarding the hangout, let me find out about availability of
other folks too and will circle back with you.

thanks,
Aman

On Sun, Mar 4, 2018 at 1:23 PM, Paul Rogers 
wrote:

> Hi Aman,
> To follow up, we should look at all sides of the issue. One factor
> overlooked in my previous note is that code now is better than code later.
> DRILL-6147 is available today and will immediately give users a
> performance boost. The result set loader is large and will take some months
> to commit, and so can't offer a benefit until then.
> It is hard to argue that we should wait. Let's get DRILL-6147 in now, then
> revisit the issue later (doing the proposed test) once the result set
> loader is available.
> And, as discussed, DRILL-6147 works only for the flat Parquet reader.
> We'll need the result set loader for the Parquet reader that reads nested
> types.
>
> Thanks,
> - Paul
>
>
>
> On Sunday, March 4, 2018, 1:07:38 PM PST, Paul Rogers
>  wrote:
>
>  Hi Aman,
> Please see my comment in DRILL-6147.
> For the hangout to be productive, perhaps we should create test cases that
> will show the benefit of DRILL-6147 relative to the result set loader.
> The test case of interest has three parts:
> * Multiple variable-width fields (say five) with a large variance in field
> widths in each field
> * Large data set that will be split across multiple batches (say 10 or 50
> batches)
> * Constraints on total batch size and size of the largest vector
> Clearly, we can't try this out with Parquet: that's the topic we are
> discussing.
> But, we can generate a data set in code, then do a unit test of the two
> methods (just the vector loading bits) and time the result. Similar code
> already exists in the result set loader branch that can be repurposed for
> this use. We'd want to create a similar test for the DRILL-6147 mechanisms.
> We can work out the details in a separate discussion.
> IMHO, if the results are the same, we should go with one solution. If
> DRILL-6147 is significantly faster, the decision is clear: we have two
> solutions.
> We also should consider things such as selection, null columns, implicit
> columns, and the other higher-level functionality provided by the result
> set loader. Since Parquet already has ad-hoc solutions for these, with
> DRILL-6147 we'd simply keep those solutions for Parquet, while the other
> readers use the new, unified mechanisms.
> In terms of time, this week is busy:
> * Wed. 3PM or later
> * Fri. 3PM or later
> The week of the 12th is much more open.
> Thanks,
> - Paul
>
>
>
> On Sunday, March 4, 2018, 11:48:33 AM PST, Aman Sinha <
> amansi...@apache.org> wrote:
>
>  Hi all, with reference to DRILL-6147, given the overlapping
> approaches, I feel like we should have a separate hangout session with
> interested parties and discuss the details.
> Let me know and I can set up one.
>
> Aman
>
> On Mon, Feb 12, 2018 at 8:50 AM, Padma Penumarthy 
> wrote:
>
> > If our goal is not to allocate more than 16MB for individual vectors to
> > avoid external fragmentation, I guess we can take that also into
> > consideration in our calculations to figure out the outgoing number of rows.
> > The math might become more complex. But the main point, like you said, is
> > that operators know what they are getting and can figure out how to deal
> > with that to honor the constraints imposed.
> >
> > Thanks
> > Padma
> >
> >
> > On Feb 12, 2018, at 8:25 AM, Paul Rogers wrote:
> >
> > Agreed that allocating vectors up front is another good improvement.
> > The average batch size approach gets us 80% of the way to the goal: it
> > limits batch size and allows vector preallocation.
> > What it cannot do is limit individual vector sizes. Nor can it ensure that
> > the resulting batch is optimally loaded with data. Getting the remaining
> > 20% requires the level of detail provided by the result set loader.
> > We are driving to use the result set loader first in readers, since
> > readers can't use the average batch size (they don't have an input batch to
> > use to obtain sizes.)
> > To use the result set loader in non-leaf operators, we'd need to modify
> > code generation. AFAIK, that is not something anyone is working on, so
> > another advantage of the average batch size method is that it works with
> > the code generation we already have.
> > Thanks,
> > - Paul
> >
> >
> >
> >On Sunday, February 11, 2018, 7:28:52 PM PST, Padma Penumarthy <
> > ppenumar...@mapr.com> wrote:
> >
> > With average row size method, since I know number of rows and the average
> > size for each column,
> > I am planning to use that information to allocate required memory for each
> > vector upfront.

Re: Batch Sizing for Parquet Flat Reader

2018-03-04 Thread Paul Rogers
Hi Aman,
To follow up, we should look at all sides of the issue. One factor overlooked 
in my previous note is that code now is better than code later.
DRILL-6147 is available today and will immediately give users a performance 
boost. The result set loader is large and will take some months to commit, and 
so can't offer a benefit until then.
It is hard to argue that we should wait. Let's get DRILL-6147 in now, then revisit the 
issue later (doing the proposed test) once the result set loader is available.
And, as discussed, DRILL-6147 works only for the flat Parquet reader. We'll 
need the result set loader for the Parquet reader that reads nested types.

Thanks,
- Paul

 

On Sunday, March 4, 2018, 1:07:38 PM PST, Paul Rogers 
 wrote:  
 
 Hi Aman,
Please see my comment in DRILL-6147.
For the hangout to be productive, perhaps we should create test cases that will 
show the benefit of DRILL-6147 relative to the result set loader.
The test case of interest has three parts:
* Multiple variable-width fields (say five) with a large variance in field 
widths in each field
* Large data set that will be split across multiple batches (say 10 or 50 
batches)
* Constraints on total batch size and size of the largest vector
Clearly, we can't try this out with Parquet: that's the topic we are discussing.
But, we can generate a data set in code, then do a unit test of the two methods 
(just the vector loading bits) and time the result. Similar code already exists 
in the result set loader branch that can be repurposed for this use. We'd want 
to create a similar test for the DRILL-6147 mechanisms. We can work out the 
details in a separate discussion.
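For concreteness, a data generator for such a test could look roughly like the
sketch below (Java; purely illustrative - the class name, field count, and
width range are assumptions, not the actual test code):

import java.util.Random;

// Generates rows of five variable-width string fields with high width
// variance, enough rows to fill the 10-50 batches mentioned above.
public class SkewedRowGenerator {
  private final Random random = new Random(42); // fixed seed for repeatable timings

  public String[] nextRow() {
    String[] row = new String[5];
    for (int i = 0; i < row.length; i++) {
      int width = 1 + random.nextInt(1000); // field widths vary from 1 to 1000 chars
      StringBuilder sb = new StringBuilder(width);
      for (int j = 0; j < width; j++) {
        sb.append((char) ('a' + random.nextInt(26)));
      }
      row[i] = sb.toString();
    }
    return row;
  }
}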
IMHO, if the results are the same, we should go with one solution. If 
DRILL-6147 is significantly faster, the decision is clear: we have two 
solutions.
We also should consider things such as selection, null columns, implicit 
columns, and the other higher-level functionality provided by the result set 
loader. Since Parquet already has ad-hoc solutions for these, with DRILL-6147 
we'd simply keep those solutions for Parquet, while the other readers use the 
new, unified mechanisms.
In terms of time, this week is busy:
* Wed. 3PM or later
* Fri. 3PM or later
The week of the 12th is much more open.
Thanks,
- Paul

 

    On Sunday, March 4, 2018, 11:48:33 AM PST, Aman Sinha 
 wrote:  
 
 Hi all, with reference to DRILL-6147, given the overlapping
approaches, I feel like we should have a separate hangout session with
interested parties and discuss the details.
Let me know and I can set up one.

Aman

On Mon, Feb 12, 2018 at 8:50 AM, Padma Penumarthy 
wrote:

> If our goal is not to allocate more than 16MB for individual vectors to
> avoid external fragmentation, I guess we can take that also into consideration
> in our calculations to figure out the outgoing number of rows.
> The math might become more complex. But the main point, like you said, is that
> operators know what they are getting and can figure out how to deal with that
> to honor the constraints imposed.
>
> Thanks
> Padma
>
>
> On Feb 12, 2018, at 8:25 AM, Paul Rogers wrote:
>
> Agreed that allocating vectors up front is another good improvement.
> The average batch size approach gets us 80% of the way to the goal: it
> limits batch size and allows vector preallocation.
> What it cannot do is limit individual vector sizes. Nor can it ensure that
> the resulting batch is optimally loaded with data. Getting the remaining
> 20% requires the level of detail provided by the result set loader.
> We are driving to use the result set loader first in readers, since
> readers can't use the average batch size (they don't have an input batch to
> use to obtain sizes.)
> To use the result set loader in non-leaf operators, we'd need to modify
> code generation. AFAIK, that is not something anyone is working on, so
> another advantage of the average batch size method is that it works with
> the code generation we already have.
> Thanks,
> - Paul
>
>
>
>    On Sunday, February 11, 2018, 7:28:52 PM PST, Padma Penumarthy <
> ppenumar...@mapr.com> wrote:
>
> With average row size method, since I know number of rows and the average
> size for each column,
> I am planning to use that information to allocate required memory for each
> vector upfront.
> This should help avoid copying every time we double and also improve
> memory utilization.
>
> Thanks
> Padma
>
>
> On Feb 11, 2018, at 3:44 PM, Paul Rogers wrote:
>
> One more thought:
> 3) Assuming that you go with the average batch size calculation approach,
>
> The average batch size approach is a quick and dirty approach for non-leaf
> operators that can observe an incoming batch to estimate row width. 

Re: Batch Sizing for Parquet Flat Reader

2018-03-04 Thread Paul Rogers
Hi Aman,
Please see my comment in DRILL-6147.
For the hangout to be productive, perhaps we should create test cases that will 
show the benefit of DRILL-6147 relative to the result set loader.
The test case of interest has three parts:
* Multiple variable-width fields (say five) with a large variance in field widths in each field
* Large data set that will be split across multiple batches (say 10 or 50 batches)
* Constraints on total batch size and size of the largest vector
Clearly, we can't try this out with Parquet: that's the topic we are discussing.
But, we can generate a data set in code, then do a unit test of the two methods 
(just the vector loading bits) and time the result. Similar code already exists 
in the result set loader branch that can be repurposed for this use. We'd want 
to create a similar test for the DRILL-6147 mechanisms. We can work out the 
details in a separate discussion.
IMHO, if the results are the same, we should go with one solution. If 
DRILL-6147 is significantly faster, the decision is clear: we have two 
solutions.
We also should consider things such as selection, null columns, implicit 
columns, and the other higher-level functionality provided by the result set 
loader. Since Parquet already has ad-hoc solutions for these, with DRILL-6147 
we'd simply keep those solutions for Parquet, while the other readers use the 
new, unified mechanisms.
In terms of time, this week is busy:
* Wed. 3PM or later
* Fri. 3PM or later
The week of the 12th is much more open.
Thanks,
- Paul

 

On Sunday, March 4, 2018, 11:48:33 AM PST, Aman Sinha 
 wrote:  
 
 Hi all, with reference to DRILL-6147, given the overlapping
approaches, I feel like we should have a separate hangout session with
interested parties and discuss the details.
Let me know and I can set up one.

Aman

On Mon, Feb 12, 2018 at 8:50 AM, Padma Penumarthy 
wrote:

> If our goal is not to allocate more than 16MB for individual vectors to
> avoid external fragmentation, I guess we can take that also into consideration
> in our calculations to figure out the outgoing number of rows.
> The math might become more complex. But the main point, like you said, is that
> operators know what they are getting and can figure out how to deal with that
> to honor the constraints imposed.
>
> Thanks
> Padma
>
>
> On Feb 12, 2018, at 8:25 AM, Paul Rogers wrote:
>
> Agreed that allocating vectors up front is another good improvement.
> The average batch size approach gets us 80% of the way to the goal: it
> limits batch size and allows vector preallocation.
> What it cannot do is limit individual vector sizes. Nor can it ensure that
> the resulting batch is optimally loaded with data. Getting the remaining
> 20% requires the level of detail provided by the result set loader.
> We are driving to use the result set loader first in readers, since
> readers can't use the average batch size (they don't have an input batch to
> use to obtain sizes.)
> To use the result set loader in non-leaf operators, we'd need to modify
> code generation. AFAIK, that is not something anyone is working on, so
> another advantage of the average batch size method is that it works with
> the code generation we already have.
> Thanks,
> - Paul
>
>
>
>    On Sunday, February 11, 2018, 7:28:52 PM PST, Padma Penumarthy <
> ppenumar...@mapr.com> wrote:
>
> With average row size method, since I know number of rows and the average
> size for each column,
> I am planning to use that information to allocate required memory for each
> vector upfront.
> This should help avoid copying every time we double and also improve
> memory utilization.
>
> Thanks
> Padma
>
>
> On Feb 11, 2018, at 3:44 PM, Paul Rogers wrote:
>
> One more thought:
> 3) Assuming that you go with the average batch size calculation approach,
>
> The average batch size approach is a quick and dirty approach for non-leaf
> operators that can observe an incoming batch to estimate row width. Because
> Drill batches are large, the law of large numbers means that the average of
> a large input batch is likely to be a good estimator for the average size
> of a large output batch.
> Note that this works only because non-leaf operators have an input batch
> to sample. Leaf operators (readers) do not have this luxury. Hence the
> result set loader uses the actual accumulated size for the current batch.
> Also note that the average row method, while handy, is not optimal. It
> will, in general, result in greater internal fragmentation than the result
> set loader. Why? The result set loader packs vectors right up to the point
> where the largest would overflow. The average row method works at the
> aggregate level and will likely result in wasted space (internal
> fragmentation) in the largest vector.

Re: Batch Sizing for Parquet Flat Reader

2018-03-04 Thread Aman Sinha
Hi all, with reference to DRILL-6147, given the overlapping
approaches, I feel like we should have a separate hangout session with
interested parties and discuss the details.
Let me know and I can set up one.

Aman

On Mon, Feb 12, 2018 at 8:50 AM, Padma Penumarthy 
wrote:

> If our goal is not to allocate more than 16MB for individual vectors to
> avoid external fragmentation, I guess we can take that also into consideration
> in our calculations to figure out the outgoing number of rows.
> The math might become more complex. But the main point, like you said, is that
> operators know what they are getting and can figure out how to deal with that
> to honor the constraints imposed.
>
> Thanks
> Padma
>
>
> On Feb 12, 2018, at 8:25 AM, Paul Rogers wrote:
>
> Agreed that allocating vectors up front is another good improvement.
> The average batch size approach gets us 80% of the way to the goal: it
> limits batch size and allows vector preallocation.
> What it cannot do is limit individual vector sizes. Nor can it ensure that
> the resulting batch is optimally loaded with data. Getting the remaining
> 20% requires the level of detail provided by the result set loader.
> We are driving to use the result set loader first in readers, since
> readers can't use the average batch size (they don't have an input batch to
> use to obtain sizes.)
> To use the result set loader in non-leaf operators, we'd need to modify
> code generation. AFAIK, that is not something anyone is working on, so
> another advantage of the average batch size method is that it works with
> the code generation we already have.
> Thanks,
> - Paul
>
>
>
>On Sunday, February 11, 2018, 7:28:52 PM PST, Padma Penumarthy <
> ppenumar...@mapr.com> wrote:
>
> With average row size method, since I know number of rows and the average
> size for each column,
> I am planning to use that information to allocate required memory for each
> vector upfront.
> This should help avoid copying every time we double and also improve
> memory utilization.
>
> Thanks
> Padma
>
>
> On Feb 11, 2018, at 3:44 PM, Paul Rogers wrote:
>
> One more thought:
> 3) Assuming that you go with the average batch size calculation approach,
>
> The average batch size approach is a quick and dirty approach for non-leaf
> operators that can observe an incoming batch to estimate row width. Because
> Drill batches are large, the law of large numbers means that the average of
> a large input batch is likely to be a good estimator for the average size
> of a large output batch.
> Note that this works only because non-leaf operators have an input batch
> to sample. Leaf operators (readers) do not have this luxury. Hence the
> result set loader uses the actual accumulated size for the current batch.
> Also note that the average row method, while handy, is not optimal. It
> will, in general, result in greater internal fragmentation than the result
> set loader. Why? The result set loader packs vectors right up to the point
> where the largest would overflow. The average row method works at the
> aggregate level and will likely result in wasted space (internal
> fragmentation) in the largest vector. Said another way, with the average
> row size method, we can usually pack in a few more rows before the batch
> actually fills, and so we end up with batches with lower "density" than the
> optimal. This is important when the consuming operator is a buffering one
> such as sort.
> The key reason Padma is using the quick & dirty average row size method is
> not that it is ideal (it is not), but rather that it is, in fact, quick.
> We do want to move to the result set loader over time so we get improved
> memory utilization. And, it is the only way to control row size in readers
> such as CSV or JSON in which we have no size information until we read the
> data.
> - Paul
>
>


Re: Batch Sizing for Parquet Flat Reader

2018-02-12 Thread Padma Penumarthy
If our goal is not to allocate more than 16MB for individual vectors to avoid
external fragmentation, I guess we can take that also into consideration in our
calculations to figure out the outgoing number of rows.
The math might become more complex. But the main point, like you said, is that
operators know what they are getting and can figure out how to deal with that
to honor the constraints imposed.
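
As a rough sketch (Java; hypothetical names, not actual Drill code), the
combined calculation might look like this:

// Sketch: cap the outgoing row count by both the overall batch budget and a
// 16 MB limit on any single vector, using per-column average value widths.
public class RowCountPlanner {
  static final long MAX_VECTOR_BYTES = 16L * 1024 * 1024;

  public static int outgoingRowCount(long batchBudgetBytes, double[] avgColWidths) {
    double rowWidth = 0;
    double widestCol = 0;
    for (double w : avgColWidths) {
      rowWidth += w;
      widestCol = Math.max(widestCol, w);
    }
    long byBatch = (long) (batchBudgetBytes / Math.max(rowWidth, 1.0));
    long byVector = (long) (MAX_VECTOR_BYTES / Math.max(widestCol, 1.0));
    return (int) Math.max(1, Math.min(byBatch, byVector)); // honor both constraints
  }
}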

Thanks
Padma


On Feb 12, 2018, at 8:25 AM, Paul Rogers wrote:

Agreed that allocating vectors up front is another good improvement.
The average batch size approach gets us 80% of the way to the goal: it limits 
batch size and allows vector preallocation.
What it cannot do is limit individual vector sizes. Nor can it ensure that the 
resulting batch is optimally loaded with data. Getting the remaining 20% 
requires the level of detail provided by the result set loader.
We are driving to use the result set loader first in readers, since readers 
can't use the average batch size (they don't have an input batch to use to 
obtain sizes.)
To use the result set loader in non-leaf operators, we'd need to modify code 
generation. AFAIK, that is not something anyone is working on, so another 
advantage of the average batch size method is that it works with the code 
generation we already have.
Thanks,
- Paul



   On Sunday, February 11, 2018, 7:28:52 PM PST, Padma Penumarthy wrote:

With average row size method, since I know number of rows and the average size 
for each column,
I am planning to use that information to allocate required memory for each 
vector upfront.
This should help avoid copying every time we double and also improve memory 
utilization.

Thanks
Padma


On Feb 11, 2018, at 3:44 PM, Paul Rogers wrote:

One more thought:
3) Assuming that you go with the average batch size calculation approach,

The average batch size approach is a quick and dirty approach for non-leaf 
operators that can observe an incoming batch to estimate row width. Because 
Drill batches are large, the law of large numbers means that the average of a 
large input batch is likely to be a good estimator for the average size of a 
large output batch.
Note that this works only because non-leaf operators have an input batch to 
sample. Leaf operators (readers) do not have this luxury. Hence the result set 
loader uses the actual accumulated size for the current batch.
Also note that the average row method, while handy, is not optimal. It will, in 
general, result in greater internal fragmentation than the result set loader. 
Why? The result set loader packs vectors right up to the point where the 
largest would overflow. The average row method works at the aggregate level and 
will likely result in wasted space (internal fragmentation) in the largest 
vector. Said another way, with the average row size method, we can usually pack 
in a few more rows before the batch actually fills, and so we end up with 
batches with lower "density" than the optimal. This is important when the 
consuming operator is a buffering one such as sort.
The key reason Padma is using the quick & dirty average row size method is not 
that it is ideal (it is not), but rather that it is, in fact, quick.
We do want to move to the result set loader over time so we get improved memory 
utilization. And, it is the only way to control row size in readers such as CSV 
or JSON in which we have no size information until we read the data.
- Paul



Re: Batch Sizing for Parquet Flat Reader

2018-02-12 Thread Paul Rogers
Agreed that allocating vectors up front is another good improvement.
The average batch size approach gets us 80% of the way to the goal: it limits 
batch size and allows vector preallocation.
What it cannot do is limit individual vector sizes. Nor can it ensure that the 
resulting batch is optimally loaded with data. Getting the remaining 20% 
requires the level of detail provided by the result set loader.
We are driving to use the result set loader first in readers, since readers 
can't use the average batch size (they don't have an input batch to use to 
obtain sizes.)
To use the result set loader in non-leaf operators, we'd need to modify code 
generation. AFAIK, that is not something anyone is working on, so another 
advantage of the average batch size method is that it works with the code 
generation we already have.
Thanks,
- Paul

 

On Sunday, February 11, 2018, 7:28:52 PM PST, Padma Penumarthy 
 wrote:  
 
 With average row size method, since I know number of rows and the average size 
for each column, 
I am planning to use that information to allocate required memory for each 
vector upfront. 
This should help avoid copying every time we double and also improve memory 
utilization.

Thanks
Padma


> On Feb 11, 2018, at 3:44 PM, Paul Rogers  wrote:
> 
> One more thought:
>>> 3) Assuming that you go with the average batch size calculation approach,
> 
> The average batch size approach is a quick and dirty approach for non-leaf 
> operators that can observe an incoming batch to estimate row width. Because 
> Drill batches are large, the law of large numbers means that the average of a 
> large input batch is likely to be a good estimator for the average size of a 
> large output batch.
> Note that this works only because non-leaf operators have an input batch to 
> sample. Leaf operators (readers) do not have this luxury. Hence the result 
> set loader uses the actual accumulated size for the current batch.
> Also note that the average row method, while handy, is not optimal. It will, 
> in general, result in greater internal fragmentation than the result set 
> loader. Why? The result set loader packs vectors right up to the point where 
> the largest would overflow. The average row method works at the aggregate 
> level and will likely result in wasted space (internal fragmentation) in the 
> largest vector. Said another way, with the average row size method, we can 
> usually pack in a few more rows before the batch actually fills, and so we 
> end up with batches with lower "density" than the optimal. This is important 
> when the consuming operator is a buffering one such as sort.
> The key reason Padma is using the quick & dirty average row size method is 
> not that it is ideal (it is not), but rather that it is, in fact, quick.
> We do want to move to the result set loader over time so we get improved 
> memory utilization. And, it is the only way to control row size in readers 
> such as CSV or JSON in which we have no size information until we read the 
> data.
> - Paul  
  

Re: Batch Sizing for Parquet Flat Reader

2018-02-11 Thread Padma Penumarthy
With average row size method, since I know number of rows and the average size 
for each column, 
I am planning to use that information to allocate required memory for each 
vector upfront. 
This should help avoid copying every time we double and also improve memory 
utilization.
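
A minimal sketch of that upfront sizing (Java; hypothetical names, not the
actual vector allocation API):

// Sketch: given a planned row count and per-column average value sizes,
// compute the bytes to preallocate per vector, avoiding the double-and-copy
// cycle mentioned above.
public class VectorPreallocator {
  public static long[] allocationSizes(int rowCount, double[] avgColumnBytes) {
    long[] bytesPerVector = new long[avgColumnBytes.length];
    for (int i = 0; i < avgColumnBytes.length; i++) {
      // Round up so estimation error costs at most a little slack, not a realloc.
      bytesPerVector[i] = (long) Math.ceil(rowCount * avgColumnBytes[i]);
    }
    return bytesPerVector;
  }
}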

Thanks
Padma


> On Feb 11, 2018, at 3:44 PM, Paul Rogers  wrote:
> 
> One more thought:
>>> 3) Assuming that you go with the average batch size calculation approach,
> 
> The average batch size approach is a quick and dirty approach for non-leaf 
> operators that can observe an incoming batch to estimate row width. Because 
> Drill batches are large, the law of large numbers means that the average of a 
> large input batch is likely to be a good estimator for the average size of a 
> large output batch.
> Note that this works only because non-leaf operators have an input batch to 
> sample. Leaf operators (readers) do not have this luxury. Hence the result 
> set loader uses the actual accumulated size for the current batch.
> Also note that the average row method, while handy, is not optimal. It will, 
> in general, result in greater internal fragmentation than the result set 
> loader. Why? The result set loader packs vectors right up to the point where 
> the largest would overflow. The average row method works at the aggregate 
> level and will likely result in wasted space (internal fragmentation) in the 
> largest vector. Said another way, with the average row size method, we can 
> usually pack in a few more rows before the batch actually fills, and so we 
> end up with batches with lower "density" than the optimal. This is important 
> when the consuming operator is a buffering one such as sort.
> The key reason Padma is using the quick & dirty average row size method is 
> not that it is ideal (it is not), but rather that it is, in fact, quick.
> We do want to move to the result set loader over time so we get improved 
> memory utilization. And, it is the only way to control row size in readers 
> such as CSV or JSON in which we have no size information until we read the 
> data.
> - Paul   



Re: Batch Sizing for Parquet Flat Reader

2018-02-11 Thread Paul Rogers
Hi Salim.
Thanks much for the detailed explanation! You clearly have developed a deep 
understanding of the Parquet code and its impact on CPU and I/O performance. My 
comments are more from the holistic perspective as Drill as a whole.
Far too much to discuss on the dev list. I've added your comments, and my 
response, to DRILL-6147.
The key question is: what is our end goal and what path gets us there with the 
least effort? Perhaps those design doc updates Parth requested could spell that 
out a bit more.
Thanks,
- Paul


On Sunday, February 11, 2018, 2:36:14 PM PST, salim achouche 
 wrote:  
 
 Thanks Paul for your feedback! Let me try to answer some of your questions / 
comments:

Duplicate Implementation
- I am not contemplating two different implementations; one for Parquet and 
another for the rest of the code
- Instead, I am reacting to the fact that we have two different processing 
patterns Row Oriented and Columnar
- The goal is to offer both strategies depending on the operator

Complex Vs Flat Parquet Readers
- The Complex and Flat Parquet readers are quite different
- I presume, for the sake of performance, we can enhance our SQL capabilities 
so that the Flat Parquet reader can be invoked more frequently

Predicate Pushdown
- The reason I invoked Predicate Pushdown within the document is to help the 
analysis:
  o Notice how Record Batch materialization could involve many more pages
  o A solution that relies mainly on the current set of pages (one per column) 
might pay a heavy IO price without much to show for it
      + By waiting for all columns to have at least one page loaded so that 
upfront stats are gathered 
      + Batch memory is then divided optimally across columns and the current 
batch size is computed
      + Unfortunately, such logic will fail if more pages are involved than the 
ones taken into consideration
  o Example -
      + Two variable length columns c1 and c2
       + Reader waits for two pages P1-1 and P2-1 so that we can a) allocate memory 
optimally across c1 and c2 and b) compute a batch size that will minimize 
overflow logic
      + Assume, because of data length skew or predicate pushdown, that more 
pages are involved in loading the batch
      + for c1: {P1-1, P1-2, P1-3, P1-4}, c2: {P2-1, P2-2} 
       + It is now highly possible that overflow logic might not be optimal 
since only two pages' statistics were considered instead of six

 - I have added new logic to the ScanBatch so as to log (on-demand) extra batch 
statistics which will help us assess the efficiency of the batch sizing 
strategy; will add this information to the document when this sub-task is done


Implementation Strategy
- DRILL-6147's mission is to implement batch sizing for Flat Parquet with minimal 
overhead
- This will also help us test this functionality for end-to-end cases (whole 
query)
- My next task (after DRILL-6147) is to incorporate your framework with Parquet 
- I will a) enhance the framework to support columnar processing and b) 
refactor the Parquet code to use the framework
- I agree there might be some duplicate effort but I really believe this will 
be minimal
- DRILL-6147 is not more than one week of research & analysis and one week of 
implementation

Regards,
Salim



> On Feb 11, 2018, at 1:35 PM, Paul Rogers  wrote:
> 
> Hi All,
> Perhaps this topic needs just a bit more thought and discussion to avoid 
> working at cross purposes. I've outlined the issues, and a possible path 
> forward, in a comment to DRILL-6147.
> Quick summary: creating a second batch size implementation just for Parquet 
> will be very difficult once we handle all the required use cases as spelled 
> out in the comment. We'd want to be very sure that we do, indeed, want to 
> duplicate this effort before we head down that route. Duplicating the effort 
> means repeating all the work done over the last six months to make the 
> original result set loader work, and the future work needed to maintain two 
> parallel systems. This is not a decision to make by default.
> Thanks,
> - Paul
> 
>    On Sunday, February 11, 2018, 12:10:58 AM PST, Parth Chandra 
> wrote:  
> 
> Thanks Salim.
> Can you add this to the JIRA/design doc. Also, I would venture to suggest
> that the section on predicate pushdown can be made clearer.
> Also, since you're proposing the average batch size approach with overflow
> handling, some detail on the proposed changes to the framework would be
> useful in the design doc. (Perhaps pseudo code and affected classes.)
> Essentially some guarantees provided by the framework will change and this
> may affect (or not) the existing usage. These should be enumerated in the
> design doc.
> 
> 
  

Re: Batch Sizing for Parquet Flat Reader

2018-02-11 Thread Paul Rogers
Parth notes:

Also note that memory allocations by Netty greater than the 16MB chunk size are 
returned to the OS when the memory is free'd. Both this document and the 
original document on memory fragmentation state incorrectly that such memory is 
not released back to the OS. A quick thought experiment - where does this 
memory go if it is not released back to the OS?

This is true. If the original docs said otherwise, then it is an error for 
which I apologize. If this were not true, we'd have lots of memory leaks, which 
we'd have found and fixed. So, clearly memory is returned.
It is not returning memory to the OS that is the issue. Rather, it is the 
fragmentation that occurs when most memory is on the Netty free list and we 
want to get a large chunk from the OS. We can run out of memory even when lots 
is free (in Netty).
The original jemalloc paper talks about an algorithm to return unused memory to 
the OS; perhaps we can add that to our own Netty-based allocator.
We'd want to be clever, however, because allocations from the OS are 1000 times 
slower than allocations from the Netty free list, or at least that was true in a 
prototype I did a year ago on the Mac.
Further, in the general case, even Netty is not a panacea. Even if we keep 
blocks to 16 MB and smaller, doing random-sized allocations in random order 
will cause Netty fragmentation: we might want a 16 MB block, half of memory 
might be free, but due to historical alloc/free patterns, all memory is free as 
8 MB blocks and so allocation fails.
Java avoids this issue because it does compaction of free heap space. I'd guess 
we don't really want to try to implement that for direct memory.
This is why DBs generally use fixed-size allocations: it completely avoids 
memory fragmentation issues. One of the goals of the recent "result set loader" 
work is to encapsulate all vector accesses in a higher-level abstraction so 
that, eventually, we can try alternative memory layouts with minimal impact on 
the rest of Drill code. (The column reader and writer layer isolates code from 
actual vector APIs and memory layout.)
Thanks,
- Paul  

Re: Batch Sizing for Parquet Flat Reader

2018-02-11 Thread salim achouche
Paul,

I cannot thank you enough for your help and guidance! You are right that 
columnar readers will have a harder time balancing resource requirements and 
performance. Nevertheless, DRILL-6147 is a starting point; it should allow us 
to gain knowledge and accordingly refine our strategy as we go.

FYI - On a completely different topic; I was working on an EBF regarding the 
Parquet complex reader (though the bug was midstream). I was surprised by the 
level of overhead associated with nested data processing; literally, the code 
was jumping from one column/level to another just to process a single value. 
There was a comment to perform such processing in a bulk manner (which I agree 
with). The moral of the story is that Drill is dealing with complex use-cases 
that haven’t been dealt with before (at least not with great success); as can 
be seen, we started with simpler solutions only to realize they are 
inefficient. What is needed is to spend time understanding such use-cases and 
incrementally addressing those shortcomings.

Regards,
Salim


> On Feb 11, 2018, at 3:44 PM, Paul Rogers wrote:
> 
> One more thought:
>>> 3) Assuming that you go with the average batch size calculation approach,
> 
> The average batch size approach is a quick and dirty approach for non-leaf 
> operators that can observe an incoming batch to estimate row width. Because 
> Drill batches are large, the law of large numbers means that the average of a 
> large input batch is likely to be a good estimator for the average size of a 
> large output batch.
> Note that this works only because non-leaf operators have an input batch to 
> sample. Leaf operators (readers) do not have this luxury. Hence the result 
> set loader uses the actual accumulated size for the current batch.
> Also note that the average row method, while handy, is not optimal. It will, 
> in general, result in greater internal fragmentation than the result set 
> loader. Why? The result set loader packs vectors right up to the point where 
> the largest would overflow. The average row method works at the aggregate 
> level and will likely result in wasted space (internal fragmentation) in the 
> largest vector. Said another way, with the average row size method, we can 
> usually pack in a few more rows before the batch actually fills, and so we 
> end up with batches with lower "density" than the optimal. This is important 
> when the consuming operator is a buffering one such as sort.
> The key reason Padma is using the quick & dirty average row size method is 
> not that it is ideal (it is not), but rather that it is, in fact, quick.
> We do want to move to the result set loader over time so we get improved 
> memory utilization. And, it is the only way to control row size in readers 
> such as CSV or JSON in which we have no size information until we read the 
> data.
> - Paul   



Re: Batch Sizing for Parquet Flat Reader

2018-02-11 Thread Paul Rogers
One more thought:
> > 3) Assuming that you go with the average batch size calculation approach,

The average batch size approach is a quick and dirty approach for non-leaf 
operators that can observe an incoming batch to estimate row width. Because 
Drill batches are large, the law of large numbers means that the average of a 
large input batch is likely to be a good estimator for the average size of a 
large output batch.
Note that this works only because non-leaf operators have an input batch to 
sample. Leaf operators (readers) do not have this luxury. Hence the result set 
loader uses the actual accumulated size for the current batch.
Also note that the average row method, while handy, is not optimal. It will, in 
general, result in greater internal fragmentation than the result set loader. 
Why? The result set loader packs vectors right up to the point where the 
largest would overflow. The average row method works at the aggregate level and 
will likely result in wasted space (internal fragmentation) in the largest 
vector. Said another way, with the average row size method, we can usually pack 
in a few more rows before the batch actually fills, and so we end up with 
batches with lower "density" than the optimal. This is important when the 
consuming operator is a buffering one such as sort.
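
A minimal sketch of that packing behavior (Java; hypothetical writer, not the
actual result set loader API):

// Write rows until any column's vector would overflow; the row that does not
// fit is carried over as the first row of the next batch.
public class OverflowingBatchWriter {
  private final long[] bytesUsed;
  private final long[] vectorCapacity;
  private byte[][] overflowRow; // row to replay into the next batch
  private int rowCount;

  public OverflowingBatchWriter(long[] vectorCapacity) {
    this.vectorCapacity = vectorCapacity;
    this.bytesUsed = new long[vectorCapacity.length];
  }

  /** Returns false when the batch is full; the rejected row is saved whole. */
  public boolean writeRow(byte[][] row) {
    for (int col = 0; col < row.length; col++) {
      if (bytesUsed[col] + row[col].length > vectorCapacity[col]) {
        overflowRow = row; // never split a row across batches
        return false;
      }
    }
    for (int col = 0; col < row.length; col++) {
      bytesUsed[col] += row[col].length; // stand-in for the actual vector write
    }
    rowCount++;
    return true;
  }

  public byte[][] overflowRow() { return overflowRow; }
  public int rowCount() { return rowCount; }
}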
The key reason Padma is using the quick & dirty average row size method is not 
that it is ideal (it is not), but rather that it is, in fact, quick.
We do want to move to the result set loader over time so we get improved memory 
utilization. And, it is the only way to control row size in readers such as CSV 
or JSON in which we have no size information until we read the data.
- Paul   

Re: Batch Sizing for Parquet Flat Reader

2018-02-11 Thread salim achouche
Thanks Parth for your feedback! I am planning to enhance the document based
on the received feedback and the prototype I am currently working on!

Regards,
Salim

On Sun, Feb 11, 2018 at 2:36 PM, salim achouche 
wrote:

> Thanks Paul for your feedback! Let me try to answer some of your questions
> / comments:
>
> *Duplicate Implementation*
> - I am *not* contemplating two different implementations; one for Parquet
> and another for the rest of the code
> - Instead, I am reacting to the fact that we have two different processing
> patterns Row Oriented and Columnar
> - The goal is to offer both strategies depending on the operator
>
> Complex Vs Flat Parquet Readers
> - The Complex and Flat Parquet readers are quite different
> - I presume, for the sake of performance, we can enhance our SQL
> capabilities so that the Flat Parquet reader can be invoked more frequently
>
> *Predicate Pushdown*
> - The reason I invoked Predicate Pushdown within the document is to help
> the analysis:
>o Notice how Record Batch materialization could involve many more pages
>o A solution that relies mainly on the current set of pages (one per
> column) might pay a heavy IO price without much to show for it
>   + By waiting for all columns to have at least one page loaded so
> that upfront stats are gathered
>   + Batch memory is then divided optimally across columns and the
> current batch size is computed
>   + Unfortunately, such logic will fail if more pages are involved
> than the ones taken into consideration
>o Example -
>   + Two variable length columns c1 and c2
>   + Reader waits for two pages P1-1 and P2-1 so that we can a) allocate
> memory optimally across c1 and c2 and b) compute a batch size that will
> minimize overflow logic
>   + Assume, because of data length skew or predicate pushdown, that
> more pages are involved in loading the batch
>   + for c1: {P1-1, P1-2, P1-3, P1-4}, c2: {P2-1, P2-2}
>   + It is now highly possible that overflow logic might not be optimal
> since only two pages' statistics were considered instead of six
>
>  - I have added new logic to the ScanBatch so as to log (on-demand) extra
> batch statistics which will help us assess the efficiency of the batch
> sizing strategy; will add this information to the document when this
> sub-task is done
>
>
> *Implementation Strategy*
> - DRILL-6147's mission is to implement batch sizing for Flat Parquet with 
> *minimal
> overhead*
> - This will also help us test this functionality for end-to-end cases
> (whole query)
> - My next task (after DRILL-6147) is to incorporate your framework with
> Parquet
> - I will a) enhance the framework to support columnar processing and b)
> refactor the Parquet code to use the framework
> - I agree there might be some duplicate effort but I really believe
> this will be minimal
> - DRILL-6147 is not more than one week of research & analysis and one week
> of implementation
>
> Regards,
> Salim
>
>
>
> On Feb 11, 2018, at 1:35 PM, Paul Rogers 
> wrote:
>
> Hi All,
> Perhaps this topic needs just a bit more thought and discussion to avoid
> working at cross purposes. I've outlined the issues, and a possible path
> forward, in a comment to DRILL-6147.
> Quick summary: creating a second batch size implementation just for
> Parquet will be very difficult once we handle all the required use cases as
> spelled out in the comment. We'd want to be very sure that we do, indeed,
> want to duplicate this effort before we head down that route. Duplicating
> the effort means repeating all the work done over the last six months to
> make the original result set loader work, and the future work needed to
> maintain two parallel systems. This is not a decision to make by default.
> Thanks,
> - Paul
>
>On Sunday, February 11, 2018, 12:10:58 AM PST, Parth Chandra <
> par...@apache.org> wrote:
>
> Thanks Salim.
> Can you add this to the JIRA/design doc. Also, I would venture to suggest
> that the section on predicate pushdown can be made clearer.
> Also, since you're proposing the average batch size approach with overflow
> handling, some detail on the proposed changes to the framework would be
> useful in the design doc. (Perhaps pseudo code and affected classes.)
> Essentially some guarantees provided by the framework will change and this
> may affect (or not) the existing usage. These should be enumerated in the
> design doc.
>
>
>
>


Re: Batch Sizing for Parquet Flat Reader

2018-02-11 Thread salim achouche
Thanks Paul for your feedback! Let me try to answer some of your questions / 
comments:

Duplicate Implementation
- I am not contemplating two different implementations; one for Parquet and 
another for the rest of the code
- Instead, I am reacting to the fact that we have two different processing 
patterns Row Oriented and Columnar
- The goal is to offer both strategies depending on the operator

Complex Vs Flat Parquet Readers
- The Complex and Flat Parquet readers are quite different
- I presume, for the sake of performance, we can enhance our SQL capabilities 
so that the Flat Parquet reader can be invoked more frequently

Predicate Pushdown
- The reason I invoked Predicate Pushdown within the document is to help the 
analysis:
   o Notice how Record Batch materialization could involve many more pages
   o A solution that relies mainly on the current set of pages (one per column) 
might pay a heavy IO price without much to show for it
  + By waiting for all columns to have at least one page loaded so that 
upfront stats are gathered 
  + Batch memory is then divided optimally across columns and the current 
batch size is computed
  + Unfortunately, such logic will fail if more pages are involved than the 
ones taken into consideration
   o Example -
  + Two variable length columns c1 and c2
  + Reader waits for two pages P1-1 and P2-1 so that we can a) allocate memory 
optimally across c1 and c2 and b) compute a batch size that will minimize 
overflow logic
  + Assume, because of data length skew or predicate pushdown, that more 
pages are involved in loading the batch
  + for c1: {P1-1, P1-2, P1-3, P1-4}, c2: {P2-1, P2-2} 
  + It is now highly possible that overflow logic might not be optimal 
since only two pages' statistics were considered instead of six (see the sketch
below)
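
With invented numbers (Java; a sketch, not the reader's actual sizing code):

// Sizing from only P1-1 and P2-1 assumes widths the later pages may not match.
public class FirstPageSizingDemo {
  public static void main(String[] args) {
    long batchBudget = 16L << 20;   // 16 MB target batch
    double c1FirstPageAvg = 8.0;    // bytes/value observed in P1-1
    double c2FirstPageAvg = 8.0;    // bytes/value observed in P2-1
    int plannedRows = (int) (batchBudget / (c1FirstPageAvg + c2FirstPageAvg));

    double c1ActualAvg = 48.0;      // skewed data in P1-2..P1-4
    long actualBytes = (long) (plannedRows * (c1ActualAvg + c2FirstPageAvg));
    System.out.println("planned rows = " + plannedRows);
    System.out.println("actual bytes = " + actualBytes + " vs budget " + batchBudget);
  }
}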

 - I have added new logic to the ScanBatch so as to log (on-demand) extra batch 
statistics which will help us assess the efficiency of the batch sizing 
strategy; will add this information to the document when this sub-task is done


Implementation Strategy
- DRILL-6147's mission is to implement batch sizing for Flat Parquet with minimal 
overhead
- This will also help us test this functionality for end-to-end cases (whole 
query)
- My next task (after DRILL-6147) is to incorporate your framework with Parquet 
- I will a) enhance the framework to support columnar processing and b) 
refactor the Parquet code to use the framework
- I agree there might be some duplicate effort but I really believe this will 
be minimal
- DRILL-6147 is not more than one week of research & analysis and one week of 
implementation

Regards,
Salim



> On Feb 11, 2018, at 1:35 PM, Paul Rogers  wrote:
> 
> Hi All,
> Perhaps this topic needs just a bit more thought and discussion to avoid 
> working at cross purposes. I've outlined the issues, and a possible path 
> forward, in a comment to DRILL-6147.
> Quick summary: creating a second batch size implementation just for Parquet 
> will be very difficult once we handle all the required use cases as spelled 
> out in the comment. We'd want to be very sure that we do, indeed, want to 
> duplicate this effort before we head down that route. Duplicating the effort 
> means repeating all the work done over the last six months to make the 
> original result set loader work, and the future work needed to maintain two 
> parallel systems. This is not a decision to make by default.
> Thanks,
> - Paul
> 
>On Sunday, February 11, 2018, 12:10:58 AM PST, Parth Chandra 
>  wrote:  
> 
> Thanks Salim.
> Can you add this to the JIRA/design doc. Also, I would venture to suggest
> that the section on predicate pushdown can be made clearer.
> Also, since you're proposing the average batch size approach with overflow
> handling, some detail on the proposed changes to the framework would be
> useful in the design doc. (Perhaps pseudo code and affected classes.)
> Essentially some guarantees provided by the framework will change and this
> may affect (or not) the existing usage. These should be enumerated in the
> design doc.
> 
> 



Re: Batch Sizing for Parquet Flat Reader

2018-02-11 Thread Paul Rogers
Hi All,
Perhaps this topic needs just a bit more thought and discussion to avoid 
working at cross purposes. I've outlined the issues, and a possible path 
forward, in a comment to DRILL-6147.
Quick summary: creating a second batch size implementation just for Parquet 
will be very difficult once we handle all the required use cases as spelled out 
in the comment. We'd want to be very sure that we do, indeed, want to duplicate 
this effort before we head down that route. Duplicating the effort means 
repeating all the work done over the last six months to make the original 
result set loader work, and the future work needed to maintain two parallel 
systems. This is not a decision to make by default.
Thanks,
- Paul

On Sunday, February 11, 2018, 12:10:58 AM PST, Parth Chandra 
 wrote:  
 
 Thanks Salim.
Can you add this to the JIRA/design doc. Also, I would venture to suggest
that the section on predicate pushdown can be made clearer.
Also, since you're proposing the average batch size approach with overflow
handling, some detail on the proposed changes to the framework would be
useful in the design doc. (Perhaps pseudo code and affected classes.)
 Essentially some guarantees provided by the framework will change and this
may affect (or not) the existing usage. These should be enumerated in the
design doc.


  

Re: Batch Sizing for Parquet Flat Reader

2018-02-11 Thread Parth Chandra
Thanks Salim.
Can you add this to the JIRA/design doc. Also, I would venture to suggest
that the section on predicate pushdown can be made clearer.
Also, since you're proposing the average batch size approach with overflow
handling, some detail on the proposed changes to the framework would be
useful in the design doc. (Perhaps pseudo code and affected classes.)
 Essentially some guarantees provided by the framework will change and this
may affect (or not) the existing usage. These should be enumerated in the
design doc.





On Fri, Feb 9, 2018 at 11:52 PM, salim achouche 
wrote:

> Thank you Parth for providing feedback; please find my answers below:
>
> I have created Apache JIRA DRILL-6147 for tracking this improvement.
>
> >  2) Not sure where you were going with the predicate pushdown section and
> how it pertains to your proposed batch sizing.
>
> Predicate push down was part of the Design Considerations section; the
> intent is that the design should be able to handle future use-cases such as
> push down. Notice how the Page based Statistical Approach will not work
> well with predicate push down as one single batch can span many pages per
> column.
>
> > 3) Assuming that you go with the average batch size calculation approach,
> are you proposing to have a Parquet scan specific overflow implementation?
> Or are you planning to leverage the ResultSet loader mechanism? If you plan
> to use the latter, it will need to be enhanced to handle a bulk chunk as
> opposed to a single value at a time. If not using the ResultSet loader
> mechanism, why not (you would be reinventing the wheel) ?
>
> Padma Penumarthy and I are currently working on the batch sizing
> functionality and selected a few TPCH queries to showcase end-to-end use
> cases. Immediately after this task, I'll be working on enhancing the new
> framework to support columnar processing and as such retrofit DRILL-6147
> implementation as part of the new framework. So essentially we want to make
> progress on both fronts so that a) OOM conditions are minimized as soon as
> possible and b) the new Reader framework is applied to all readers and
> operators and rolled out in the next few releases.
>
> > Also note that memory allocations by Netty greater than the 16MB chunk
> size
> are returned to the OS when the memory is free'd. Both this document and
> the original document on memory fragmentation state incorrectly that such
> memory is not released back to the OS. A quick thought experiment - where
> does this memory go if it is not released back to the OS?
>
> I have the same understanding as you:
> - I think Paul meant that 16 MB blocks are not released to the OS (cached
> within Netty)
> - Many memory allocators exhibit the same behavior as the release mechanism
> is slow (heuristics used to decide when to release so to balance between
> performance and resource usage)
> - Basically, if Drill holds a large count of 16 MB blocks, then a 32 MB, 64
> MB, etc. memory allocation might fail since
>   *  none of the Netty allocated blocks can satisfy the new request
>   *  a new OS allocation will take Drill beyond the maximum direct memory
>
>
> On Fri, Feb 9, 2018 at 4:08 AM, Parth Chandra  wrote:
>
> > Is there a JIRA for this? Would be useful to capture the comments in the
> > JIRA. Note that the document itself is not comment-able as it is shared
> > with view-only permissions.
> >
> > Some thoughts in no particular order-
> > 1) The Page based statistical approach is likely to run into trouble with
> > the encoding used for Parquet fields, especially RLE, which drastically
> > changes the size of the field. So pageSize/numValues is going to be
> wildly
> > inaccurate with RLE.
> > 2) Not sure where you were going with the predicate pushdown section and
> > how it pertains to your proposed batch sizing.
> > 3) Assuming that you go with the average batch size calculation approach,
> > are you proposing to have a Parquet scan specific overflow
> implementation?
> > Or are you planning to leverage the ResultSet loader mechanism? If you
> plan
> > to use the latter, it will need to be enhanced to handle a bulk chunk as
> > opposed to a single value at a time. If not using the ResultSet loader
> > mechanism, why not (you would be reinventing the wheel) ?
> > 4) Parquet page level stats are probably not reliable. You can assume
> page
> > size (compressed/uncompressed) and value count are accurate, but nothing
> > else.
> >
> > Also note that memory allocations by Netty greater than the 16MB chunk
> size
> > are returned to the OS when the memory is free'd. Both this document and
> > the original document on memory fragmentation state incorrectly that such
> > memory is not released back to the OS. A quick thought experiment - where
> > does this memory go if it is not released back to the OS?
> >
> >
> >
> > On Fri, Feb 9, 2018 at 7:12 AM, salim 

Re: Batch Sizing for Parquet Flat Reader

2018-02-09 Thread salim achouche
Thank you Parth for providing feedback; please find my answers below:

I have created Apache JIRA DRILL-6147 for tracking this improvement.

>  2) Not sure where you were going with the predicate pushdown section and
how it pertains to your proposed batch sizing.

Predicate push down was part of the Design Considerations section; the
intent is that the design should be able to handle future use-cases such as
push down. Notice how the Page based Statistical Approach will not work
well with predicate push down as one single batch can span many pages per
column.

> 3) Assuming that you go with the average batch size calculation approach,
are you proposing to have a Parquet scan specific overflow implementation?
Or are you planning to leverage the ResultSet loader mechanism? If you plan
to use the latter, it will need to be enhanced to handle a bulk chunk as
opposed to a single value at a time. If not using the ResultSet loader
mechanism, why not (you would be reinventing the wheel) ?

Padma Penumarthy and I are currently working on the batch sizing
functionality and selected a few TPCH queries to showcase end-to-end use
cases. Immediately after this task, I'll be working on enhancing the new
framework to support columnar processing and as such retrofit DRILL-6147
implementation as part of the new framework. So essentially we want to make
progress on both fronts so that a) OOM conditions are minimized as soon as
possible and b) the new Reader framework is applied to all readers and
operators and rolled out in the next few releases.

> Also note that memory allocations by Netty greater than the 16MB chunk
size
are returned to the OS when the memory is free'd. Both this document and
the original document on memory fragmentation state incorrectly that such
memory is not released back to the OS. A quick thought experiment - where
does this memory go if it is not released back to the OS?

I have the same understanding as you:
- I think Paul meant that 16 MB blocks are not released to the OS (cached
within Netty)
- Many memory allocators exhibit the same behavior as the release mechanism
is slow (heuristics used to decide when to release so to balance between
performance and resource usage)
- Basically, if Drill holds a large count of 16 MB blocks, then a 32 MB, 64
MB, etc. memory allocation might fail since
  *  none of the Netty allocated blocks can satisfy the new request
  *  a new OS allocation will take Drill beyond the maximum direct memory


On Fri, Feb 9, 2018 at 4:08 AM, Parth Chandra  wrote:

> Is there a JIRA for this? Would be useful to capture the comments in the
> JIRA. Note that the document itself is not comment-able as it is shared
> with view-only permissions.
>
> Some thoughts in no particular order-
> 1) The Page based statistical approach is likely to run into trouble with
> the encoding used for Parquet fields, especially RLE, which drastically
> changes the size of the field. So pageSize/numValues is going to be wildly
> inaccurate with RLE.
> 2) Not sure where you were going with the predicate pushdown section and
> how it pertains to your proposed batch sizing.
> 3) Assuming that you go with the average batch size calculation approach,
> are you proposing to have a Parquet scan specific overflow implementation?
> Or are you planning to leverage the ResultSet loader mechanism? If you plan
> to use the latter, it will need to be enhanced to handle a bulk chunk as
> opposed to a single value at a time. If not using the ResultSet loader
> mechanism, why not (you would be reinventing the wheel) ?
> 4) Parquet page level stats are probably not reliable. You can assume page
> size (compressed/uncompressed) and value count are accurate, but nothing
> else.
>
> Also note that memory allocations by Netty greater than the 16MB chunk size
> are returned to the OS when the memory is free'd. Both this document and
> the original document on memory fragmentation state incorrectly that such
> memory is not released back to the OS. A quick thought experiment - where
> does this memory go if it is not released back to the OS?
>
>
>
> On Fri, Feb 9, 2018 at 7:12 AM, salim achouche 
> wrote:
>
> > The following document
> >  > 9RwG4h0sI81KI5ZEvJ7HzgClCUFpB5WE/edit?ts=5a793606#>
> > describes
> > a proposal for enforcing batch sizing constraints (count and memory)
> within
> > the Parquet Reader (Flat Schema). Please feel free to take a look and
> > provide feedback.
> >
> > Thanks!
> >
> > Regards,
> > Salim
> >
>


Re: Batch Sizing for Parquet Flat Reader

2018-02-09 Thread Parth Chandra
Is there a JIRA for this? Would be useful to capture the comments in the
JIRA. Note that the document itself is not comment-able as it is shared
with view-only permissions.

Some thoughts in no particular order-
1) The Page based statistical approach is likely to run into trouble with
the encoding used for Parquet fields, especially RLE, which drastically
changes the size of the field. So pageSize/numValues is going to be wildly
inaccurate with RLE (see the sketch after this list).
2) Not sure where you were going with the predicate pushdown section and
how it pertains to your proposed batch sizing.
3) Assuming that you go with the average batch size calculation approach,
are you proposing to have a Parquet scan specific overflow implementation?
Or are you planning to leverage the ResultSet loader mechanism? If you plan
to use the latter, it will need to be enhanced to handle a bulk chunk as
opposed to a single value at a time. If not using the ResultSet loader
mechanism, why not (you would be reinventing the wheel) ?
4) Parquet page level stats are probably not reliable. You can assume page
size (compressed/uncompressed) and value count are accurate, but nothing
else.
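
On point 1, a quick sketch with invented numbers of how far off
pageSize/numValues can be (Java):

// An RLE page can be tiny on disk while the decoded values still occupy their
// full fixed width in the value vector.
public class RlePageEstimateDemo {
  public static void main(String[] args) {
    long pageSizeBytes = 1_024;   // highly repetitive column, RLE-encoded
    long valueCount = 100_000;    // values carried by that 1 KB page
    double estBytesPerValue = (double) pageSizeBytes / valueCount; // ~0.01
    int decodedBytesPerValue = 4; // e.g. an INT32 once loaded into a vector
    System.out.printf("pageSize/numValues estimate: %.4f bytes/value%n", estBytesPerValue);
    System.out.println("decoded width: " + decodedBytesPerValue + " bytes/value");
  }
}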

Also note that memory allocations by Netty greater than the 16MB chunk size
are returned to the OS when the memory is free'd. Both this document and
the original document on memory fragmentation state incorrectly that such
memory is not released back to the OS. A quick thought experiment - where
does this memory go if it is not released back to the OS?



On Fri, Feb 9, 2018 at 7:12 AM, salim achouche  wrote:

> The following document
>  9RwG4h0sI81KI5ZEvJ7HzgClCUFpB5WE/edit?ts=5a793606#>
> describes
> a proposal for enforcing batch sizing constraints (count and memory) within
> the Parquet Reader (Flat Schema). Please feel free to take a look and
> provide feedback.
>
> Thanks!
>
> Regards,
> Salim
>


Batch Sizing for Parquet Flat Reader

2018-02-08 Thread salim achouche
The following document describes
a proposal for enforcing batch sizing constraints (count and memory) within
the Parquet Reader (Flat Schema). Please feel free to take a look and
provide feedback.

Thanks!

Regards,
Salim