Jingyi Mei commented on MADLIB-1224:

Some math:
 #  For row size: each row from the source table is first put into an array, 
and the arrays are then aggregated using madlib.matrix_agg (double precision[], 
with 8 bytes per array element), so the estimated size s of one super-row is:

_s = 8 bytes * num_of_element_in_an_array * buffer_size_

And s <= 1GB

 #  For data distribution, each segment will get k super-rows:

_k = total_num_of_rows_in_source_table/(buffer_size * num_of_segment)_

And k >= p,

where p is the threshold for the number of rows stored in each segment.
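
The two estimates above can be checked numerically. The following is only a minimal sketch: the function names and the example inputs are hypothetical, and the 1GB limit is assumed to be in decimal units (10^9 bytes).

```python
BYTES_PER_ELEMENT = 8        # one double precision array element
CELL_LIMIT_BYTES = 10**9     # 1GB cell limit (assumption: decimal units)


def super_row_bytes(num_of_element_in_an_array, buffer_size):
    """Estimated size s of one super-row, in bytes."""
    return BYTES_PER_ELEMENT * num_of_element_in_an_array * buffer_size


def super_rows_per_segment(total_rows, buffer_size, num_of_segment):
    """Number of super-rows k that each segment gets."""
    return total_rows / (buffer_size * num_of_segment)


# Hypothetical example: 1,000,000 source rows, 100 elements per array,
# buffer_size of 10,000, 4 segments.
s = super_row_bytes(100, 10_000)                   # 8,000,000 bytes
k = super_rows_per_segment(1_000_000, 10_000, 4)   # 25.0 super-rows

print(s <= CELL_LIMIT_BYTES, k)
```

With these inputs, s stays well under the limit and each segment gets 25 super-rows; whether 25 satisfies k >= p depends on the threshold p chosen.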


We use a more conservative constraint in calculation 1, i.e., s <= 600MB, and 
after simplifying 1 and 2, we get:


_Buffer_size <= min(75 million/num_of_element_in_an_array, 
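
The 75 million figure follows from 600MB / 8 bytes, assuming decimal megabytes. Below is a rough sketch of how the two constraints bound buffer_size. The function names are hypothetical, and the second bound is simply constraint 2 (k >= p) rearranged for buffer_size; the formula above is cut off, so it is not confirmed that this matches its second argument.

```python
BYTES_PER_ELEMENT = 8                    # one double precision array element
CONSERVATIVE_LIMIT_BYTES = 600 * 10**6   # 600MB (assumption: decimal units)


def max_buffer_size_for_row_limit(num_of_element_in_an_array):
    """Bound from constraint 1: 8 * n * buffer_size <= 600MB."""
    return CONSERVATIVE_LIMIT_BYTES // (BYTES_PER_ELEMENT * num_of_element_in_an_array)


def max_buffer_size_for_distribution(total_rows, num_of_segment, p):
    """Bound from constraint 2, rearranged: buffer_size <= total / (p * segments)."""
    return total_rows // (p * num_of_segment)


# Hypothetical example: 100 elements per array, 1,000,000 rows, 4 segments, p = 10.
row_bound = max_buffer_size_for_row_limit(100)                    # 750,000
dist_bound = max_buffer_size_for_distribution(1_000_000, 4, 10)   # 25,000
buffer_size_bound = min(row_bound, dist_bound)
print(buffer_size_bound)
```

In this example the distribution constraint is the tighter of the two, so it determines the bound.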

> Select default buffer size for mini-batch preprocessor
> ------------------------------------------------------
>                 Key: MADLIB-1224
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1224
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Utilities
>            Reporter: Jingyi Mei
>            Priority: Major
>             Fix For: v1.14
> As a follow-up to https://issues.apache.org/jira/browse/MADLIB-1200
> In minibatch_preprocessor, we made buffer_size an optional parameter. If 
> it is not set, a default value will be assigned. Current considerations 
> are:
>  # Within a segment, each cell has a 1GB limit, so we can't put so many rows 
> into one super-row that it exceeds the limit.
>  # Among segments, data should be distributed as evenly as possible to avoid 
> data skew, so that GPDB can work more efficiently.
