[ https://issues.apache.org/jira/browse/MADLIB-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16426356#comment-16426356 ]
Jingyi Mei edited comment on MADLIB-1224 at 4/5/18 12:12 AM:
-------------------------------------------------------------

Some math:

1. For row size: each row from the source table is first put into an array, and the arrays are then aggregated using madlib.matrix_agg (double precision[], with 8 bytes per array element), so the estimated size s of one super-row is:
_s = 8 bytes * num_of_element_in_an_array * buffer_size_
and s <= 1GB.

2. For data distribution: each segment will get k super-rows:
_k = total_num_of_rows_in_source_table / (buffer_size * num_of_segment)_
and k >= p, where p is the minimum number of super-rows stored in each segment (a threshold).

We use a more conservative constraint in calculation 1, i.e., s <= 600MB, which gives buffer_size <= 600MB / (8 bytes * num_of_element_in_an_array). After simplifying 1 and 2, we get:
_buffer_size <= min(75 million / num_of_element_in_an_array, total_num_of_rows_in_source_table / (num_of_segment * p))_

_The current implementation takes p = 1, i.e., each segment will get at least 1 super-row._

was (Author: jingyimei):

Some math:

# For row size: each row from the source table is first put into an array, and the arrays are then aggregated using madlib.matrix_agg (double precision[], with 8 bytes per array element), so the estimated size s of one super-row is:
_s = 8 bytes * num_of_element_in_an_array * buffer_size_
and s <= 1GB.
# For data distribution: each segment will get k super-rows:
_k = total_num_of_rows_in_source_table / (buffer_size * num_of_segment)_
and k >= p, where p is the minimum number of super-rows stored in each segment (a threshold).

We use a more conservative constraint in calculation 1, i.e., s <= 600MB, and after simplifying 1 and 2, we get:
_buffer_size <= min(75 million / num_of_element_in_an_array, total_num_of_rows_in_source_table / (num_of_segment * p))_

> Select default buffer size for mini-batch preprocessor
> ------------------------------------------------------
>
>                 Key: MADLIB-1224
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1224
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Utilities
>            Reporter: Jingyi Mei
>            Priority: Major
>             Fix For: v1.14
>
>
> As a follow up of https://issues.apache.org/jira/browse/MADLIB-1200
>
> In minibatch_preprocessor, we made buffer_size an optional parameter. If
> it is not set, a default value will be assigned. Current considerations
> are:
> # Within a segment, each cell has a 1GB limit, so we can't put so many rows
> into one super row that it exceeds the limit.
> # Among segments, data should be distributed as evenly as possible to avoid
> data skew, so that GPDB can work more efficiently.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
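For reference, the buffer_size bound derived in the comment above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the MADlib implementation: the function name `default_buffer_size`, its parameter names, and the choice to read 600MB as 600 * 10^6 bytes (which is what yields the "75 million" figure) are all assumptions, as is clamping the result to at least 1 when the table has fewer rows than num_of_segment * p.

```python
BYTES_PER_DOUBLE = 8                 # one double precision array element
MAX_SUPER_ROW_BYTES = 600 * 10**6    # conservative cap, below the 1GB cell limit


def default_buffer_size(num_elements_per_row, total_rows, num_segments, p=1):
    """Hypothetical helper: largest buffer_size satisfying both constraints.

    p is the minimum number of super-rows each segment should receive.
    """
    if min(num_elements_per_row, total_rows, num_segments, p) < 1:
        raise ValueError("all arguments must be positive integers")

    # Constraint 1: s = 8 * num_elements_per_row * buffer_size <= 600MB
    size_bound = MAX_SUPER_ROW_BYTES // (BYTES_PER_DOUBLE * num_elements_per_row)

    # Constraint 2: k = total_rows / (buffer_size * num_segments) >= p
    skew_bound = total_rows // (num_segments * p)

    # Assumption: never return less than 1, even for very small tables.
    return max(1, min(size_bound, skew_bound))


if __name__ == "__main__":
    # 100 elements per row, 10 million rows, 20 segments
    print(default_buffer_size(100, 10_000_000, 20))
```

With 100 elements per row, the size constraint allows up to 750,000 rows per super-row, but spreading 10 million rows over 20 segments with p = 1 limits buffer_size to 500,000, so the distribution constraint is the binding one in that case.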