[ 
https://issues.apache.org/jira/browse/MADLIB-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16426367#comment-16426367
 ] 

Jingyi Mei commented on MADLIB-1224:
------------------------------------

For grouping, we propose to modify the current formula from 
{code:java}
default_buffer_size = min(75000000.0/independent_var_dimension,
float(num_rows_processed)/num_of_segments){code}

to
{code:java}
default_buffer_size = min(75000000.0/independent_var_dimension,
float(avg_num_rows_processed_by_group)/num_of_segments){code}
Considerations are:
1. For the 1GB per row limit, the constraint `75000000.0/independent_var_dimension` 
still works because it does not change with grouping.

2. For data distribution, we want the data in each group to be distributed across 
all segments, which means we would have to use `min(num_rows_processed_per_group)` 
in the formula. However, in the extreme case where the data is heavily skewed 
toward one group and another group has only a few records (for example, 1), the 
default_buffer_size would be 1, which basically does not pack data at all. We 
therefore propose to use the average number of rows processed per group as a 
compromise.
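
As a rough illustration of the proposed formula (a sketch only, not the actual minibatch_preprocessor code), the function below takes a hypothetical list `rows_per_group` with one row count per group and computes the average from it. The function name and the example numbers are assumptions made for this illustration.

```python
def default_buffer_size(independent_var_dimension, rows_per_group, num_of_segments):
    """Sketch of the proposed default buffer size with grouping.

    rows_per_group is a hypothetical list holding the number of rows
    processed for each group; it is not a name from the MADlib code base.
    """
    # Average number of rows per group, used instead of the total row count.
    avg_num_rows_processed_by_group = float(sum(rows_per_group)) / len(rows_per_group)
    # First term: cap derived from the 1GB limit, unaffected by grouping.
    # Second term: spread an average group's rows across all segments.
    return min(75000000.0 / independent_var_dimension,
               avg_num_rows_processed_by_group / num_of_segments)


# Heavily skewed example: one large group and one group with a single row.
skewed = [999999, 1]
print(default_buffer_size(100, skewed, 4))

# With very wide rows, the 1GB-derived cap becomes the smaller term.
print(default_buffer_size(75000000, skewed, 4))
```

With the skewed input above, the average-based term is 125000 rows, whereas a term based on the smallest group would be driven by the single-row group and leave the data essentially unpacked.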

> Select default buffer size for mini-batch preprocessor
> ------------------------------------------------------
>
>                 Key: MADLIB-1224
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1224
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Utilities
>            Reporter: Jingyi Mei
>            Priority: Major
>             Fix For: v1.14
>
>
> As a follow-up to https://issues.apache.org/jira/browse/MADLIB-1200
>  
> In minibatch_preprocessor, we made buffer_size an optional parameter. If 
> it is not set, a default value will be assigned. Current considerations 
> are:
>  # Within a segment, each cell has a 1GB limit, so we can't pack so many rows 
> into one super row that it exceeds the limit.
>  # Among segments, data should be distributed as equally as possible to avoid 
> data skew, so that GPDB can work more efficiently. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
