[ 
https://issues.apache.org/jira/browse/MADLIB-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16426365#comment-16426365
 ] 

Jingyi Mei commented on MADLIB-1224:
------------------------------------

We ran tests on a gpdb43 cluster (with ORCA off), using a dataset with 80 million 
rows and dimension_of_x=11. Here are the results:

Testing buffer size 0.25 million
Time: 1090375.287 ms

Testing buffer size 0.5 million
Time: 1061828.721 ms

Testing buffer size 1 million
Time: 1052107.938 ms

Testing buffer size ~2.5 million
Time: 985144.526 ms

Testing calculated default buffer size ~4.7 million
Time: 917485.995 ms

Testing buffer size ~7.5 million
Time: 887226.883 ms
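
To make the trend easier to read, here is a small sketch that converts the timings above to minutes and to a percentage improvement over the smallest buffer size. The numbers are copied from the results above; the function name summarize is only illustrative.

```python
# Timings copied from the test results above: (buffer size in rows, elapsed ms).
# Sizes reported with "~" are approximate.
timings_ms = [
    (250_000, 1090375.287),
    (500_000, 1061828.721),
    (1_000_000, 1052107.938),
    (2_500_000, 985144.526),
    (4_700_000, 917485.995),  # calculated default buffer size
    (7_500_000, 887226.883),
]


def summarize(timings):
    """Return (buffer_size, minutes, percent faster than the first entry)."""
    baseline_ms = timings[0][1]
    rows = []
    for size, ms in timings:
        minutes = ms / 60000.0
        pct_faster = (baseline_ms - ms) / baseline_ms * 100.0
        rows.append((size, minutes, pct_faster))
    return rows


for size, minutes, pct in summarize(timings_ms):
    print(f"{size:>10,d} rows  {minutes:6.2f} min  {pct:5.1f}% faster than 0.25M")
```

Run time decreases steadily as the buffer grows across the sizes that completed; the largest one (~7.5 million) is a bit under 19% faster than the smallest.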

 

When we tried a buffer size of 10 million, we got the error message {{Memory allocation failed. 
Typically, this indicates that Greenplum Database limits the available memory 
to less than what is needed for this input}}. This error suggests that GPDB ran 
out of per-query memory, which is different from hitting the 1GB-per-row limit.

 

Later, we increased statement_mem to 2GB (the maximum allowed by the postgres 
default setting), and it failed with the same error, {{Memory allocation failed}}. 
After we increased gp_vmem_protect_limit to 8GB, it still failed with the same error.

We never got a chance to hit the actual 1GB-per-row limit, because the query always 
runs into the statement_mem limit (the amount of memory the statement can use) first.
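
As a rough sanity check, the sketch below estimates how many bytes of x values one packed row would hold for a given buffer size. It assumes each value is stored as an 8-byte double and ignores array headers, the dependent variable, and any working copies made during execution, so it is a lower bound and not a measurement. The function name packed_x_bytes is hypothetical.

```python
BYTES_PER_VALUE = 8    # assumption: each x value is an 8-byte double
ONE_GB = 1024 ** 3     # the 1GB per-cell limit mentioned in the issue


def packed_x_bytes(buffer_size, dimension_of_x):
    """Lower-bound estimate of the x payload in one packed row, in bytes."""
    return buffer_size * dimension_of_x * BYTES_PER_VALUE


for size in (4_700_000, 7_500_000, 10_000_000):
    nbytes = packed_x_bytes(size, 11)
    print(f"{size:>10,d} rows -> {nbytes / ONE_GB:.2f} GiB of x values")
```

Under these assumptions, a 10 million row buffer with dimension_of_x=11 holds 880,000,000 bytes of x values, which is still below 1GB. That is consistent with the failure coming from a different memory limit than the 1GB-per-row one.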

 

Based on these observations, we propose the following:

1. Based on the formula in the current implementation, the default buffer size 
will not exceed 7.5 million, which means our formula should be safe for now.

2. If the user doesn't specify a buffer size, we calculate the default buffer size 
with the current formula and use it. If the user specifies a value, we don't 
calculate the default buffer size and just let it run. If it blows up, the 
user is responsible for tuning it (e.g. buffer_size, GPDB settings, etc.) to find a 
value that fits.
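
The selection logic in point 2 could look roughly like the sketch below. It is not the MADlib implementation: choose_buffer_size and compute_default are hypothetical names, and compute_default stands in for the existing formula, which is not reproduced here.

```python
def choose_buffer_size(user_buffer_size, compute_default):
    """Pick the buffer size as described in the proposal.

    user_buffer_size: value passed by the caller, or None if it was omitted.
    compute_default:  zero-argument callable standing in for the current
                      default formula (hypothetical placeholder).
    """
    if user_buffer_size is None:
        # No value given: fall back to the calculated default.
        return int(compute_default())
    # Value given: use it as is. The default is not computed and the value
    # is not capped; tuning is left to the user if the query fails.
    return int(user_buffer_size)


# Illustration only: the lambda returns the ~4.7 million default reported
# in the test run above instead of evaluating the real formula.
print(choose_buffer_size(None, lambda: 4_700_000))
print(choose_buffer_size(10_000_000, lambda: 4_700_000))
```

Passing the formula in as a callable keeps the default from being evaluated at all when the user supplies a value, which matches the "don't calculate the default" part of the proposal.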

> Select default buffer size for mini-batch preprocessor
> ------------------------------------------------------
>
>                 Key: MADLIB-1224
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1224
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Utilities
>            Reporter: Jingyi Mei
>            Priority: Major
>             Fix For: v1.14
>
>
> As a follow-up to https://issues.apache.org/jira/browse/MADLIB-1200
>  
> In minibatch_preprocessor, we made buffer_size an optional parameter. If 
> it is not set, a default value will be assigned. Current considerations 
> are:
>  # Within a segment, each cell has a 1GB limit, so we can't pack so many rows 
> into one super row that it exceeds the limit.
>  # Across segments, data should be distributed as evenly as possible to avoid 
> data skew, so that GPDB can work more efficiently.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
