[ 
https://issues.apache.org/jira/browse/MADLIB-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16037855#comment-16037855
 ] 

ASF GitHub Bot commented on MADLIB-1117:
----------------------------------------

GitHub user iyerr3 opened a pull request:

    https://github.com/apache/incubator-madlib/pull/138

    Summary: Add param to determine num of cols per run

    JIRA: MADLIB-1117
    
    Summary used a hard-coded limit of 15 columns per run.
    This limit was put in place to avoid out-of-memory errors in most cases.
    It can, however, lengthen the run time unnecessarily, since a higher
    number of columns can be summarized in a single run for a simpler data
    set (one which leads to smaller sketch data structures).
    
    This commit adds a new parameter allowing users to set this limit,
    while retaining the old default of 15 columns.
    
    Closes #138
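
The effect of the per-run column limit on the number of runs can be
illustrated with a minimal sketch. This is not the MADlib implementation:
the names `column_batches`, `num_passes`, and `cols_per_pass` are
hypothetical, and the sketch only assumes that the target columns are split
into consecutive groups of at most the configured size, with 15 as the
default.

```python
import math
from typing import Iterator, List, Sequence

# Assumption: mirrors the old hard-coded limit described above.
DEFAULT_COLS_PER_PASS = 15


def column_batches(columns: Sequence[str],
                   cols_per_pass: int = DEFAULT_COLS_PER_PASS
                   ) -> Iterator[List[str]]:
    """Yield consecutive groups of at most cols_per_pass column names."""
    if cols_per_pass < 1:
        raise ValueError("cols_per_pass must be a positive integer")
    for start in range(0, len(columns), cols_per_pass):
        yield list(columns[start:start + cols_per_pass])


def num_passes(n_columns: int,
               cols_per_pass: int = DEFAULT_COLS_PER_PASS) -> int:
    """Number of runs needed to cover n_columns columns."""
    if cols_per_pass < 1:
        raise ValueError("cols_per_pass must be a positive integer")
    return math.ceil(n_columns / cols_per_pass)


if __name__ == "__main__":
    cols = [f"c{i}" for i in range(40)]
    # Default limit: 40 columns are covered in groups of 15, 15, and 10.
    print([len(batch) for batch in column_batches(cols)])
    # A larger limit covers the same 40 columns in a single run.
    print(num_passes(len(cols), cols_per_pass=40))
```

Under these assumptions, a 40-column table needs three runs at the default
limit and one run when the limit is raised to 40, which is the run-time
saving the new parameter is meant to allow when memory permits.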

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/iyerr3/incubator-madlib feature/summary_add_parameter

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-madlib/pull/138.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #138
    
----
commit 1cca783b63111d004662f314cef67e9be8bb9a92
Author: Rahul Iyer <[email protected]>
Date:   2017-06-05T23:36:50Z

    Summary: Add param to determine num of cols per run
    

----


> Add "columns to process per pass" as an optional param for summary()
> --------------------------------------------------------------------
>
>                 Key: MADLIB-1117
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1117
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Sketch-based Estimators
>            Reporter: Frank McQuillan
>            Assignee: Rahul Iyer
>            Priority: Minor
>             Fix For: v1.12
>
>
> Context
> The summary() function
> http://madlib.incubator.apache.org/docs/latest/group__grp__summary.html
> currently processes 15 columns per pass to keep memory usage below a 1 GB 
> limit.  This limit is somewhat arbitrary, since memory usage depends on many 
> things, including the data set and which params in summary() are set.  If more 
> columns could be processed per pass, summary() would run faster.
> Story
> As a MADlib developer, I want to add "columns to process per pass" as an 
> optional param for the summary() function.  Default: use 15 columns (which is the 
> current setting).  Suggested param name:  "columns_per_pass" though if you 
> have a better name, that's fine.
> Acceptance
> 1) Add new optional parameter and update docs.  Please add a note so it is 
> clear what this control does.
> 2) Write and pass tests.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
