[ https://issues.apache.org/jira/browse/MADLIB-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16364602#comment-16364602 ]
Nandish Jayaram commented on MADLIB-1200:
-----------------------------------------

A suggestion regarding the buffer_size:
We could consider making it a factor of the page size rather than the 1GB limit. By page size I mean the default page size that the underlying database uses while fetching data from the disk (4 MB?). This might result in packing fewer rows together, but that may be a good thing from the user's perspective. If we pack around 10000 rows into one, a SELECT query on that table, even with LIMIT 1, takes a long time to load. With 1GB-based factoring, we might be able to pack hundreds of thousands of rows into one, which I guess would make the table even harder for the user to inspect.

> Pre-processing helper function for mini-batching
> -------------------------------------------------
>
>                 Key: MADLIB-1200
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1200
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Utilities
>            Reporter: Frank McQuillan
>            Priority: Major
>             Fix For: v1.14
>
>
> Related to
> https://issues.apache.org/jira/browse/MADLIB-1037
> https://issues.apache.org/jira/browse/MADLIB-1048
> Story
> {{As a}}
> data scientist
> {{I want to}}
> pre-process input files for use with mini-batching
> {{so that}}
> the optimization part of MLP, SVM, etc. runs faster when I do multiple runs,
> perhaps because I am tuning parameters (i.e., pre-processing is an occasional
> operation that I don't want to re-do every time that I train a model)
> Interface
> {code:java}
> minibatch_preprocessor (
>     source_table,         -- Name of the table containing the input data.
>     output_table,         -- Name of the table suitable for mini-batching.
>     dependent_varname,    -- Name of the dependent variable column.
>     independent_varname,  -- Expression list to evaluate for the independent variables.
>     buffer_size           -- buffer_size? Default should be to pack as much as possible in the 1GB limit imposed by postgres/gpdb.
> )
> {code}
>
> The main purpose of the function is to prepare the training data for
> minibatching algorithms. This will be achieved in 2 stages:
> # Based on the batch size, group all the dependent and independent variables
> in a single tuple representative of the batch.
> # If the independent variables are boolean or text, perform one-hot
> encoding. N/A for integers and floats. Note that if the integer vars are
> actually categorical, they must be cast to ::TEXT so that they get encoded.
> Notes
> 1) Random shuffle needed for mini-batch.
> 2) Naive approach may be OK to start, not worth big investment to make run
> 10% or 20% faster.
> Acceptance
> Summary
> 1) Convert from standard to special format for mini-batching
> 2) Standardize by default; for now the user cannot opt out of it. We may
> decide to add a flag later.
> 3) Some scale testing OK (does not need to be comprehensive)
> 4) Document as a helper function in the user docs
> 5) Always ignore nulls in dependent variable
> 6) IC
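
For discussion, the steps described in the issue (ignore NULL dependents, shuffle, standardize, one-hot encode boolean/text independent variables, pack rows into buffers) could be sketched in plain Python roughly as below. This is only an illustration and not the MADlib implementation: the names `encode_column` and `minibatch_preprocess`, the `seed` parameter, and the choice to treat `buffer_size` as a row count are all assumptions made for the example. Deriving `buffer_size` from the page size or the 1GB limit is not shown.

```python
import random
from statistics import mean, pstdev


def encode_column(values):
    """Return one list of floats per row for a single independent column."""
    if all(isinstance(v, (bool, str)) for v in values):
        # Boolean/text column: one-hot encode over the sorted distinct values.
        categories = sorted(set(values), key=str)
        return [[1.0 if v == c else 0.0 for c in categories] for v in values]
    # Numeric column: standardize to zero mean and unit variance.
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # constant column: avoid division by zero
    return [[(v - mu) / sigma] for v in values]


def minibatch_preprocess(rows, buffer_size, seed=None):
    """Pack (dependent, independent_list) rows into buffers.

    Hypothetical helper: buffer_size is assumed to be a number of rows.
    Returns a list of (dependent_values, feature_vectors) tuples, one per buffer.
    """
    # Always ignore rows whose dependent variable is NULL (None here).
    rows = [(y, x) for y, x in rows if y is not None]
    if not rows:
        return []
    # Random shuffle, as required for mini-batching.
    random.Random(seed).shuffle(rows)
    # Encode column by column, then stitch the pieces back together per row.
    columns = list(zip(*(x for _, x in rows)))
    encoded = [encode_column(list(col)) for col in columns]
    features = [sum((enc[i] for enc in encoded), []) for i in range(len(rows))]
    targets = [y for y, _ in rows]
    # Group buffer_size rows into a single tuple representing one batch.
    return [
        (targets[i:i + buffer_size], features[i:i + buffer_size])
        for i in range(0, len(rows), buffer_size)
    ]


# Example input: one numeric and one text independent variable per row.
example_rows = [
    (1.0, [0.5, "red"]),
    (0.0, [1.5, "blue"]),
    (None, [9.9, "red"]),  # dropped: NULL dependent variable
    (1.0, [2.5, "red"]),
    (0.0, [3.5, "green"]),
    (1.0, [4.5, "blue"]),
]
example_batches = minibatch_preprocess(example_rows, buffer_size=2, seed=42)
```

With a small row-count `buffer_size` like this, each packed tuple stays cheap to inspect, which is the trade-off raised in the comment above; a byte-based limit would need an estimate of the encoded row width instead.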