[GitHub] madlib pull request #225: Added option for weighted average for both classif...
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/225#discussion_r161918108

--- Diff: src/ports/postgres/modules/knn/knn.sql_in ---
@@ -326,6 +331,39 @@ Result, with neighbors sorted from closest to furthest:
 (6 rows)
+
+-# Run KNN for classification using the
+weighted average:
+
+DROP TABLE IF EXISTS knn_result_classification;
+SELECT * FROM madlib.knn(
+    'knn_train_data',              -- Table of training data
+    'data',                        -- Col name of training data
+    'id',                          -- Col name of id in train data
+    'label',                       -- Training labels
+    'knn_test_data',               -- Table of test data
+    'data',                        -- Col name of test data
+    'id',                          -- Col name of id in test data
+    'knn_result_classification',   -- Output table
+    3,                             -- Number of nearest neighbors
+    True,                          -- True to list nearest-neighbors by id
+    'madlib.squared_dist_norm2',   -- Distance function
+    True                           -- For weighted average
+);
+SELECT * FROM knn_result_classification ORDER BY id;
+
+
+ id |  data   |     prediction      | k_nearest_neighbours
+----+---------+---------------------+----------------------
+  1 | {2,1}   |                 2.2 | {1,2,3}
+  2 | {2,6}   |               0.425 | {3,4,5}
+  3 | {15,40} |  0.0174339622641509 | {5,6,7}
+  4 | {12,1}  |  0.0379633360193392 | {3,4,5}
+  5 | {2,90}  | 0.00306428140577315 | {6,7,9}
+  6 | {50,45} | 0.00214165229166379 | {6,7,8}
+(6 rows)
+
+
--- End diff --

I got the following error for this example (was running on Greenplum 5):
```
greenplum=# DROP TABLE IF EXISTS knn_result_classification;
NOTICE: table "knn_result_classification" does not exist, skipping
DROP TABLE
greenplum=# SELECT * FROM madlib.knn(
greenplum(#     'knn_train_data',              -- Table of training data
greenplum(#     'data',                        -- Col name of training data
greenplum(#     'id',                          -- Col name of id in train data
greenplum(#     'label',                       -- Training labels
greenplum(#     'knn_test_data',               -- Table of test data
greenplum(#     'data',                        -- Col name of test data
greenplum(#     'id',                          -- Col name of id in test data
greenplum(#     'knn_result_classification',   -- Output table
greenplum(#     3,                             -- Number of nearest neighbors
greenplum(#     True,                          -- True to list nearest-neighbors by id
greenplum(#     'madlib.squared_dist_norm2',   -- Distance function
greenplum(#     True                           -- For weighted average
greenplum(# );
ERROR: plpy.SPIError: function expression in FROM cannot refer to other relations of same query level
LINE 15: a , unnest(k_nearest_neighbours)...
             ^
QUERY:
    CREATE TABLE knn_result_classification AS
    SELECT id, data ,max(prediction) as prediction ,
           array_agg(distinct k_neighbours) AS k_nearest_neighbours
    FROM (
        SELECT __madlib_temp_test_id_temp29900589_1516144312_53639332__ AS id,
               data ,sum(1/dist) AS prediction ,
               array_agg(knn_temp.train_id ORDER BY knn_temp.dist ASC) AS k_nearest_neighbours
        FROM pg_temp.__madlib_temp_interim_table75130626_1516144312_10216040__ AS knn_temp
             JOIN knn_test_data AS knn_test
             ON knn_temp.__madlib_temp_test_id_temp29900589_1516144312_53639332__ = knn_test.id
        GROUP BY __madlib_temp_test_id_temp29900589_1516144312_53639332__ ,
                 data, __madlib_temp_label_col_temp66682446_1516144312_5242078__) a ,
        unnest(k_nearest_neighbours) as k_neighbours
    GROUP BY id, data
CONTEXT: Traceback (most recent call last):
  PL/Python function "knn", line 36, in weighted_avg
  PL/Python function "knn", line 242, in knn
PL/Python function "knn"
```
This might be because some functions/features available in Postgres-9.x are not available in Greenplum. So we
[GitHub] madlib pull request #225: Added option for weighted average for both classif...
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/225#discussion_r161917948

--- Diff: src/ports/postgres/modules/knn/knn.sql_in ---
@@ -412,7 +451,8 @@ CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.knn(
     output_table,
     k,
     output_neighbors,
-    fn_dist
+    fn_dist,
+    weighted_avg
--- End diff --

Two overloaded functions are missing (one seems to be an old issue, and the other is due to this PR):

1.
```
madlib.knn(point_source, point_column_name, point_id, label_column_name,
           test_source, test_column_name, test_id, output_table, k,
           output_neighbors, fn_dist )
```
2.
```
madlib.knn(point_source, point_column_name, point_id, label_column_name,
           test_source, test_column_name, test_id, output_table, k,
           output_neighbors )
```

The first one is a call which does not have the last param specified, and the second function misses both the last two optional params. This should take in default values and work, but it currently fails.

I just ran through the examples in the user docs, and got the following error for one of the examples:
```
greenplum=# SELECT * FROM madlib.knn(
greenplum(#     'knn_train_data',              -- Table of training data
greenplum(#     'data',                        -- Col name of training data
greenplum(#     'id',                          -- Col name of id in train data
greenplum(#     'label',                       -- Training labels
greenplum(#     'knn_test_data',               -- Table of test data
greenplum(#     'data',                        -- Col name of test data
greenplum(#     'id',                          -- Col name of id in test data
greenplum(#     'knn_result_classification',   -- Output table
greenplum(#     3,                             -- Number of nearest neighbors
greenplum(#     True,                          -- True to list nearest-neighbors by id
greenplum(#     'madlib.squared_dist_norm2'    -- Distance function
greenplum(# );
ERROR: function madlib.knn(unknown, unknown, unknown, unknown, unknown, unknown, unknown, unknown, integer, boolean, unknown) does not exist
LINE 1: SELECT * FROM madlib.knn(
                      ^
HINT: No function matches the given name and argument types. You might need to add explicit type casts.
```
---
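The missing-overload problem described above can be illustrated with a small Python sketch. It is only an analogy for SQL signature lookup, not MADlib code: `knn_full`, `call_knn`, `register_default_overloads`, and both default values are hypothetical names and assumptions chosen for illustration.

```python
# Assumed defaults for the two optional trailing parameters (illustrative only).
DEFAULT_FN_DIST = "madlib.squared_dist_norm2"
DEFAULT_WEIGHTED_AVG = False


def knn_full(point_source, point_column_name, point_id, label_column_name,
             test_source, test_column_name, test_id, output_table, k,
             output_neighbors, fn_dist, weighted_avg):
    """Stand-in for the 12-argument function; it only echoes two arguments."""
    return {"fn_dist": fn_dist, "weighted_avg": weighted_avg}


# Overloads keyed by argument count, mimicking lookup by signature.
overloads = {12: knn_full}


def call_knn(*args):
    """Dispatch on argument count; fail when no matching overload exists."""
    try:
        fn = overloads[len(args)]
    except KeyError:
        raise LookupError(
            "function knn with %d arguments does not exist" % len(args))
    return fn(*args)


def register_default_overloads():
    """Add thin wrappers that forward to the full function with defaults."""
    overloads[11] = lambda *a: knn_full(*a, DEFAULT_WEIGHTED_AVG)
    overloads[10] = lambda *a: knn_full(*a, DEFAULT_FN_DIST,
                                        DEFAULT_WEIGHTED_AVG)
```

Before `register_default_overloads()` runs, a 10- or 11-argument call fails in the same way as the `does not exist` error in the session above; afterwards the shorter calls fall through to the full function with the assumed defaults.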
[GitHub] madlib issue #226: Update MADlib version to dev
Github user asfgit commented on the issue: https://github.com/apache/madlib/pull/226 Refer to this link for build results (access rights to CI server needed): https://builds.apache.org/job/madlib-pr-build/325/ ---
Re: MADlib 1.13 community call, Jan 17 @ 1100 PST /1900 GMT
REMINDER: The MADlib team invites you to our 17 January community call (1100 PST/1900 GMT) to discuss the new features and improvements 1.13 release (webinar: https://pivotal.zoom.us/j/868705895):

* New module: Graph - HITS (MADLIB-1124, MADLIB-1151)
* k-NN:
    - Added additional distance metrics (MADLIB-1059)
    - Added list of neighbors in output table (MADLIB-1129)
* MLP: Added grouping support (MADLIB-1149)
* Cross Validation: Improved the stats reporting in output table (MADLIB-1169)
* Correlation: Improved quality of results by ignoring only a NULL value and not the whole row containing the NULL (MADLIB-1166)
* Multiple bug fixes

See you then!

Bob Glithero | Data Product Marketing
Pivotal, Inc.
rglith...@pivotal.io | m: 415.483.5220

On Mon, Jan 8, 2018 at 10:57 AM, Robert Glithero wrote:

> The MADlib team invites you to our 7 Sep community call to discuss the
> new features and improvements 1.13 release (webinar:
> https://pivotal.zoom.us/j/868705895):
>
> * New module: Graph - HITS (MADLIB-1124, MADLIB-1151)
> * k-NN:
>     - Added additional distance metrics (MADLIB-1059)
>     - Added list of neighbors in output table (MADLIB-1129)
> * MLP: Added grouping support (MADLIB-1149)
> * Cross Validation: Improved the stats reporting in output table
>   (MADLIB-1169)
> * Correlation: Improved quality of results by ignoring only a NULL value
>   and not the whole row containing the NULL (MADLIB-1166)
> * Multiple bug fixes
>
> See you then!
>
> Bob Glithero | Data Product Marketing
> Pivotal, Inc.
> rglith...@pivotal.io
>
[GitHub] madlib issue #225: Added option for weighted average for both classification...
Github user asfgit commented on the issue: https://github.com/apache/madlib/pull/225 Refer to this link for build results (access rights to CI server needed): https://builds.apache.org/job/madlib-pr-build/324/ ---
[GitHub] madlib pull request #225: Added option for weighted average for both classif...
GitHub user hpandeycodeit opened a pull request:

https://github.com/apache/madlib/pull/225

Added option for weighted average for both classification and regress…

Added option for weighted average for both classification and regression Models. Jira#1181

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/hpandeycodeit/incubator-madlib knn_dev_1181

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/225.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #225

commit b9d54d56b960f931f7ec4c7ce4525579be9f823c
Author: hpandeycodeit
Date: 2018-01-16T17:32:41Z

Added option for weighted average for both classification and regression models

---
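For readers unfamiliar with the feature, here is a minimal Python sketch of what a weighted average over the k nearest neighbors can look like. It assumes inverse-distance weights (the generated query quoted in the review thread uses `sum(1/dist)`), and the helper names `weighted_knn_classify` and `weighted_knn_regress` are hypothetical; this is not the PR's implementation.

```python
from collections import defaultdict


def weighted_knn_classify(neighbors):
    """Pick the label with the largest total inverse-distance weight.

    neighbors: list of (label, dist) pairs for the k nearest training
    points. Assumes every dist is strictly positive.
    """
    weights = defaultdict(float)
    for label, dist in neighbors:
        weights[label] += 1.0 / dist
    return max(weights, key=weights.get)


def weighted_knn_regress(neighbors):
    """Inverse-distance weighted mean of the neighbors' target values.

    neighbors: list of (value, dist) pairs. Assumes every dist is
    strictly positive.
    """
    numerator = sum(value / dist for value, dist in neighbors)
    denominator = sum(1.0 / dist for _, dist in neighbors)
    return numerator / denominator
```

With weighting, one very close neighbor can outvote two distant ones, which is the behavioral difference from a plain majority vote. A zero distance (a test point identical to a training point) would need separate handling that this sketch leaves out.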
[GitHub] madlib issue #223: Balance datasets : re-sampling technique
Github user fmcquillan99 commented on the issue:

https://github.com/apache/madlib/pull/223

Regarding (2) and (3) above, looks like it does not fail with `'red:7, blue:7'` but the MADlib convention is 'red=7, blue=7' so need to change to use `=`.

(4) Seems to take only the 1st param in
```
DROP TABLE IF EXISTS output_table;
SELECT madlib.balance_sample(
    'flags',             -- Source table
    'output_table',      -- Output table
    'mainhue',           -- Class column
    'red:7, blue:7');    -- Want 7 reds and 7 blues
SELECT * FROM output_table ORDER BY mainhue, name;
```
which produces 7 red but leaves 5 blue (should be 7)
```
 id |    name     | landmass | zone | area | population | language | colours | mainhue
----+-------------+----------+------+------+------------+----------+---------+---------
  1 | Argentina   |        2 |    3 | 2777 |         28 |        2 |       2 | blue
  2 | Australia   |        6 |    2 | 7690 |         15 |        1 |       3 | blue
  8 | Greece      |        3 |    1 |  132 |         10 |        6 |       2 | blue
  9 | Guatemala   |        1 |    4 |  109 |          8 |        2 |       2 | blue
 17 | Sweden      |        3 |    1 |  450 |          8 |        6 |       2 | blue
  4 | Brazil      |        2 |    3 | 8512 |        119 |        6 |       4 | green
 11 | Jamaica     |        1 |    4 |   11 |          2 |        1 |       3 | green
 13 | Mexico      |        1 |    4 | 1973 |         77 |        2 |       4 | green
  3 | Austria     |        3 |    1 |   84 |          8 |        4 |       2 | red
  5 | Canada      |        1 |    4 | 9976 |         24 |        1 |       2 | red
  7 | Denmark     |        3 |    1 |   43 |          5 |        6 |       2 | red
 12 | Luxembourg  |        3 |    1 |    3 |          0 |        4 |       3 | red
 15 | Portugal    |        3 |    4 |   92 |         10 |        6 |       5 | red
 18 | Switzerland |        3 |    1 |   41 |          6 |        4 |       2 | red
 19 | UK          |        3 |    4 |  245 |         56 |        1 |       3 | red
 10 | Ireland     |        3 |    4 |   70 |          3 |        1 |       3 | white
 20 | USA         |        1 |    4 | 9363 |        231 |        1 |       3 | white
(17 rows)
```
---
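To make the expected behavior concrete, here is a small Python sketch of parsing a class-size string in the `name=value` convention mentioned above, keeping every entry rather than only the first. `parse_class_sizes` is a hypothetical helper written for illustration; it is not the parser used in the pull request.

```python
def parse_class_sizes(spec, sep="="):
    """Parse a string such as 'red=7, blue=7' into {'red': 7, 'blue': 7}.

    Every comma-separated entry is kept, so a second class is not
    silently dropped. An entry without the separator raises ValueError.
    """
    sizes = {}
    for item in spec.split(","):
        name, found, count = item.partition(sep)
        if not found:
            raise ValueError(
                "expected 'class%svalue' but got %r" % (sep, item.strip()))
        sizes[name.strip()] = int(count)
    return sizes
```

Under this sketch, `parse_class_sizes('red=7, blue=7')` yields a target of 7 for both classes, while the colon form `'red:7, blue:7'` is rejected unless `sep=":"` is passed explicitly.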
[GitHub] madlib pull request #223: Balance datasets : re-sampling technique
Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/223#discussion_r161297926

--- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---
@@ -0,0 +1,994 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+import math
+import plpy
+import re
+from collections import defaultdict
+from fractions import Fraction
+from utilities.control import MinWarning
+from utilities.utilities import _assert
+from utilities.utilities import unique_string
+from utilities.validate_args import table_exists
+from utilities.validate_args import columns_exist_in_table
+from utilities.validate_args import table_is_empty
+from utilities.validate_args import get_cols
+from utilities.utilities import py_list_to_sql_string
+
+
+m4_changequote(`')
+
+def balance_sample(schema_madlib, source_table, output_table, class_col,
+    class_sizes, output_table_size, grouping_cols, with_replacement, **kwargs):
+
+    """
+    Balance sampling function
+    Args:
+        @param source_table       Input table name.
+        @param output_table       Output table name.
+        @param class_col          Name of the column containing the class to be
+                                  balanced.
+        @param class_size         Parameter to define the size of the different
+                                  class values.
+        @param output_table_size  Desired size of the output data set.
+        @param grouping_cols      The columns columns that defines the grouping.
+        @param with_replacement   The sampling method.
+
+    """
+    with MinWarning("warning"):
+
+        class_counts = unique_string(desp='class_counts')
+        desired_sample_per_class = unique_string(desp='desired_sample_per_class')
+        desired_counts = unique_string(desp='desired_counts')
+
+        if not class_sizes or class_sizes.strip().lower() in ('null', ''):
+            class_sizes = 'uniform'
+
+        _validate_strs(source_table, output_table, class_col, class_sizes,
+            output_table_size, grouping_cols, with_replacement)
+
+        source_table_columns = ','.join(get_cols(source_table))
+        grp_by = "GROUP BY {0}".format(class_col)
+
+        _create_frequency_distribution(class_counts, source_table, class_col)
+        temp_views = [class_counts]
+
+        if class_sizes.lower() == 'undersample' and not with_replacement:
+            """
+            Random undersample without replacement.
+            Randomly order the rows and give a unique (per class)
+            identifier to each one.
+            Select rows that have identifiers under the target limit.
+            """
+            _undersampling_with_no_replacement(source_table, output_table, class_col,
+                class_sizes, output_table_size, grouping_cols, with_replacement,
+                class_counts, source_table_columns)
+
+            _delete_temp_views(temp_views)
+            return
+
+        """
+        Create views for true and desired sample sizes of classes
+        """
+        """
+        include_unsampled_classes tracks is unsampled classes are desired or not.
+        include_unsampled_classes is always true in output_table_size Null cases but changes given values of desired sample class sizes in comma-delimited classsize paramter.
+        """
+        include_unsampled_classes = True
+        sampling_with_comma_delimited_class_sizes = class_sizes.find(':') > 0
+
+        if sampling_with_comma_delimited_class_sizes:
+            """
+            Compute sample sizes based on
+            comman-delimited list of class_sizes
+            and/or output_table_size
+            """
+            class_sizes, include_unsampled_classes =
[GitHub] madlib pull request #223: Balance datasets : re-sampling technique
Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/223#discussion_r161850906

--- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---
[GitHub] madlib pull request #223: Balance datasets : re-sampling technique
Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/223#discussion_r161299042

--- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---
[GitHub] madlib pull request #223: Balance datasets : re-sampling technique
Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/223#discussion_r161296957

--- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---
@@ -0,0 +1,994 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+import math
+import plpy
+import re
+from collections import defaultdict
+from fractions import Fraction
+from utilities.control import MinWarning
+from utilities.utilities import _assert
+from utilities.utilities import unique_string
+from utilities.validate_args import table_exists
+from utilities.validate_args import columns_exist_in_table
+from utilities.validate_args import table_is_empty
+from utilities.validate_args import get_cols
+from utilities.utilities import py_list_to_sql_string
+
+
+m4_changequote(`')
+
+def balance_sample(schema_madlib, source_table, output_table, class_col,
+    class_sizes, output_table_size, grouping_cols, with_replacement, **kwargs):
+
+    """
+    Balance sampling function
+    Args:
+        @param source_table       Input table name.
+        @param output_table       Output table name.
+        @param class_col          Name of the column containing the class to be
+                                  balanced.
+        @param class_size         Parameter to define the size of the different
+                                  class values.
+        @param output_table_size  Desired size of the output data set.
+        @param grouping_cols      The columns columns that defines the grouping.
+        @param with_replacement   The sampling method.
+
+    """
+    with MinWarning("warning"):
+
+        class_counts = unique_string(desp='class_counts')
+        desired_sample_per_class = unique_string(desp='desired_sample_per_class')
+        desired_counts = unique_string(desp='desired_counts')
+
+        if not class_sizes or class_sizes.strip().lower() in ('null', ''):
+            class_sizes = 'uniform'
+
+        _validate_strs(source_table, output_table, class_col, class_sizes,
+            output_table_size, grouping_cols, with_replacement)
+
+        source_table_columns = ','.join(get_cols(source_table))
+        grp_by = "GROUP BY {0}".format(class_col)
+
+        _create_frequency_distribution(class_counts, source_table, class_col)
+        temp_views = [class_counts]
+
+        if class_sizes.lower() == 'undersample' and not with_replacement:
+            """
+            Random undersample without replacement.
+            Randomly order the rows and give a unique (per class)
+            identifier to each one.
+            Select rows that have identifiers under the target limit.
+            """
+            _undersampling_with_no_replacement(source_table, output_table, class_col,
+                class_sizes, output_table_size, grouping_cols, with_replacement,
+                class_counts, source_table_columns)
+
+            _delete_temp_views(temp_views)
+            return
+
+        """
+        Create views for true and desired sample sizes of classes
+        """
+        """
+        include_unsampled_classes tracks is unsampled classes are desired or not.
+        include_unsampled_classes is always true in output_table_size Null cases but changes given values of desired sample class sizes in comma-delimited classsize paramter.
--- End diff --

is -> if ?

---
[GitHub] madlib pull request #223: Balance datasets : re-sampling technique
Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/223#discussion_r161297074

--- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---
@@ -0,0 +1,994 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+import math
+import plpy
+import re
+from collections import defaultdict
+from fractions import Fraction
+from utilities.control import MinWarning
+from utilities.utilities import _assert
+from utilities.utilities import unique_string
+from utilities.validate_args import table_exists
+from utilities.validate_args import columns_exist_in_table
+from utilities.validate_args import table_is_empty
+from utilities.validate_args import get_cols
+from utilities.utilities import py_list_to_sql_string
+
+
+m4_changequote(`')
+
+def balance_sample(schema_madlib, source_table, output_table, class_col,
+    class_sizes, output_table_size, grouping_cols, with_replacement, **kwargs):
+
+    """
+    Balance sampling function
+    Args:
+        @param source_table       Input table name.
+        @param output_table       Output table name.
+        @param class_col          Name of the column containing the class to be
+                                  balanced.
+        @param class_size         Parameter to define the size of the different
+                                  class values.
+        @param output_table_size  Desired size of the output data set.
+        @param grouping_cols      The columns columns that defines the grouping.
+        @param with_replacement   The sampling method.
+
+    """
+    with MinWarning("warning"):
+
+        class_counts = unique_string(desp='class_counts')
+        desired_sample_per_class = unique_string(desp='desired_sample_per_class')
+        desired_counts = unique_string(desp='desired_counts')
+
+        if not class_sizes or class_sizes.strip().lower() in ('null', ''):
+            class_sizes = 'uniform'
+
+        _validate_strs(source_table, output_table, class_col, class_sizes,
+            output_table_size, grouping_cols, with_replacement)
+
+        source_table_columns = ','.join(get_cols(source_table))
+        grp_by = "GROUP BY {0}".format(class_col)
+
+        _create_frequency_distribution(class_counts, source_table, class_col)
+        temp_views = [class_counts]
+
+        if class_sizes.lower() == 'undersample' and not with_replacement:
+            """
+            Random undersample without replacement.
+            Randomly order the rows and give a unique (per class)
+            identifier to each one.
+            Select rows that have identifiers under the target limit.
+            """
+            _undersampling_with_no_replacement(source_table, output_table, class_col,
+                class_sizes, output_table_size, grouping_cols, with_replacement,
+                class_counts, source_table_columns)
+
+            _delete_temp_views(temp_views)
+            return
+
+        """
+        Create views for true and desired sample sizes of classes
+        """
+        """
+        include_unsampled_classes tracks is unsampled classes are desired or not.
+        include_unsampled_classes is always true in output_table_size Null cases but changes given values of desired sample class sizes in comma-delimited classsize paramter.
+        """
+        include_unsampled_classes = True
+        sampling_with_comma_delimited_class_sizes = class_sizes.find(':') > 0
+
+        if sampling_with_comma_delimited_class_sizes:
+            """
+            Compute sample sizes based on
+            comman-delimited list of class_sizes
--- End diff --

comman -> comma

---
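The "random undersample without replacement" step described in the docstring of the diff above (randomly order the rows per class, then keep those under a target limit) can be sketched in plain Python. This is an in-memory illustration, not the module's SQL implementation; the `undersample` helper and the choice of the smallest class size as the target limit are assumptions.

```python
import random
from collections import defaultdict


def undersample(rows, class_of, rng=None):
    """Return a balanced sample with no row repeated.

    rows:     list of records
    class_of: function mapping a record to its class value
    rng:      optional random.Random instance for reproducibility
    """
    rng = rng or random.Random()
    by_class = defaultdict(list)
    for row in rows:
        by_class[class_of(row)].append(row)

    # Assumed target limit: the size of the smallest class.
    target = min(len(members) for members in by_class.values())

    sample = []
    for members in by_class.values():
        shuffled = members[:]          # copy so the input is not mutated
        rng.shuffle(shuffled)          # random order within the class
        sample.extend(shuffled[:target])  # keep positions below the limit
    return sample
```

Shuffling and slicing plays the role of assigning each row a random per-class identifier and filtering on it, so every class ends up with exactly `target` rows.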
[GitHub] madlib pull request #223: Balance datasets : re-sampling technique
Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/223#discussion_r161300298

--- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---
[GitHub] madlib pull request #223: Balance datasets : re-sampling technique
Github user orhankislal commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/223#discussion_r161845440

    --- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---
    @@ -0,0 +1,994 @@
[GitHub] madlib pull request #223: Balance datasets : re-sampling technique
Github user orhankislal commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/223#discussion_r161863965

    --- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---
    @@ -0,0 +1,994 @@
[GitHub] madlib pull request #223: Balance datasets : re-sampling technique
Github user orhankislal commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/223#discussion_r161864354

    --- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---
    @@ -0,0 +1,994 @@
[GitHub] madlib pull request #223: Balance datasets : re-sampling technique
Github user orhankislal commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/223#discussion_r161865238

    --- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---
    @@ -0,0 +1,994 @@