date:20180116

[GitHub] madlib pull request #225: Added option for weighted average for both classif...

2018-01-16 Thread njayaram2

Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/225#discussion_r161918108
  
--- Diff: src/ports/postgres/modules/knn/knn.sql_in ---
@@ -326,6 +331,39 @@ Result, with neighbors sorted from closest to furthest:
 (6 rows)
 
 
+
+-#   Run KNN for classification using the weighted average:
+
+DROP TABLE IF EXISTS knn_result_classification;
+SELECT * FROM madlib.knn(
+'knn_train_data',  -- Table of training data
+'data',-- Col name of training data
+'id',  -- Col name of id in train data
+'label',   -- Training labels
+'knn_test_data',   -- Table of test data
+'data',-- Col name of test data
+'id',  -- Col name of id in test data
+'knn_result_classification',  -- Output table
+ 3,-- Number of nearest neighbors
+ True, -- True to list nearest-neighbors by id
+ 'madlib.squared_dist_norm2', -- Distance function
+ True -- For weighted average
+);
+SELECT * FROM knn_result_classification ORDER BY id;
+
+
+ id |  data   |     prediction      | k_nearest_neighbours
+----+---------+---------------------+----------------------
+  1 | {2,1}   |                 2.2 | {1,2,3}
+  2 | {2,6}   |               0.425 | {3,4,5}
+  3 | {15,40} |  0.0174339622641509 | {5,6,7}
+  4 | {12,1}  |  0.0379633360193392 | {3,4,5}
+  5 | {2,90}  | 0.00306428140577315 | {6,7,9}
+  6 | {50,45} | 0.00214165229166379 | {6,7,8}
+(6 rows)
+
+
--- End diff --

I got the following error for this example (was running on Greenplum 5):
```
greenplum=# DROP TABLE IF EXISTS knn_result_classification;
NOTICE:  table "knn_result_classification" does not exist, skipping
DROP TABLE
greenplum=# SELECT * FROM madlib.knn(
greenplum(# 'knn_train_data',  -- Table of training data
greenplum(# 'data',  -- Col name of training data
greenplum(# 'id',  -- Col name of id in train data
greenplum(# 'label',  -- Training labels
greenplum(# 'knn_test_data',  -- Table of test data
greenplum(# 'data',  -- Col name of test data
greenplum(# 'id',  -- Col name of id in test data
greenplum(# 'knn_result_classification',  -- Output table
greenplum(#  3,  -- Number of nearest neighbors
greenplum(#  True,  -- True to list nearest-neighbors by id
greenplum(#  'madlib.squared_dist_norm2',  -- Distance function
greenplum(#  True  -- For weighted average
greenplum(# );
ERROR:  plpy.SPIError: function expression in FROM cannot refer to other relations of same query level
LINE 15: a , unnest(k_nearest_neighbours)...
^
QUERY:
CREATE TABLE knn_result_classification AS
SELECT id, data ,max(prediction) as prediction
, array_agg(distinct k_neighbours) AS k_nearest_neighbours
FROM
( SELECT __madlib_temp_test_id_temp29900589_1516144312_53639332__ AS id, data
,sum(1/dist) AS prediction
, array_agg(knn_temp.train_id ORDER BY knn_temp.dist ASC) AS k_nearest_neighbours
FROM pg_temp.__madlib_temp_interim_table75130626_1516144312_10216040__ AS knn_temp
JOIN
knn_test_data AS knn_test ON
knn_temp.__madlib_temp_test_id_temp29900589_1516144312_53639332__ = knn_test.id
GROUP BY __madlib_temp_test_id_temp29900589_1516144312_53639332__ ,
data, __madlib_temp_label_col_temp66682446_1516144312_5242078__)
a , unnest(k_nearest_neighbours) as k_neighbours
GROUP BY id, data

CONTEXT:  Traceback (most recent call last):
  PL/Python function "knn", line 36, in weighted_avg
  PL/Python function "knn", line 242, in knn
PL/Python function "knn"
```
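
For reference, the generated query in the error output computes `sum(1/dist)` per test point and label. The following is only a minimal Python sketch of inverse-distance weighted voting, written to illustrate that idea; it is not the PR's implementation, and the names `weighted_vote` and `eps` are hypothetical.

```python
from collections import defaultdict


def weighted_vote(neighbors, eps=1e-12):
    """Pick a label by inverse-distance weighting.

    neighbors: iterable of (label, dist) pairs for the k nearest points.
    eps: hypothetical guard against division by zero when dist == 0.
    """
    weights = defaultdict(float)
    for label, dist in neighbors:
        # Closer neighbors contribute more: weight = 1 / dist.
        weights[label] += 1.0 / max(dist, eps)
    # The label with the largest accumulated weight wins.
    return max(weights, key=weights.get)


# Two neighbors carry label 1, but the single label-0 neighbor is much closer.
neighbors = [(1, 4.0), (0, 1.0), (1, 2.0)]
print(weighted_vote(neighbors))
```

With these sample distances the weights are 0.75 for label 1 and 1.0 for label 0, so the weighted vote differs from a plain majority vote.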

This might be because some functions/features available in Postgres-9.x are 
not available in Greenplum. So we

[GitHub] madlib pull request #225: Added option for weighted average for both classif...

2018-01-16 Thread njayaram2

Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/225#discussion_r161917948
  
--- Diff: src/ports/postgres/modules/knn/knn.sql_in ---
@@ -412,7 +451,8 @@ CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.knn(
 output_table,
 k,
 output_neighbors,
-fn_dist
+fn_dist,
+weighted_avg
--- End diff --

Two overloaded functions are missing (one seems to be an old issue, and the 
other is due to this PR):
1.
```
madlib.knn(point_source,
point_column_name,
point_id,
label_column_name,
test_source,
test_column_name,
test_id,
output_table,
k,
output_neighbors,
fn_dist
)
```
2.
```
madlib.knn(point_source,
point_column_name,
point_id,
label_column_name,
test_source,
test_column_name,
test_id,
output_table,
k,
output_neighbors
)
```

The first is a call that omits the last parameter (`weighted_avg`), and the 
second omits both of the last two optional parameters (`fn_dist` and 
`weighted_avg`). Both calls should fall back to default values and work, but 
they currently fail.
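
As a rough illustration of the expected behavior, here is a Python sketch in which the shorter signatures forward to the full one with defaults filled in. The real overloads are SQL functions in `knn.sql_in`; the helper names `knn_12`, `knn_11`, `knn_10` and the default values below are assumptions made for this sketch only.

```python
# Assumed defaults, for illustration only; check the knn docs for the real ones.
DEFAULT_FN_DIST = "madlib.squared_dist_norm2"
DEFAULT_WEIGHTED_AVG = False


def knn_12(point_source, point_column_name, point_id, label_column_name,
           test_source, test_column_name, test_id, output_table,
           k, output_neighbors, fn_dist, weighted_avg):
    """Stand-in for the full 12-argument signature; returns resolved options."""
    return {"k": k, "output_neighbors": output_neighbors,
            "fn_dist": fn_dist, "weighted_avg": weighted_avg}


def knn_11(*args):
    """11-argument overload: weighted_avg is omitted and takes its default."""
    if len(args) != 11:
        raise TypeError("knn_11 expects 11 arguments, got %d" % len(args))
    return knn_12(*args, DEFAULT_WEIGHTED_AVG)


def knn_10(*args):
    """10-argument overload: fn_dist and weighted_avg both take defaults."""
    if len(args) != 10:
        raise TypeError("knn_10 expects 10 arguments, got %d" % len(args))
    return knn_11(*args, DEFAULT_FN_DIST)


required = ("knn_train_data", "data", "id", "label",
            "knn_test_data", "data", "id", "knn_result_classification",
            3, True)
print(knn_10(*required))
print(knn_11(*required, "madlib.squared_dist_norm2"))
```

Each shorter form delegates to the next longer one, so the defaults live in one place.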

I just ran through the examples in the user docs, and got the following 
error for one of the examples:
```
greenplum=# SELECT * FROM madlib.knn(
greenplum(# 'knn_train_data',  -- Table of training data
greenplum(# 'data',  -- Col name of training data
greenplum(# 'id',  -- Col name of id in train data
greenplum(# 'label',  -- Training labels
greenplum(# 'knn_test_data',  -- Table of test data
greenplum(# 'data',  -- Col name of test data
greenplum(# 'id',  -- Col name of id in test data
greenplum(# 'knn_result_classification',  -- Output table
greenplum(#  3,  -- Number of nearest neighbors
greenplum(#  True,  -- True to list nearest-neighbors by id
greenplum(#  'madlib.squared_dist_norm2'  -- Distance function
greenplum(# );
ERROR:  function madlib.knn(unknown, unknown, unknown, unknown, unknown, unknown, unknown, unknown, integer, boolean, unknown) does not exist
LINE 1: SELECT * FROM madlib.knn(
  ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
```



---

[GitHub] madlib issue #226: Update MADlib version to dev

2018-01-16 Thread asfgit

Github user asfgit commented on the issue:

https://github.com/apache/madlib/pull/226
  

Refer to this link for build results (access rights to CI server needed): 
https://builds.apache.org/job/madlib-pr-build/325/



---

Re: MADlib 1.13 community call, Jan 17 @ 1100 PST /1900 GMT

2018-01-16 Thread Robert Glithero

REMINDER:   The MADlib team invites you to our 17 January community call
(1100 PST/1900 GMT) to discuss the new features and improvements in the 1.13
release (webinar: https://pivotal.zoom.us/j/868705895):

* New module: Graph - HITS (MADLIB-1124, MADLIB-1151)
* k-NN:
- Added additional distance metrics (MADLIB-1059)
- Added list of neighbors in output table (MADLIB-1129)
* MLP: Added grouping support (MADLIB-1149)
* Cross Validation: Improved the stats reporting in output table
(MADLIB-1169)
* Correlation: Improved quality of results by ignoring only a NULL value and
not the whole row containing the NULL (MADLIB-1166)
* Multiple bug fixes

See you then!


Bob Glithero | Data Product Marketing
Pivotal, Inc.
rglith...@pivotal.io



On Mon, Jan 8, 2018 at 10:57 AM, Robert Glithero 
wrote:

> The MADlib team invites you to our  7 Sep community call to discuss the
> new features and improvements 1.13 release
> (webinar: https://pivotal.zoom.us/j/868705895):
>
> * New module: Graph - HITS (MADLIB-1124, MADLIB-1151)
> * k-NN:
> - Added additional distance metrics (MADLIB-1059)
> - Added list of neighbors in output table (MADLIB-1129)
> * MLP: Added grouping support (MADLIB-1149)
> * Cross Validation: Improved the stats reporting in output table
> (MADLIB-1169)
> * Correlation: Improved quality of results by ignoring only a NULL value
> and
> not the whole row containing the NULL (MADLIB-1166)
> * Multiple bug fixes
>
> See you then!
>
>
> Bob Glithero | Data Product Marketing
> Pivotal, Inc.
> rglith...@pivotal.io
>

[GitHub] madlib issue #225: Added option for weighted average for both classification...

2018-01-16 Thread asfgit

Github user asfgit commented on the issue:

https://github.com/apache/madlib/pull/225
  

Refer to this link for build results (access rights to CI server needed): 
https://builds.apache.org/job/madlib-pr-build/324/



---

[GitHub] madlib pull request #225: Added option for weighted average for both classif...

2018-01-16 Thread hpandeycodeit

GitHub user hpandeycodeit opened a pull request:

https://github.com/apache/madlib/pull/225

Added option for weighted average for both classification and regress…

Added option for weighted average for both classification and regression 
Models. Jira#1181

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/hpandeycodeit/incubator-madlib knn_dev_1181

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/225.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #225


commit b9d54d56b960f931f7ec4c7ce4525579be9f823c
Author: hpandeycodeit 
Date:   2018-01-16T17:32:41Z

Added option for weighted average for both classification and regression 
models




---

[GitHub] madlib issue #223: Balance datasets : re-sampling technique

2018-01-16 Thread fmcquillan99

Github user fmcquillan99 commented on the issue:

https://github.com/apache/madlib/pull/223
  
Regarding (2) and (3) above: it looks like it does not fail with 
`'red:7, blue:7'`, but the MADlib convention is `'red=7, blue=7'`, so this 
needs to change to use `=`.
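
To make the suggested convention concrete, here is a minimal sketch of parsing a comma-delimited `class=count` string. It is not the code in `balance_sample.py_in`; the function name `parse_class_sizes` and its error handling are assumptions for illustration.

```python
import re


def parse_class_sizes(spec):
    """Parse a string such as 'red=7, blue=7' into {'red': 7, 'blue': 7}.

    Hypothetical helper: accepts only '=' as the separator and tolerates
    whitespace around commas and around '='.
    """
    sizes = {}
    for item in spec.split(','):
        item = item.strip()
        if not item:
            continue  # ignore empty entries such as a trailing comma
        match = re.match(r'^(.+?)\s*=\s*(\d+)$', item)
        if match is None:
            raise ValueError("expected class=count, got %r" % item)
        sizes[match.group(1).strip()] = int(match.group(2))
    return sizes


print(parse_class_sizes('red=7, blue=7'))
```

Stripping each item before matching means the space after the comma does not affect the second and later entries.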

(4)
Seems to take only the 1st param in 
```
DROP TABLE IF EXISTS output_table;
SELECT madlib.balance_sample(
  'flags', -- Source table
  'output_table',  -- Output table
  'mainhue',   -- Class column
  'red:7, blue:7');-- Want 7 reds and 7 blues
SELECT * FROM output_table ORDER BY mainhue, name;
```
which produces 7 red but leaves 5 blue (should be 7)
```
 id |    name     | landmass | zone | area | population | language | colours | mainhue
----+-------------+----------+------+------+------------+----------+---------+---------
  1 | Argentina   |        2 |    3 | 2777 |         28 |        2 |       2 | blue
  2 | Australia   |        6 |    2 | 7690 |         15 |        1 |       3 | blue
  8 | Greece      |        3 |    1 |  132 |         10 |        6 |       2 | blue
  9 | Guatemala   |        1 |    4 |  109 |          8 |        2 |       2 | blue
 17 | Sweden      |        3 |    1 |  450 |          8 |        6 |       2 | blue
  4 | Brazil      |        2 |    3 | 8512 |        119 |        6 |       4 | green
 11 | Jamaica     |        1 |    4 |   11 |          2 |        1 |       3 | green
 13 | Mexico      |        1 |    4 | 1973 |         77 |        2 |       4 | green
  3 | Austria     |        3 |    1 |   84 |          8 |        4 |       2 | red
  5 | Canada      |        1 |    4 | 9976 |         24 |        1 |       2 | red
  7 | Denmark     |        3 |    1 |   43 |          5 |        6 |       2 | red
 12 | Luxembourg  |        3 |    1 |    3 |          0 |        4 |       3 | red
 15 | Portugal    |        3 |    4 |   92 |         10 |        6 |       5 | red
 18 | Switzerland |        3 |    1 |   41 |          6 |        4 |       2 | red
 19 | UK          |        3 |    4 |  245 |         56 |        1 |       3 | red
 10 | Ireland     |        3 |    4 |   70 |          3 |        1 |       3 | white
 20 | USA         |        1 |    4 | 9363 |        231 |        1 |       3 | white
(17 rows)
```


---

[GitHub] madlib pull request #223: Balance datasets : re-sampling technique

2018-01-16 Thread orhankislal

Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/223#discussion_r161297926
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---
@@ -0,0 +1,994 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file EXCEPT in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+import math
+import plpy
+import re
+from collections import defaultdict
+from fractions import Fraction
+from utilities.control import MinWarning
+from utilities.utilities import _assert
+from utilities.utilities import unique_string
+from utilities.validate_args import table_exists
+from utilities.validate_args import columns_exist_in_table
+from utilities.validate_args import table_is_empty
+from utilities.validate_args import get_cols
+from utilities.utilities import py_list_to_sql_string
+
+
+m4_changequote(`')
+
+def balance_sample(schema_madlib, source_table, output_table, class_col,
+class_sizes, output_table_size, grouping_cols, with_replacement, 
**kwargs):
+
+"""
+Balance sampling function
+Args:
+@param source_table   Input table name.
+@param output_table   Output table name.
+@param class_col  Name of the column containing the class 
to be
+  balanced.
+@param class_size Parameter to define the size of the 
different
+  class values.
+@param output_table_size  Desired size of the output data set.
+@param grouping_cols  The columns columns that defines the 
grouping.
+@param with_replacement   The sampling method.
+
+"""
+with MinWarning("warning"):
+
+class_counts = unique_string(desp='class_counts')
+desired_sample_per_class = 
unique_string(desp='desired_sample_per_class')
+desired_counts = unique_string(desp='desired_counts')
+
+if not class_sizes or class_sizes.strip().lower() in ('null', ''):
+class_sizes = 'uniform'
+
+_validate_strs(source_table, output_table, class_col, class_sizes,
+output_table_size, grouping_cols, with_replacement)
+
+source_table_columns = ','.join(get_cols(source_table))
+grp_by = "GROUP BY {0}".format(class_col)
+
+_create_frequency_distribution(class_counts, source_table, 
class_col)
+temp_views = [class_counts]
+
+if class_sizes.lower() == 'undersample' and not with_replacement:
+"""
+Random undersample without replacement.
+Randomly order the rows and give a unique (per class)
+identifier to each one.
+Select rows that have identifiers under the target limit.
+"""
+_undersampling_with_no_replacement(source_table, output_table, 
class_col,
+class_sizes, output_table_size, grouping_cols, 
with_replacement,
+class_counts, source_table_columns)
+
+_delete_temp_views(temp_views)
+return
+
+"""
+Create views for true and desired sample sizes of classes
+"""
+"""
+include_unsampled_classes tracks is unsampled classes are 
desired or not.
+include_unsampled_classes is always true in output_table_size 
Null cases but changes given values of desired sample class sizes in 
comma-delimited classsize paramter.
+"""
+include_unsampled_classes = True
+sampling_with_comma_delimited_class_sizes = class_sizes.find(':') 
> 0
+
+if sampling_with_comma_delimited_class_sizes:
+"""
+Compute sample sizes based on
+comman-delimited list of class_sizes
+and/or output_table_size
+"""
+class_sizes, include_unsampled_classes =

[GitHub] madlib pull request #223: Balance datasets : re-sampling technique

2018-01-16 Thread orhankislal

Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/223#discussion_r161850906
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---
@@ -0,0 +1,994 @@

[GitHub] madlib pull request #223: Balance datasets : re-sampling technique

2018-01-16 Thread orhankislal

Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/223#discussion_r161299042
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---
@@ -0,0 +1,994 @@

[GitHub] madlib pull request #223: Balance datasets : re-sampling technique

2018-01-16 Thread orhankislal

Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/223#discussion_r161296957
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---
@@ -0,0 +1,994 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file EXCEPT in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+import math
+import plpy
+import re
+from collections import defaultdict
+from fractions import Fraction
+from utilities.control import MinWarning
+from utilities.utilities import _assert
+from utilities.utilities import unique_string
+from utilities.validate_args import table_exists
+from utilities.validate_args import columns_exist_in_table
+from utilities.validate_args import table_is_empty
+from utilities.validate_args import get_cols
+from utilities.utilities import py_list_to_sql_string
+
+
+m4_changequote(`')
+
+def balance_sample(schema_madlib, source_table, output_table, class_col,
+class_sizes, output_table_size, grouping_cols, with_replacement, 
**kwargs):
+
+"""
+Balance sampling function
+Args:
+@param source_table   Input table name.
+@param output_table   Output table name.
+@param class_col  Name of the column containing the class 
to be
+  balanced.
+@param class_size Parameter to define the size of the 
different
+  class values.
+@param output_table_size  Desired size of the output data set.
+@param grouping_cols  The columns columns that defines the 
grouping.
+@param with_replacement   The sampling method.
+
+"""
+with MinWarning("warning"):
+
+class_counts = unique_string(desp='class_counts')
+desired_sample_per_class = 
unique_string(desp='desired_sample_per_class')
+desired_counts = unique_string(desp='desired_counts')
+
+if not class_sizes or class_sizes.strip().lower() in ('null', ''):
+class_sizes = 'uniform'
+
+_validate_strs(source_table, output_table, class_col, class_sizes,
+output_table_size, grouping_cols, with_replacement)
+
+source_table_columns = ','.join(get_cols(source_table))
+grp_by = "GROUP BY {0}".format(class_col)
+
+_create_frequency_distribution(class_counts, source_table, 
class_col)
+temp_views = [class_counts]
+
+if class_sizes.lower() == 'undersample' and not with_replacement:
+"""
+Random undersample without replacement.
+Randomly order the rows and give a unique (per class)
+identifier to each one.
+Select rows that have identifiers under the target limit.
+"""
+_undersampling_with_no_replacement(source_table, output_table, 
class_col,
+class_sizes, output_table_size, grouping_cols, 
with_replacement,
+class_counts, source_table_columns)
+
+_delete_temp_views(temp_views)
+return
+
+"""
+Create views for true and desired sample sizes of classes
+"""
+"""
+include_unsampled_classes tracks is unsampled classes are 
desired or not.
+include_unsampled_classes is always true in output_table_size 
Null cases but changes given values of desired sample class sizes in 
comma-delimited classsize paramter.
--- End diff --

is -> if ?
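
For context, the undersampling branch in the diff above is described in its docstring as: randomly order the rows, give each a unique per-class identifier, and select rows whose identifier is under the target limit. A minimal in-memory Python sketch of that idea follows. The function `undersample` and its parameters are hypothetical and do not reflect how the module builds its SQL.

```python
import random
from collections import defaultdict


def undersample(rows, class_of, target=None, seed=None):
    """Random undersampling without replacement (illustrative only).

    rows: list of records; class_of: function returning a row's class.
    target: rows to keep per class; defaults to the smallest class size.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[class_of(row)].append(row)
    if not by_class:
        return []
    if target is None:
        target = min(len(members) for members in by_class.values())
    sample = []
    for members in by_class.values():
        shuffled = members[:]
        rng.shuffle(shuffled)  # random order within the class
        # The 1-based position in the shuffled order acts as the per-class
        # identifier; keep rows whose identifier is within the target limit.
        sample.extend(row for rank, row in enumerate(shuffled, start=1)
                      if rank <= target)
    return sample


flags = [("Argentina", "blue"), ("Australia", "blue"), ("Greece", "blue"),
         ("Brazil", "green"), ("Austria", "red"), ("Canada", "red")]
picked = undersample(flags, class_of=lambda row: row[1], seed=0)
print(sorted(picked, key=lambda row: row[1]))
```

In this sample the smallest class has one row, so each class contributes exactly one row to the output.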


---

[GitHub] madlib pull request #223: Balance datasets : re-sampling technique

2018-01-16 Thread orhankislal

Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/223#discussion_r161297074
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---
@@ -0,0 +1,994 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file EXCEPT in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+import math
+import plpy
+import re
+from collections import defaultdict
+from fractions import Fraction
+from utilities.control import MinWarning
+from utilities.utilities import _assert
+from utilities.utilities import unique_string
+from utilities.validate_args import table_exists
+from utilities.validate_args import columns_exist_in_table
+from utilities.validate_args import table_is_empty
+from utilities.validate_args import get_cols
+from utilities.utilities import py_list_to_sql_string
+
+
+m4_changequote(`')
+
+def balance_sample(schema_madlib, source_table, output_table, class_col,
+class_sizes, output_table_size, grouping_cols, with_replacement, 
**kwargs):
+
+"""
+Balance sampling function
+Args:
+@param source_table   Input table name.
+@param output_table   Output table name.
+@param class_col  Name of the column containing the class 
to be
+  balanced.
+@param class_size Parameter to define the size of the 
different
+  class values.
+@param output_table_size  Desired size of the output data set.
+@param grouping_cols  The columns columns that defines the 
grouping.
+@param with_replacement   The sampling method.
+
+"""
+with MinWarning("warning"):
+
+class_counts = unique_string(desp='class_counts')
+desired_sample_per_class = 
unique_string(desp='desired_sample_per_class')
+desired_counts = unique_string(desp='desired_counts')
+
+if not class_sizes or class_sizes.strip().lower() in ('null', ''):
+class_sizes = 'uniform'
+
+_validate_strs(source_table, output_table, class_col, class_sizes,
+output_table_size, grouping_cols, with_replacement)
+
+source_table_columns = ','.join(get_cols(source_table))
+grp_by = "GROUP BY {0}".format(class_col)
+
+_create_frequency_distribution(class_counts, source_table, 
class_col)
+temp_views = [class_counts]
+
+if class_sizes.lower() == 'undersample' and not with_replacement:
+"""
+Random undersample without replacement.
+Randomly order the rows and give a unique (per class)
+identifier to each one.
+Select rows that have identifiers under the target limit.
+"""
+_undersampling_with_no_replacement(source_table, output_table, 
class_col,
+class_sizes, output_table_size, grouping_cols, 
with_replacement,
+class_counts, source_table_columns)
+
+_delete_temp_views(temp_views)
+return
+
+"""
+Create views for true and desired sample sizes of classes
+"""
+"""
+include_unsampled_classes tracks is unsampled classes are 
desired or not.
+include_unsampled_classes is always true in output_table_size 
Null cases but changes given values of desired sample class sizes in 
comma-delimited classsize paramter.
+"""
+include_unsampled_classes = True
+sampling_with_comma_delimited_class_sizes = class_sizes.find(':') 
> 0
+
+if sampling_with_comma_delimited_class_sizes:
+"""
+Compute sample sizes based on
+comman-delimited list of class_sizes
--- End diff --

comman -> comma


---

[GitHub] madlib pull request #223: Balance datasets : re-sampling technique

2018-01-16 Thread orhankislal

Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/223#discussion_r161300298
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---
@@ -0,0 +1,994 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file EXCEPT in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+import math
+import plpy
+import re
+from collections import defaultdict
+from fractions import Fraction
+from utilities.control import MinWarning
+from utilities.utilities import _assert
+from utilities.utilities import unique_string
+from utilities.validate_args import table_exists
+from utilities.validate_args import columns_exist_in_table
+from utilities.validate_args import table_is_empty
+from utilities.validate_args import get_cols
+from utilities.utilities import py_list_to_sql_string
+
+
+m4_changequote(`')
+
+def balance_sample(schema_madlib, source_table, output_table, class_col,
+class_sizes, output_table_size, grouping_cols, with_replacement, **kwargs):
+
+"""
+Balance sampling function
+Args:
+@param source_table   Input table name.
+@param output_table   Output table name.
+@param class_col  Name of the column containing the class to be
+  balanced.
+@param class_size Parameter to define the size of the different
+  class values.
+@param output_table_size  Desired size of the output data set.
+@param grouping_cols  The columns columns that defines the grouping.
+@param with_replacement   The sampling method.
+
+"""
+with MinWarning("warning"):
+
+class_counts = unique_string(desp='class_counts')
+desired_sample_per_class = unique_string(desp='desired_sample_per_class')
+desired_counts = unique_string(desp='desired_counts')
+
+if not class_sizes or class_sizes.strip().lower() in ('null', ''):
+class_sizes = 'uniform'
+
+_validate_strs(source_table, output_table, class_col, class_sizes,
+output_table_size, grouping_cols, with_replacement)
+
+source_table_columns = ','.join(get_cols(source_table))
+grp_by = "GROUP BY {0}".format(class_col)
+
+_create_frequency_distribution(class_counts, source_table, class_col)
+temp_views = [class_counts]
+
+if class_sizes.lower() == 'undersample' and not with_replacement:
+"""
+Random undersample without replacement.
+Randomly order the rows and give a unique (per class)
+identifier to each one.
+Select rows that have identifiers under the target limit.
+"""
+_undersampling_with_no_replacement(source_table, output_table, class_col,
+class_sizes, output_table_size, grouping_cols, with_replacement,
+class_counts, source_table_columns)
+
+_delete_temp_views(temp_views)
+return
+
+"""
+Create views for true and desired sample sizes of classes
+"""
+"""
+include_unsampled_classes tracks is unsampled classes are desired or not.
+include_unsampled_classes is always true in output_table_size Null cases but changes given values of desired sample class sizes in comma-delimited classsize paramter.
+"""
+include_unsampled_classes = True
+sampling_with_comma_delimited_class_sizes = class_sizes.find(':') > 0
+
+if sampling_with_comma_delimited_class_sizes:
+"""
+Compute sample sizes based on
+comman-delimited list of class_sizes
+and/or output_table_size
+"""
+class_sizes, include_unsampled_classes =

[GitHub] madlib pull request #223: Balance datasets : re-sampling technique

2018-01-16 Thread orhankislal

Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/223#discussion_r161845440
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---

[GitHub] madlib pull request #223: Balance datasets : re-sampling technique

2018-01-16 Thread orhankislal

Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/223#discussion_r161863965
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---

[GitHub] madlib pull request #223: Balance datasets : re-sampling technique

2018-01-16 Thread orhankislal

Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/223#discussion_r161864354
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---

[GitHub] madlib pull request #223: Balance datasets : re-sampling technique

2018-01-16 Thread orhankislal

Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/223#discussion_r161865238
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---

