Github user jingyimei commented on a diff in the pull request:
https://github.com/apache/madlib/pull/234#discussion_r168891033
--- Diff: src/ports/postgres/modules/utilities/encode_categorical.py_in ---
@@ -317,7 +317,19 @@ class CategoricalEncoder(object):
if self.output_type not in ('array', 'svec'):
if not self._output_dictionary:
- value_names = {None: 'NULL',
+ ## MADLIB-1202
+ ## In postgres, boolean variables are always saved
+ ## as 'True', 'False' with the first letter as capital,
+ ## which will cause the generated column name as
+ ## <boolean column name>_True/False that needs double
+ ## quoting to query. To make it more convnient to user,
+ ## we cast them to lower case true/false so that the
+ ## generated column name is <boolean column
name>_true/false
+ ## The same logic applied to _null and _misc strs
+ if v in ('True', 'False'):
--- End diff --
We can't do isinstance(v, bool) here since v is either None or a string.
In method build_output_table(), distinct_values(which is a list of string and
None) got passed to self._build_encoding_str() method, and
self._build_encoding_str() will take the string value.
We decided to take another approach here. In self._get_distinct_values()
and self._get_top_values(),we cast the column value to text in query and
postgres will cast boolean column value 'True/False' to lowercase 'true/false'
string while keeping the same format for text column. In this case,
distinct_value will get lowercase true/false for boolean column and keep
original format of other text column.
Another thing to mention here is, since the output dictionary table also
relies on whatever value from list distinct_values, if a user specifies
creating dictionary tables, he will get lowercase 'true/false' for the boolean
column in output dictionary table after this commit.
---