[GitHub] madlib pull request #234: Create lower case column name in encode_categorica...

jingyimei Fri, 16 Feb 2018 14:56:19 -0800

Github user jingyimei commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/234#discussion_r168891033
  
    --- Diff: src/ports/postgres/modules/utilities/encode_categorical.py_in ---
    @@ -317,7 +317,19 @@ class CategoricalEncoder(object):
     
                 if self.output_type not in ('array', 'svec'):
                     if not self._output_dictionary:
    -                    value_names = {None: 'NULL',
    +                    ## MADLIB-1202
    +                    ## In postgres, boolean variables are always saved
    +                    ## as 'True', 'False' with the first letter as capital,
    +                    ## which will cause the generated column name as
    +                    ## <boolean column name>_True/False that needs double
    +                    ## quoting to query. To make it more convnient to user,
    +                    ## we cast them to lower case true/false so that the
    +                    ## generated column name is <boolean column 
name>_true/false
    +                    ## The same logic applied to _null and _misc strs
    +                    if v in ('True', 'False'):
    --- End diff --
    
    We can't do isinstance(v, bool) here since v is either None or a string.  
In method build_output_table(), distinct_values(which is a list of string and 
None) got passed to self._build_encoding_str() method, and 
self._build_encoding_str() will take the string value. 
    
    We decided to take another approach here. In self._get_distinct_values() 
and self._get_top_values(),we cast the column value to text in query and 
postgres will cast boolean column value 'True/False' to lowercase 'true/false' 
string while keeping the same format for text column. In this case, 
distinct_value will get lowercase true/false for boolean column and keep 
original format of other text column.
    
    Another thing to mention here is, since the output dictionary table also 
relies on whatever value from list distinct_values, if a user specifies 
creating dictionary tables, he will get lowercase 'true/false' for the boolean 
column in output dictionary table after this commit.

---

[GitHub] madlib pull request #234: Create lower case column name in encode_categorica...

Reply via email to