[ 
https://issues.apache.org/jira/browse/MADLIB-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367980#comment-16367980
 ] 

ASF GitHub Bot commented on MADLIB-1202:
----------------------------------------

Github user jingyimei commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/234#discussion_r168891033
  
    --- Diff: src/ports/postgres/modules/utilities/encode_categorical.py_in ---
    @@ -317,7 +317,19 @@ class CategoricalEncoder(object):
     
                 if self.output_type not in ('array', 'svec'):
                     if not self._output_dictionary:
    -                    value_names = {None: 'NULL',
    +                    ## MADLIB-1202
    +                    ## In postgres, boolean variables are always saved
    +                    ## as 'True', 'False' with the first letter as capital,
    +                    ## which will cause the generated column name as
    +                    ## <boolean column name>_True/False that needs double
    +                    ## quoting to query. To make it more convnient to user,
    +                    ## we cast them to lower case true/false so that the
    +                    ## generated column name is <boolean column 
name>_true/false
    +                    ## The same logic applied to _null and _misc strs
    +                    if v in ('True', 'False'):
    --- End diff --
    
    We can't do isinstance(v, bool) here since v is either None or a string.  
In method build_output_table(), distinct_values(which is a list of string and 
None) got passed to self._build_encoding_str() method, and 
self._build_encoding_str() will take the string value. 
    
    We decided to take another approach here. In self._get_distinct_values() 
and self._get_top_values(),we cast the column value to text in query and 
postgres will cast boolean column value 'True/False' to lowercase 'true/false' 
string while keeping the same format for text column. In this case, 
distinct_value will get lowercase true/false for boolean column and keep 
original format of other text column.
    
    Another thing to mention here is, since the output dictionary table also 
relies on whatever value from list distinct_values, if a user specifies 
creating dictionary tables, he will get lowercase 'true/false' for the boolean 
column in output dictionary table after this commit. 


> encode_categorical_variables() creates all lower case column names for 
> boolean columns
> --------------------------------------------------------------------------------------
>
>                 Key: MADLIB-1202
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1202
>             Project: Apache MADlib
>          Issue Type: Improvement
>            Reporter: Jarrod Vawdrey
>            Assignee: Jingyi Mei
>            Priority: Minor
>             Fix For: v1.14
>
>
>  
> It would be handy if encode_categorical_variables() created lower case column 
> names for boolean columns vs upper case that require double quoting to query. 
> Current implementation generates "<boolean column name>_True" and "<boolean 
> column name>_False".
> Improvement to generate <boolean column name>_true and <boolean column 
> name>_false.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to