Re: DML transform() function

Shirish Tatikonda Wed, 09 Dec 2015 18:00:03 -0800

Hi Deron,

As the error says ("A column can not be binned and scaled."), no column can
be subjected to both *binning* and *scaling*, because the combination does
not make sense. *Binning* turns a scale column with continuous values into a
categorical column, whereas *scaling* can only be applied to continuous
values.
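
To make the distinction concrete, here is a minimal Python sketch (not the SystemML implementation) that applies equi-width binning and z-score scaling to the saleprice column from the data.csv quoted below. The helper names equi_width_bin and z_score are hypothetical, and the use of the population standard deviation is an assumption; the engine may well use a different estimator.

```python
from statistics import mean, pstdev

def equi_width_bin(values, num_bins):
    """Map each continuous value to a 1-based bin ID (a categorical code)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    # The maximum value would land one past the last bin, so clamp it.
    return [min(int((v - lo) / width) + 1, num_bins) for v in values]

def z_score(values):
    """Center and rescale continuous values (assumes population std dev)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# saleprice column from the quoted data.csv
saleprice = [929, 695, 902, 888, 812, 927, 672, 872, 799]

bins = equi_width_bin(saleprice, 3)   # small integers: bin labels
scaled = z_score(saleprice)           # real numbers centered on zero

print(bins)
print([round(s, 3) for s in scaled])
```

After binning, the column holds bin labels rather than measurements, so subtracting a mean or dividing by a standard deviation would no longer be meaningful.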


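The error does *not* mean that *scaling* is unsupported. We do support
*scaling*. In your spec, "sqft" and "saleprice" are listed under both "bin"
and "scale", which is what triggers the error.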

At some point, I wanted to add the following table (which is currently
present in Java code as comments) to our documentation to indicate
transformations that can be used *simultaneously* on a single column. While
you are at it, could you make sure it is added to the documentation?

x indicates the combination is invalid.
* indicates the combination is allowed.
- indicates the combination is not applicable.

       OMIT  MVI  RCD  BIN  DCD  SCL
OMIT    -     x    *    *    *    *
MVI     x     -    *    *    *    *
RCD     *     *    -    x    *    x
BIN     *     *    x    -    *    x
DCD     *     *    *    *    -    x
SCL     *     *    x    x    x    -

OMIT = Missing value handling by *omitting* rows
MVI  = Missing value handling by *imputation*
RCD  = Recoding
BIN  = Binning
DCD  = Dummycoding
SCL  = Scaling
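
As a rough illustration of how the table could be applied, the following Python sketch checks a transform spec for columns that combine transformations marked x. It is not the check performed in the Java code; the function names are hypothetical, and the mapping from the abbreviations above to the spec keys ("omit", "impute", "recode", "bin", "dummycode", "scale") is an assumption based on the spec quoted below.

```python
# Pairs marked "x" in the table: invalid on the same column.
INVALID_PAIRS = [
    ("omit", "impute"),
    ("recode", "bin"),
    ("recode", "scale"),
    ("bin", "scale"),
    ("dummycode", "scale"),
]

def columns_per_transform(spec):
    """Collect the column names listed under each transformation key."""
    return {
        kind: {e["name"] if isinstance(e, dict) else e for e in entries}
        for kind, entries in spec.items()
    }

def find_conflicts(spec):
    """Return sorted (column, transform_a, transform_b) for every invalid pair."""
    cols = columns_per_transform(spec)
    conflicts = []
    for a, b in INVALID_PAIRS:
        for col in cols.get(a, set()) & cols.get(b, set()):
            conflicts.append((col, a, b))
    return sorted(conflicts)

# Abbreviated version of the spec from the original message.
spec = {
    "omit": ["zipcode"],
    "recode": ["zipcode", "district"],
    "bin": [
        {"name": "saleprice", "method": "equi-width", "numbins": 3},
        {"name": "sqft", "method": "equi-width", "numbins": 4},
    ],
    "dummycode": ["district", "saleprice", "sqft"],
    "scale": [
        {"name": "sqft", "method": "mean-subtraction"},
        {"name": "askingprice", "method": "z-score"},
    ],
}

for col, a, b in find_conflicts(spec):
    print(f"column {col!r}: {a} and {b} cannot be combined")
```

With this abbreviated spec, only "sqft" is flagged: it is binned and scaled, and also dummycoded and scaled.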

Let me know if you have any further questions.

Thank you,
Shirish


On Wed, Dec 9, 2015 at 4:53 PM, Deron Eriksson <[email protected]>
wrote:

> Hi,
>
> I'm working on updating the online docs for the DML transform() function
> since a couple things didn't copy over in the conversion to markdown.
> However, I've run into an issue when I execute the transform() example. In
> summary, is the "scale" transformation no longer allowed, and "bin" is
> allowed?
>
> I did the following:
>
> I created data.csv:
>
>
> zipcode,district,sqft,numbedrooms,numbathrooms,floors,view,saleprice,askingprice
> 95141,south,3002,6,3,2,FALSE,929,934
> NA,west,1373,,1,3,FALSE,695,698
> 91312,south,NA,6,2,2,FALSE,902,
> 94555,NA,1835,3,,3,,888,892
> 95141,west,2770,5,2.5,,TRUE,812,816
> 95141,east,2833,6,2.5,2,TRUE,927,
> 96334,NA,1339,6,3,1,FALSE,672,675
> 96334,south,2742,6,2.5,2,FALSE,872,876
> 96334,north,2195,5,2.5,2,FALSE,799,803
>
> I created data.csv.mtd:
>
> {
>     "data_type": "frame",
>     "format": "csv",
>     "sep": ",",
>     "header": true,
>     "na.strings": [ "NA", "" ]
> }
>
> I created data.spec.json:
>
> {
>     "omit": [ "zipcode" ]
>    ,"impute":
>     [ { "name": "district"    , "method": "constant", "value": "south" }
>      ,{ "name": "numbedrooms" , "method": "constant", "value": 2 }
>      ,{ "name": "numbathrooms", "method": "constant", "value": 1 }
>      ,{ "name": "floors"      , "method": "constant", "value": 1 }
>      ,{ "name": "view"        , "method": "global_mode" }
>      ,{ "name": "askingprice" , "method": "global_mean" }
>     ]
>
>     ,"recode":
>     [ "zipcode", "district", "numbedrooms", "numbathrooms", "floors",
> "view" ]
>
>     ,"bin":
>     [ { "name": "saleprice"  , "method": "equi-width", "numbins": 3 }
>      ,{ "name": "sqft"       , "method": "equi-width", "numbins": 4 }
>     ]
>
>     ,"dummycode":
>     [ "district", "numbathrooms", "floors", "view", "saleprice", "sqft" ]
>
>     ,"scale":
>     [ { "name": "sqft", "method": "mean-subtraction" }
>      ,{ "name": "saleprice", "method": "z-score" }
>      ,{ "name": "askingprice", "method": "z-score" }
>     ]
> }
>
> I executed the following DML:
>
> D = read("data.csv");
> tfD = transform(target=D,
>                 transformSpec="data.spec.json",
>                 transformPath="example-transform");
> s = sum(tfD);
> print("Sum = " + s);
>
> This generated the following error:
>
> java.lang.IllegalArgumentException: Invalid transformations on column ID 3.
> A column can not be binned and scaled.
>
> So, I removed the "scale" from data.spec.json:
>
> {
>     "omit": [ "zipcode" ]
>    ,"impute":
>     [ { "name": "district"    , "method": "constant", "value": "south" }
>      ,{ "name": "numbedrooms" , "method": "constant", "value": 2 }
>      ,{ "name": "numbathrooms", "method": "constant", "value": 1 }
>      ,{ "name": "floors"      , "method": "constant", "value": 1 }
>      ,{ "name": "view"        , "method": "global_mode" }
>      ,{ "name": "askingprice" , "method": "global_mean" }
>     ]
>
>     ,"recode":
>     [ "zipcode", "district", "numbedrooms", "numbathrooms", "floors",
> "view" ]
>
>     ,"bin":
>     [ { "name": "saleprice"  , "method": "equi-width", "numbins": 3 }
>      ,{ "name": "sqft"       , "method": "equi-width", "numbins": 4 }
>     ]
>
>     ,"dummycode":
>     [ "district", "numbathrooms", "floors", "view", "saleprice", "sqft" ]
>
> }
>
> This generated:
>
> java.lang.RuntimeException: Encountered "NA" in column ID "3", when
> expecting a numeric value. Consider adding "NA" to na.strings, along with
> an appropriate imputation method.
>
> So, I set "sqft" to be "global_mean" in the "impute" section of the spec.
>
> {
>     "omit": [ "zipcode" ]
>    ,"impute":
>     [ { "name": "district"    , "method": "constant", "value": "south" }
>      ,{ "name": "numbedrooms" , "method": "constant", "value": 2 }
>      ,{ "name": "numbathrooms", "method": "constant", "value": 1 }
>      ,{ "name": "floors"      , "method": "constant", "value": 1 }
>      ,{ "name": "view"        , "method": "global_mode" }
>      ,{ "name": "askingprice" , "method": "global_mean" }
>      ,{ "name": "sqft"        , "method": "global_mean" }
>     ]
>
>     ,"recode":
>     [ "zipcode", "district", "numbedrooms", "numbathrooms", "floors",
> "view" ]
>
>     ,"bin":
>     [ { "name": "saleprice"  , "method": "equi-width", "numbins": 3 }
>      ,{ "name": "sqft"       , "method": "equi-width", "numbins": 4 }
>     ]
>
>     ,"dummycode":
>     [ "district", "numbathrooms", "floors", "view", "saleprice", "sqft" ]
>
> }
>
> This allowed the DML to execute successfully.
>
> So, is "scale" not allowed anymore? And "bin" is allowed (despite the
> message saying it isn't allowed)?
>
> Thank you,
> Deron
>
