Hi Deron,

As the error says ("A column can not be binned and scaled."), no column can be subjected to both *binning* and *scaling*, because the combination does not make sense. *Binning* turns a scale column with continuous values into a categorical column. *Scaling*, on the other hand, can only be applied to continuous values.
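To make the distinction concrete, here is a minimal Python sketch (not SystemML code). The helper names, the bin-boundary convention, and the use of the sample standard deviation are assumptions made for illustration only; the saleprice values come from the example data.csv quoted below.

```python
import statistics


def equi_width_bins(values, num_bins):
    """Map each continuous value to a 1-based bin id using equal-width intervals.

    Assumption: the maximum value is clamped into the last bin.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    if width == 0:
        # All values are identical, so everything lands in the first bin.
        return [1 for _ in values]
    return [min(int((v - lo) / width) + 1, num_bins) for v in values]


def z_score(values):
    """Center values on the mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]


saleprice = [929, 695, 902, 888, 812, 927, 672, 872, 799]

bins = equi_width_bins(saleprice, 3)  # small integer category ids
scaled = z_score(saleprice)           # still continuous values

print(bins)
print([round(x, 3) for x in scaled])
```

After binning, the column holds only the category ids 1 to 3, so a mean or standard deviation of those ids is no longer meaningful, which is why the two transformations are not combined on one column.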
The error does *not* mean that *scaling* is unsupported. We do support *scaling*.

At some point, I wanted to add the following table (currently present in the Java code as comments) to our documentation to indicate which transformations can be used *simultaneously* on a single column. While you are at it, could you make sure it is added to the documentation?

x indicates the combination is invalid.
* indicates the combination is allowed.
- indicates the combination is not applicable.

       OMIT  MVI  RCD  BIN  DCD  SCL
OMIT    -     x    *    *    *    *
MVI     x     -    *    *    *    *
RCD     *     *    -    x    *    x
BIN     *     *    x    -    *    x
DCD     *     *    *    *    -    x
SCL     *     *    x    x    x    -

OMIT = Missing value handling by *omitting* rows
MVI  = Missing value handling by *imputation*
RCD  = Recoding
BIN  = Binning
DCD  = Dummycoding
SCL  = Scaling

Let me know if you have any further questions.

Thank you,
Shirish

On Wed, Dec 9, 2015 at 4:53 PM, Deron Eriksson <[email protected]> wrote:

> Hi,
>
> I'm working on updating the online docs for the DML transform() function
> since a couple things didn't copy over in the conversion to markdown.
> However, I've run into an issue when I execute the transform() example. In
> summary, is the "scale" transformation no longer allowed, and "bin" is
> allowed?
>
> I did the following:
>
> I created data.csv:
>
> zipcode,district,sqft,numbedrooms,numbathrooms,floors,view,saleprice,askingprice
> 95141,south,3002,6,3,2,FALSE,929,934
> NA,west,1373,,1,3,FALSE,695,698
> 91312,south,NA,6,2,2,FALSE,902,
> 94555,NA,1835,3,,3,,888,892
> 95141,west,2770,5,2.5,,TRUE,812,816
> 95141,east,2833,6,2.5,2,TRUE,927,
> 96334,NA,1339,6,3,1,FALSE,672,675
> 96334,south,2742,6,2.5,2,FALSE,872,876
> 96334,north,2195,5,2.5,2,FALSE,799,803
>
> I created data.csv.mtd:
>
> {
>     "data_type": "frame",
>     "format": "csv",
>     "sep": ",",
>     "header": true,
>     "na.strings": [ "NA", "" ]
> }
>
> I created data.spec.json:
>
> {
>     "omit": [ "zipcode" ]
>    ,"impute":
>     [ { "name": "district"    , "method": "constant", "value": "south" }
>      ,{ "name": "numbedrooms" , "method": "constant", "value": 2 }
>      ,{ "name": "numbathrooms", "method": "constant", "value": 1 }
>      ,{ "name": "floors"      , "method": "constant", "value": 1 }
>      ,{ "name": "view"        , "method": "global_mode" }
>      ,{ "name": "askingprice" , "method": "global_mean" }
>     ]
>
>    ,"recode":
>     [ "zipcode", "district", "numbedrooms", "numbathrooms", "floors",
>       "view" ]
>
>    ,"bin":
>     [ { "name": "saleprice"  , "method": "equi-width", "numbins": 3 }
>      ,{ "name": "sqft"       , "method": "equi-width", "numbins": 4 }
>     ]
>
>    ,"dummycode":
>     [ "district", "numbathrooms", "floors", "view", "saleprice", "sqft" ]
>
>    ,"scale":
>     [ { "name": "sqft", "method": "mean-subtraction" }
>      ,{ "name": "saleprice", "method": "z-score" }
>      ,{ "name": "askingprice", "method": "z-score" }
>     ]
> }
>
> I executed the following DML:
>
> D = read("data.csv");
> tfD = transform(target=D,
>                 transformSpec="data.spec.json",
>                 transformPath="example-transform");
> s = sum(tfD);
> print("Sum = " + s);
>
> This generated the following error:
>
> java.lang.IllegalArgumentException: Invalid transformations on column ID 3.
> A column can not be binned and scaled.
>
> So, I removed the "scale" from data.spec.json:
>
> {
>     "omit": [ "zipcode" ]
>    ,"impute":
>     [ { "name": "district"    , "method": "constant", "value": "south" }
>      ,{ "name": "numbedrooms" , "method": "constant", "value": 2 }
>      ,{ "name": "numbathrooms", "method": "constant", "value": 1 }
>      ,{ "name": "floors"      , "method": "constant", "value": 1 }
>      ,{ "name": "view"        , "method": "global_mode" }
>      ,{ "name": "askingprice" , "method": "global_mean" }
>     ]
>
>    ,"recode":
>     [ "zipcode", "district", "numbedrooms", "numbathrooms", "floors",
>       "view" ]
>
>    ,"bin":
>     [ { "name": "saleprice"  , "method": "equi-width", "numbins": 3 }
>      ,{ "name": "sqft"       , "method": "equi-width", "numbins": 4 }
>     ]
>
>    ,"dummycode":
>     [ "district", "numbathrooms", "floors", "view", "saleprice", "sqft" ]
>
> }
>
> This generated:
>
> java.lang.RuntimeException: Encountered "NA" in column ID "3", when
> expecting a numeric value. Consider adding "NA" to na.strings, along with
> an appropriate imputation method.
>
> So, I set "sqft" to be "global_mean" in the "impute" section of the spec.
>
> {
>     "omit": [ "zipcode" ]
>    ,"impute":
>     [ { "name": "district"    , "method": "constant", "value": "south" }
>      ,{ "name": "numbedrooms" , "method": "constant", "value": 2 }
>      ,{ "name": "numbathrooms", "method": "constant", "value": 1 }
>      ,{ "name": "floors"      , "method": "constant", "value": 1 }
>      ,{ "name": "view"        , "method": "global_mode" }
>      ,{ "name": "askingprice" , "method": "global_mean" }
>      ,{ "name": "sqft"        , "method": "global_mean" }
>     ]
>
>    ,"recode":
>     [ "zipcode", "district", "numbedrooms", "numbathrooms", "floors",
>       "view" ]
>
>    ,"bin":
>     [ { "name": "saleprice"  , "method": "equi-width", "numbins": 3 }
>      ,{ "name": "sqft"       , "method": "equi-width", "numbins": 4 }
>     ]
>
>    ,"dummycode":
>     [ "district", "numbathrooms", "floors", "view", "saleprice", "sqft" ]
>
> }
>
> This allowed the DML to execute successfully.
>
> So, is "scale" not allowed anymore?
> And "bin" is allowed (despite the message saying it isn't allowed)?
>
> Thank you,
> Deron
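
The compatibility table above can also be read as a lookup that flags invalid pairs per column. The following Python sketch is only an illustration of that idea, not the SystemML implementation (the real check lives in the Java code). The mapping from spec keys to table codes and all function names are assumptions; the matrix rows and the spec columns are copied from this thread.

```python
import itertools

# Codes in the order used by the table.
CODES = ["OMIT", "MVI", "RCD", "BIN", "DCD", "SCL"]

# Rows copied from the table: x = invalid, * = allowed, - = not applicable.
TABLE = [
    "- x * * * *",  # OMIT
    "x - * * * *",  # MVI
    "* * - x * x",  # RCD
    "* * x - * x",  # BIN
    "* * * * - x",  # DCD
    "* * x x x -",  # SCL
]
MATRIX = {row: dict(zip(CODES, cells.split())) for row, cells in zip(CODES, TABLE)}

# Assumed mapping from transform spec keys to the table codes.
SPEC_KEY_TO_CODE = {
    "omit": "OMIT", "impute": "MVI", "recode": "RCD",
    "bin": "BIN", "dummycode": "DCD", "scale": "SCL",
}


def column_names(entries):
    """Spec entries are either plain column names or objects with a "name" key."""
    return [e["name"] if isinstance(e, dict) else e for e in entries]


def find_conflicts(spec):
    """Return (column, code_a, code_b) for every pair the table marks with x."""
    per_column = {}
    for key, entries in spec.items():
        for name in column_names(entries):
            per_column.setdefault(name, []).append(SPEC_KEY_TO_CODE[key])
    conflicts = []
    for name, codes in per_column.items():
        for a, b in itertools.combinations(codes, 2):
            if MATRIX[a][b] == "x":
                conflicts.append((name, a, b))
    return conflicts


# Columns per transformation from the first data.spec.json (methods omitted).
SPEC = {
    "omit": ["zipcode"],
    "impute": ["district", "numbedrooms", "numbathrooms", "floors", "view",
               "askingprice"],
    "recode": ["zipcode", "district", "numbedrooms", "numbathrooms", "floors",
               "view"],
    "bin": [{"name": "saleprice"}, {"name": "sqft"}],
    "dummycode": ["district", "numbathrooms", "floors", "view", "saleprice",
                  "sqft"],
    "scale": [{"name": "sqft"}, {"name": "saleprice"}, {"name": "askingprice"}],
}

for name, a, b in find_conflicts(SPEC):
    print(f"{name}: {a} and {b} cannot be combined")
```

For the first spec, this sketch flags sqft and saleprice (BIN with SCL, and DCD with SCL) and nothing else. Dropping the "scale" section leaves no conflicts, which matches what you observed once the missing sqft values were imputed.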
