Hi,
I'm working on updating the online docs for the DML transform() function,
since a couple of things didn't copy over in the conversion to markdown.
However, I've run into an issue when I execute the transform() example. In
summary: is the "scale" transformation no longer allowed, while "bin" still
is?
I did the following:
I created data.csv:
zipcode,district,sqft,numbedrooms,numbathrooms,floors,view,saleprice,askingprice
95141,south,3002,6,3,2,FALSE,929,934
NA,west,1373,,1,3,FALSE,695,698
91312,south,NA,6,2,2,FALSE,902,
94555,NA,1835,3,,3,,888,892
95141,west,2770,5,2.5,,TRUE,812,816
95141,east,2833,6,2.5,2,TRUE,927,
96334,NA,1339,6,3,1,FALSE,672,675
96334,south,2742,6,2.5,2,FALSE,872,876
96334,north,2195,5,2.5,2,FALSE,799,803
I created data.csv.mtd:
{
"data_type": "frame",
"format": "csv",
"sep": ",",
"header": true,
"na.strings": [ "NA", "" ]
}
I created data.spec.json:
{
"omit": [ "zipcode" ]
,"impute":
[ { "name": "district" , "method": "constant", "value": "south" }
,{ "name": "numbedrooms" , "method": "constant", "value": 2 }
,{ "name": "numbathrooms", "method": "constant", "value": 1 }
,{ "name": "floors" , "method": "constant", "value": 1 }
,{ "name": "view" , "method": "global_mode" }
,{ "name": "askingprice" , "method": "global_mean" }
]
,"recode":
[ "zipcode", "district", "numbedrooms", "numbathrooms", "floors",
"view" ]
,"bin":
[ { "name": "saleprice" , "method": "equi-width", "numbins": 3 }
,{ "name": "sqft" , "method": "equi-width", "numbins": 4 }
]
,"dummycode":
[ "district", "numbathrooms", "floors", "view", "saleprice", "sqft" ]
,"scale":
[ { "name": "sqft", "method": "mean-subtraction" }
,{ "name": "saleprice", "method": "z-score" }
,{ "name": "askingprice", "method": "z-score" }
]
}
I executed the following DML:
D = read("data.csv");
tfD = transform(target=D,
transformSpec="data.spec.json",
transformPath="example-transform");
s = sum(tfD);
print("Sum = " + s);
This generated the following error:
java.lang.IllegalArgumentException: Invalid transformations on column ID 3.
A column can not be binned and scaled.
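One possible reading of this message is that it objects to a single column being both binned and scaled, rather than to "scale" in general. If column IDs are 1-based positions in the CSV header (an assumption on my part), column ID 3 would be "sqft". Below is a minimal Python sketch, not part of SystemML, that lists the columns named under both "bin" and "scale" in the spec above. The helper name binned_and_scaled is hypothetical.

```python
import json

# The "bin" and "scale" sections copied from data.spec.json above.
SPEC_FRAGMENT = """
{
  "bin": [
    { "name": "saleprice", "method": "equi-width", "numbins": 3 },
    { "name": "sqft", "method": "equi-width", "numbins": 4 }
  ],
  "scale": [
    { "name": "sqft", "method": "mean-subtraction" },
    { "name": "saleprice", "method": "z-score" },
    { "name": "askingprice", "method": "z-score" }
  ]
}
"""


def binned_and_scaled(spec):
    """Hypothetical helper: sorted column names listed under both "bin" and "scale"."""
    binned = {entry["name"] for entry in spec.get("bin", [])}
    scaled = {entry["name"] for entry in spec.get("scale", [])}
    return sorted(binned & scaled)


conflicts = binned_and_scaled(json.loads(SPEC_FRAGMENT))
print(conflicts)  # ['saleprice', 'sqft']
```

Under that reading, "sqft" and "saleprice" are the overlapping columns, and "askingprice" is only scaled.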
So, I removed the "scale" section from data.spec.json:
{
"omit": [ "zipcode" ]
,"impute":
[ { "name": "district" , "method": "constant", "value": "south" }
,{ "name": "numbedrooms" , "method": "constant", "value": 2 }
,{ "name": "numbathrooms", "method": "constant", "value": 1 }
,{ "name": "floors" , "method": "constant", "value": 1 }
,{ "name": "view" , "method": "global_mode" }
,{ "name": "askingprice" , "method": "global_mean" }
]
,"recode":
[ "zipcode", "district", "numbedrooms", "numbathrooms", "floors",
"view" ]
,"bin":
[ { "name": "saleprice" , "method": "equi-width", "numbins": 3 }
,{ "name": "sqft" , "method": "equi-width", "numbins": 4 }
]
,"dummycode":
[ "district", "numbathrooms", "floors", "view", "saleprice", "sqft" ]
}
This generated:
java.lang.RuntimeException: Encountered "NA" in column ID "3", when
expecting a numeric value. Consider adding "NA" to na.strings, along with
an appropriate imputation method.
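To see which columns this could apply to, here is a rough Python sketch (again, not SystemML code) that scans data.csv for values matching the na.strings from data.csv.mtd and reports columns that are neither omitted nor imputed in the spec. It assumes omitted columns do not need imputation. The function name columns_missing_imputation is hypothetical.

```python
import csv
import io

# Contents of data.csv from above.
DATA = """zipcode,district,sqft,numbedrooms,numbathrooms,floors,view,saleprice,askingprice
95141,south,3002,6,3,2,FALSE,929,934
NA,west,1373,,1,3,FALSE,695,698
91312,south,NA,6,2,2,FALSE,902,
94555,NA,1835,3,,3,,888,892
95141,west,2770,5,2.5,,TRUE,812,816
95141,east,2833,6,2.5,2,TRUE,927,
96334,NA,1339,6,3,1,FALSE,672,675
96334,south,2742,6,2.5,2,FALSE,872,876
96334,north,2195,5,2.5,2,FALSE,799,803
"""

NA_STRINGS = {"NA", ""}  # "na.strings" from data.csv.mtd
OMITTED = {"zipcode"}    # "omit" from data.spec.json
IMPUTED = {              # names under "impute" in data.spec.json
    "district", "numbedrooms", "numbathrooms",
    "floors", "view", "askingprice",
}


def columns_missing_imputation(text):
    """Hypothetical helper: columns with missing values that are not omitted or imputed."""
    with_missing = set()
    for row in csv.DictReader(io.StringIO(text)):
        for name, value in row.items():
            if value in NA_STRINGS:
                with_missing.add(name)
    # Assumption: omitted columns are exempt from the imputation requirement.
    return sorted(with_missing - OMITTED - IMPUTED)


uncovered = columns_missing_imputation(DATA)
print(uncovered)  # ['sqft']
```

With this data and spec, the only column left uncovered is "sqft".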
So, I added a "global_mean" imputation for "sqft" to the "impute" section of the spec:
{
"omit": [ "zipcode" ]
,"impute":
[ { "name": "district" , "method": "constant", "value": "south" }
,{ "name": "numbedrooms" , "method": "constant", "value": 2 }
,{ "name": "numbathrooms", "method": "constant", "value": 1 }
,{ "name": "floors" , "method": "constant", "value": 1 }
,{ "name": "view" , "method": "global_mode" }
,{ "name": "askingprice" , "method": "global_mean" }
,{ "name": "sqft" , "method": "global_mean" }
]
,"recode":
[ "zipcode", "district", "numbedrooms", "numbathrooms", "floors",
"view" ]
,"bin":
[ { "name": "saleprice" , "method": "equi-width", "numbins": 3 }
,{ "name": "sqft" , "method": "equi-width", "numbins": 4 }
]
,"dummycode":
[ "district", "numbathrooms", "floors", "view", "saleprice", "sqft" ]
}
This allowed the DML to execute successfully.
So, is "scale" not allowed anymore? And is "bin" allowed (despite the
message saying it isn't)?
Thank you,
Deron