On Wed, Nov 07, 2018 at 01:27:59PM -0500, Rick Hedin wrote:
> Hi, Ryan.
>
> Hmm.  Interesting.  Always numeric, eh?
Yeah, mlpack is built on the Armadillo matrix library.  I'm not familiar
with CRM114, but I would imagine it is doing something that amounts to
one-hot encoding.

> I could convert our data into numeric values, as you suggest, but I have
> some misgivings.  Maybe I should say some more about the data records that
> stream in.
>
> 1. Some of our fields are encoded values.  So for example SERVICE_TYPE =
> PSYCHIC_READING.  With other possible values being PALM_READING,
> CASTING_STICKS, TAROT_CARDS.  (We don't really offer psychic readings.)
>
> We should be able to assign a number to each of the possible values without
> any problem.

Right.  The typical way to handle something like this in a machine learning
library is one-hot encoding: instead of having one dimension for
SERVICE_TYPE, you'd have a dimension SERVICE_TYPE_PSYCHIC_READING that
takes a value of 0 or 1, a dimension SERVICE_TYPE_PALM_READING that takes a
value of 0 or 1, and so on for each possible value.  This makes your data
matrix pretty big, though.  There are a few mlpack algorithms, like the
decision tree (mlpack_decision_tree from the command line), that support
categorical variables directly; data for those can be loaded with the .arff
file format.

> 2. Some of our fields are numeric values.  So for example AMOUNT_CHARGED =
> 9.95.
>
> I bet MLPack could handle these directly.

Yep, no need to change these.

> 3. Some of our fields are free text fields.  For example COMMENT =
> "Customer seemed agitated.  I couldn't get a clear reading."
>
> We could create a big dictionary, and map words to numbers.  Leaving out
> stemming and phrases.  But that would truly be a very big dictionary.  And
> it's quite likely that information in the comment might be useful for
> determining whether the transaction is fraudulent.
>
> My previous experience, CRM114, handles text swimmingly.  But it doesn't
> handle numeric fields at all.  (Other than as a very peculiar looking
> number.)  Perhaps neither engine is really appropriate.
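As an aside, here is a minimal sketch of the one-hot encoding idea for a
categorical field like SERVICE_TYPE.  It's written in Python just for
illustration (mlpack itself is C++/Armadillo, and none of these names are
mlpack API); the category values are the ones from your example:

```python
# One-hot encoding sketch for a categorical field like SERVICE_TYPE.
# Category names are from the example in this thread; the function names
# are illustrative, not part of any library.

SERVICE_TYPES = ["PSYCHIC_READING", "PALM_READING",
                 "CASTING_STICKS", "TAROT_CARDS"]

def one_hot(value, categories):
    """Return a 0/1 vector with a 1 in the slot for `value`."""
    vec = [0] * len(categories)
    vec[categories.index(value)] = 1
    return vec

print(one_hot("PALM_READING", SERVICE_TYPES))  # [0, 1, 0, 0]
```

Each record's SERVICE_TYPE column would be replaced by four 0/1 columns,
which is why the data matrix grows with the number of category values.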
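And here is a sketch of the "big dictionary" (word-to-number) encoding you
describe for free-text fields like COMMENT, again in Python purely for
illustration, with no stemming or phrase handling:

```python
# Dictionary (word -> integer) encoding sketch for a free-text field.
# Purely illustrative; mlpack itself doesn't provide these helpers.

def build_vocab(texts):
    """Assign each distinct whitespace-separated token an integer id."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(text, vocab):
    """Map a text to the list of ids of its known tokens."""
    return [vocab[w] for w in text.lower().split() if w in vocab]

comments = ["Customer seemed agitated.", "Customer was calm."]
vocab = build_vocab(comments)
print(encode("customer seemed calm.", vocab))  # [0, 1, 4]
```

In practice the vocabulary does get very large, which is exactly the
concern you raise; character-level encoding trades a small dictionary for a
harder modeling problem.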
> Does this information about our fields jog loose any additional ideas?

With text data, dictionary encoding is typically how it's done.
Alternately, the encoding can be done at the character level, so that the
dictionary stays small, but then you need a powerful modeling technique to
learn the dependencies between characters.  I'm not an NLP expert, so I
can't say what will be best for your problem, but something like dictionary
encoding could work.  word2vec could be another interesting preprocessing
technique, but I don't think anyone has implemented that model in mlpack
yet.

My overall suggestion would be this: once you get your data into a numeric
format, mlpack should work just fine for the actual machine learning part,
but mlpack doesn't have the best facilities for text loading and
preprocessing.

Hope this helps!

-- 
Ryan Curtin    | "And do not attempt to grow a brain!"
[email protected] |   - Sgt. Howard Payne

_______________________________________________
mlpack mailing list
[email protected]
http://knife.lugatgt.org/cgi-bin/mailman/listinfo/mlpack
