I agree that we should discuss this more widely here:

1) Could a label be a non-double value (a String, for example)?
2) Should we extend Encoding to non-Double labels (if we work with non-double values)?
3) Should we validate and reject non-double values at the trainer level? (I agree that a lot of casting to Double is ugly.)

From my point of view, we should look at how scikit-learn and Spark ML handle these issues, and then either

1) support all types in labels and fix the problems Ravil describes, or
2) remove the strange generics, hard-code work with double without any casting, and declare our position in the documentation.

I agree that the first approach costs a lot of time.

Tue, 11 Jun 2019 at 00:29, Ravil Galeyev <[email protected]>:

> Hi Team,
>
> I tried to run Ignite ML across a dataset with categorical features and
> came across some problems.
>
> My dataset is the Mushrooms
> <https://www.kaggle.com/uciml/mushroom-classification> dataset from
> Kaggle. It has only categorical features and categorical labels
> (a so-called classification problem). You can find my attempt in my repo
> <https://github.com/dehasi/mushrooms/blob/master/src/main/java/me/dehasi/mushrooms/MushroomsMain.java>.
>
> My goal is to make a pipeline which takes raw string values, encodes them
> to numbers, and then trains a model.
>
> The first problem is the Vectorizer.
>
> I started with DummyVectorizer, but it supports only Double labels.
> All other vectorizers have the same issue because all of them are
> inherited from DefaultLabelVectorizer
> <https://github.com/apache/ignite/blob/master/modules/ml/src/main/java/org/apache/ignite/ml/dataset/feature/extractor/ExtractionUtils.java#L36>
> where Double labels are hardcoded at the generic level.
>
> I didn't find a way to work with purely categorical data using the
> standard Ignite vectorizers, so I wrote my own.
>
> The second problem is EncoderTrainer (in my case STRING_ENCODER).
>
> It doesn't encode labels. The trainer just ignores labels. See
> EncoderTrainer
> <https://github.com/apache/ignite/blob/master/modules/ml/src/main/java/org/apache/ignite/ml/preprocessing/encoding/EncoderTrainer.java#L169>.
>
> Probably ignoring labels makes sense, but…
>
> The third problem is ClassCastException.
>
> There are casts of labels to Double, "hidden" from the user, in model
> trainers, e.g. SVMLinearClassificationTrainer
> <https://github.com/apache/ignite/blob/master/modules/ml/src/main/java/org/apache/ignite/ml/svm/SVMLinearClassificationTrainer.java#L191>,
> DiscreteNaiveBayesTrainer etc.
>
> Feel free to use my regex \(Double\).*\.label\(\) to search for other
> casts.
>
> To sum up, there is an assumption that labels are numeric values, but
> when we solve a classification problem, labels can be anything, and I
> didn't find an easy way to preprocess them.
>
> If you have any questions or need details, feel free to write to me.
>
> Best regards,
> Ravil
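
To make the label question concrete, here is a minimal sketch of what encoding non-double labels before training could look like. LabelIndexer is a hypothetical helper written for this discussion, not an existing Ignite ML class, and it does not use any Ignite API. The "p"/"e" values are only illustrative class labels. It assigns each distinct label a double index in order of first appearance, so trainers that expect Double labels would never see a String.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of Ignite ML): maps arbitrary labels to double indices. */
final class LabelIndexer<L> {
    /** Label -> index, filled in order of first appearance. */
    private final Map<L, Double> toIndex = new LinkedHashMap<>();

    /** Index -> label, used to translate predictions back. */
    private final List<L> toLabel = new ArrayList<>();

    /** Learns the mapping: the first distinct label gets 0.0, the next 1.0, and so on. */
    LabelIndexer<L> fit(Iterable<? extends L> labels) {
        for (L label : labels) {
            if (!toIndex.containsKey(label)) {
                toIndex.put(label, (double) toLabel.size());
                toLabel.add(label);
            }
        }
        return this;
    }

    /** Returns the double index of a label, rejecting labels not seen during fit. */
    double encode(L label) {
        Double idx = toIndex.get(label);
        if (idx == null)
            throw new IllegalArgumentException("Unknown label: " + label);
        return idx;
    }

    /** Returns the original label for an index, rejecting values that are not valid indices. */
    L decode(double index) {
        int i = (int) index;
        if (i != index || i < 0 || i >= toLabel.size())
            throw new IllegalArgumentException("Not a known label index: " + index);
        return toLabel.get(i);
    }

    public static void main(String[] args) {
        LabelIndexer<String> indexer =
            new LabelIndexer<String>().fit(Arrays.asList("p", "e", "e", "p"));

        System.out.println(indexer.encode("e")); // prints 1.0
        System.out.println(indexer.decode(0.0)); // prints p
    }
}
```

The explicit exceptions in encode and decode are one possible answer to question 3: unknown or non-index values are rejected with a clear message instead of failing later with a ClassCastException.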
