I agree that we should discuss this more widely here.

1) Could a label be something other than a double value (a String, for example)?
2) Should we extend Encoding to cover non-Double labels (if we decide to work
with non-double values)?
3) Should we validate and reject non-double values at the trainer level? (I
agree that all the casting to Double is ugly.)
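
For question 3, here is a minimal sketch of what trainer-level validation could look like. The helper `LabelChecks.requireNumericLabel` is hypothetical and is not part of Ignite. It only illustrates replacing a bare `(Double)` cast with a check that fails with a readable message.

```java
/** Hypothetical helper, not an Ignite class: validates a label before a trainer uses it. */
final class LabelChecks {
    private LabelChecks() {
        // Utility class, no instances.
    }

    /** Returns the label as a double, or fails with a descriptive message. */
    static double requireNumericLabel(Object label) {
        if (label instanceof Number)
            return ((Number) label).doubleValue();

        String type = label == null ? "null" : label.getClass().getName();

        // An explicit error instead of a ClassCastException from a hidden cast.
        throw new IllegalArgumentException(
            "Expected a numeric label but got " + type + "; encode labels before training");
    }
}
```

With something like this, a String label would be rejected with an explicit error rather than with a ClassCastException deep inside the trainer.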

From my point of view, we should look at how scikit-learn and Spark ML handle
these issues, and then we should either
1) support all types in labels and fix the things Ravil described above,
or
2) remove the strange generics, hard-code work with double without any casting,
and state our position in the documentation.

I agree that the first approach costs a lot of time.
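
Whichever option we choose, a user can already map string labels to doubles on their side before the data reaches a vectorizer. A minimal sketch of that idea follows. `LabelIndexer` is a hypothetical helper, not an Ignite class, and the codes are simply assigned in order of first appearance.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not an Ignite class: maps categorical labels to double codes. */
final class LabelIndexer {
    private final Map<String, Double> codes = new HashMap<>();
    private final List<String> labels = new ArrayList<>();

    /** Returns the code of a label, assigning the next free code on first sight. */
    double encode(String label) {
        Double code = codes.get(label);

        if (code == null) {
            code = (double) labels.size();
            codes.put(label, code);
            labels.add(label);
        }

        return code;
    }

    /** Returns the original label for a code produced by encode(). */
    String decode(double code) {
        return labels.get((int) code);
    }
}
```

For the Mushrooms dataset this would turn the two class labels into 0.0 and 1.0, which is what the current trainers expect, and `decode` recovers the original string from a prediction.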



Tue, 11 Jun 2019 at 00:29, Ravil Galeyev <[email protected]>:

> Hi Team,
>
> I tried to run Ignite ML across the dataset with categorical features and
> came across some problems.
>
> My dataset is Mushrooms
> <https://www.kaggle.com/uciml/mushroom-classification> dataset from
> Kaggle.
> There are only categorical features and categorical labels (a so-called
> classification problem). You can find my attempt in my repo
> <
> https://github.com/dehasi/mushrooms/blob/master/src/main/java/me/dehasi/mushrooms/MushroomsMain.java
> >
> .
>
> My goal is to make a pipeline which takes raw string values, encodes them
> as numbers, and then trains a model.
>
> The first problem is the Vectorizer.
>
> I started with DummyVectorizer but it supports only Double labels.
>
> All other vectorizers have the same issue because all of them inherit
> from DefaultLabelVectorizer
> <
> https://github.com/apache/ignite/blob/master/modules/ml/src/main/java/org/apache/ignite/ml/dataset/feature/extractor/ExtractionUtils.java#L36
> >
> where Double labels are hardcoded at the generic level.
>
> I didn’t find a way to work with purely categorical data using the standard
> Ignite vectorizers, so I wrote my own.
>
> The second problem is EncoderTrainer (in my case STRING_ENCODER).
>
> It doesn’t encode labels. The trainer just ignores labels. See
> EncoderTrainer
> <
> https://github.com/apache/ignite/blob/master/modules/ml/src/main/java/org/apache/ignite/ml/preprocessing/encoding/EncoderTrainer.java#L169
> >
> .
>
> Probably ignoring labels makes sense, but…
>
> The third problem is ClassCastException.
>
> There are casts of labels to Double, “hidden” from the user, in model
> trainers, e.g. SVMLinearClassificationTrainer
> <
> https://github.com/apache/ignite/blob/master/modules/ml/src/main/java/org/apache/ignite/ml/svm/SVMLinearClassificationTrainer.java#L191
> >,
> DiscreteNaiveBayesTrainer etc.
>
> Feel free to use my regex \(Double\).*\.label\(\) to search for other casts.
>
> To sum up, I can say that there are assumptions that labels are numeric
> values, but if we solve a classification problem, labels can be anything.
>
> But I didn’t find an easy way to preprocess them.
>
>
>
> If you have any questions or need details, feel free to write to me.
>
> Best regards,
>
> Ravil
>
