Baunsgaard commented on pull request #1195: URL: https://github.com/apache/systemds/pull/1195#issuecomment-788323417
> * Please leave the `Y = ifelse(Y <= 0, max_y + 1, Y)` as it is (converting all <= 0 labels to a new maximum label). That way it naturally applies to binary classification, where labels are often given as -1 and 1.
> * Personally, I would remove the support for one-hot encoded labels y in order to reduce the surface of the API. If kept, please move the checks for min/max y inside the respective branch.

I think this is not intuitive and should result in an error; people should preprocess instead of relying on the function to fix it. It is frustrating that the 0th label suddenly becomes the last label. This confused me with MNIST.

Also, the current implementation (and the main reason why I am playing around with this) can contain more or fewer labels when calling predict. If the dataset contains classes 1-5 but, for some reason, the training data only contains 1-4, then the `table` call in the model creates values to classify into 4 labels rather than the 5 intended originally. This sometimes happens if you slice covtype, where the last label is occasionally missing. It is more common in the predict call: if we call with a single data point, say with label 3, then this same way of handling Y would not work, because we end up with 3 columns instead of 5. The predict case can be fixed by looking at the other arguments to the function, but I would rather have the Y input match the input to training.

The most intuitive API here, for me, is:

- if Y has 1 column -> one-hot encode it (do not change negative or 0 values, but stop execution if any are present)
- if Y has more than one column -> assume it is one-hot encoded and verify that all values are 0 or 1

Ideally I would make the multi-column form the default and not allow 1-column inputs.

> * Remove the call `LT = decompress(LT)` (that is exactly why we should disable its use at script level other than in tests)

Oops, debugging code. "Classic."

> * The isNaN(X) returns an empty block if there are no NaN (without allocation) - so its overhead is actually very small.
> The sum can be faster due to multi-threading, but if that is true we might want to introduce the rewrite `sum(isNaN(X)) != 0 --> isNaN(sum(X))` instead of modifying such scripts.

Sure, I mentioned this at the last meeting, and you said the same. I double-checked the code, and you are right: it should not matter in the case where there are no NaN values. If there are some, then the rewrite is faster, but then the execution will crash anyway. With the rewrite I could avoid supporting unary operations in compression, but I guess I will just have to add it.
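To make the column-count pitfall above concrete, here is a minimal NumPy sketch (not SystemDS/DML code; `one_hot` is a hypothetical helper) of how a `table`-style one-hot encoding sizes its output by the maximum label present, so a training slice that happens to miss the top class yields a matrix with too few columns:

```python
import numpy as np

def one_hot(y):
    # Mimics DML's table(seq(1, nrow(y)), y): the width is max(y),
    # not the true number of classes in the full dataset.
    y = np.asarray(y, dtype=int)
    out = np.zeros((y.shape[0], y.max()), dtype=int)
    out[np.arange(y.shape[0]), y - 1] = 1
    return out

y_slice = np.array([1, 2, 3, 4])      # slice where label 5 is missing
y_full  = np.array([1, 2, 3, 4, 5])

print(one_hot(y_slice).shape)  # (4, 4) -- only 4 label columns
print(one_hot(y_full).shape)   # (5, 5) -- the intended 5 columns
```

The single-data-point predict case is the same problem in the extreme: a lone label 3 would produce a 3-column matrix instead of 5.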
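The `sum(isNaN(X)) != 0 --> isNaN(sum(X))` rewrite can be illustrated in NumPy (where NaN likewise propagates through `sum`); note one edge case the rewrite would have to guard, which this sketch merely points out:

```python
import numpy as np

X = np.array([[1.0, 2.0], [np.nan, 4.0]])

# Original check: materializes a full indicator matrix, then reduces it.
has_nan = np.isnan(X).sum() != 0
# Rewritten check: a single reduction, then a scalar test;
# NaN propagates through the sum.
has_nan_rewrite = np.isnan(X.sum())
assert has_nan == has_nan_rewrite  # both True for this input

# Caveat: the two are not equivalent if X mixes +inf and -inf,
# since inf + (-inf) yields NaN even though X contains no NaN.
Z = np.array([np.inf, -np.inf])
print(np.isnan(Z).sum() != 0, np.isnan(Z.sum()))  # False True
```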
