Baunsgaard commented on pull request #1195:
URL: https://github.com/apache/systemds/pull/1195#issuecomment-788323417


   
   >     * Please leave the `Y = ifelse(Y <= 0, max_y + 1, Y)` as it is 
(converting all <=0 labels to a new maximum label). That way it naturally 
applies to binary classification where labels are often given as -1 and 1
   >     * Personally, I would remove the support for one-hot encoded labels y 
in order to reduce the surface of the API. If kept, please move the checks for 
min/max y inside the respective branch
   
   I think this is not intuitive and should result in an error; people should 
preprocess instead of relying on the function to fix it.
   It is frustrating that the 0 label suddenly becomes the last label; this 
confused me with MNIST.
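   To illustrate, a minimal DML sketch of the remapping I mean (treating `max_y` as the maximum of the original labels is my assumption about the builtin's internals):
   
   ```
   # sketch of the current behavior, not the exact builtin code
   max_y = max(Y);
   Y = ifelse(Y <= 0, max_y + 1, Y);  # MNIST: digit 0 (max_y = 9) silently becomes label 10, the last label
   ```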
   Also, this current implementation (and the main reason why I'm playing around 
with this) can end up with more or fewer labels when calling predict.
   
   If the dataset contains classes 1-5 but, for some reason, the training data 
only contains 1-4, then the call to table creates values to classify into 4 
labels rather than the 5 intended originally. This sometimes happens if you 
slice covtype, where the last label is sometimes missing.
   More common is the predict call: if we call it with a single data point, say 
with label 3, then this same way of handling Y would not work because we end up 
with 3 columns instead of 5. The predict can be fixed by looking at the other 
arguments to the function, but I would rather have the Y input match the input 
to training.
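   A small DML sketch of the column-count problem (the variable names and the use of `table` for one-hot encoding are assumptions, not the exact script):
   
   ```
   # one-hot encoding via table(): the output has max(Y_train) columns
   Y_onehot = table(seq(1, nrow(Y_train)), Y_train);  # only 4 columns if class 5 is absent
   # possible workaround: pass the intended output dimensions explicitly
   Y_onehot = table(seq(1, nrow(Y_train)), Y_train, nrow(Y_train), num_classes);
   ```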
   
   The most intuitive API here, for me, is: 
   
   - if Y has 1 column -> one-hot encode (do not change negative or 0 values, 
but stop execution if any are present)
   - if Y has more than one column -> assume it is one-hot encoded and verify 
that all values are 0 or 1
   
   Ideally I would make the one-hot form the default and not allow 1-column 
inputs.
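   A hedged sketch of that validation in DML (the error messages and exact checks are illustrative, not a proposed final implementation):
   
   ```
   if( ncol(Y) == 1 ) {
     if( min(Y) <= 0 )
       stop("labels must be positive integers; preprocess Y before calling");
     Y = table(seq(1, nrow(Y)), Y);  # one-hot encode
   }
   else {
     # multi-column input: verify every entry is exactly 0 or 1
     if( sum(Y != 0 & Y != 1) != 0 )
       stop("multi-column Y must be one-hot encoded (values 0 or 1)");
   }
   ```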
   
   >     * Remove the call `LT = decompress(LT)` (that is exactly why we should 
disable its use at script level other than in tests)
   
   Oops, debugging code. "Classic"
   
   >     * The isNaN(X) returns an empty block if there are no NaN (without 
allocation) - so its overhead is actually very small. The sum can be faster 
due to multi-threading, but if that is true we might want to introduce the 
rewrite `sum(isNaN(X))!=0 --> isNaN(sum(X))` instead of modifying such scripts.
   
   Sure, I mentioned this last meeting, and you said the same. I double-checked 
the code... and you are right: it should not matter in the case where there are 
no NaN values. If there are some, then the rewrite is faster, but then the 
execution will crash anyway. With the rewrite I could avoid supporting unary 
operations in compression, but I guess I will just have to add it.
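   For reference, the two variants side by side (a sketch; `has_nan` is just an illustrative name):
   
   ```
   has_nan = sum(isNaN(X)) != 0;  # current script: elementwise NaN indicator, then sum
   has_nan = isNaN(sum(X));       # rewritten: NaN propagates through the sum, no indicator matrix
   ```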
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

