Baunsgaard commented on a change in pull request #1323:
URL: https://github.com/apache/systemds/pull/1323#discussion_r667795868



##########
File path: src/main/python/tests/examples/tutorials/test_adult.py
##########
@@ -386,51 +386,35 @@ def test_level2(self):
 
         """""
         
################################################################################################################
-        X1, M1 = X1.transform_encode(spec=jspec).compute()
+        X1, M1 = X1.transform_encode(spec=jspec)
 
         
################################################################################################################
         """"
-        First we re-split out data into a training and a test set with the 
corresponding labels. We can then simply transform
-        the numpy array of the training data back to SystemDS matrix by using 
"sds.from_numpy()". 
-        The SystemDS scale function takes a matrix as an input and returns 
three output parameters:
-            # Y            Matrix    ---      Output feature matrix with K 
columns
-            # ColMean      Matrix    ---      The column means of the input, 
subtracted if Center was TRUE
-            # ScaleFactor  Matrix    ---      The Scaling of the values, to 
make each dimension have similar value ranges
-        If we want to retransform a SystemDs Matrix to a Numpy array we can do 
so by using the np.array() function. 
+        First we re-split out data into a training and a test set with the 
corresponding labels. 
         """""
         
################################################################################################################
-        col_length = len(X1[0])
-        X = X1[0:train_count, 0:col_length - 1]
-        Y = X1[0:train_count, col_length - 1:col_length].flatten()
-        # Test data
-        Xt = X1[train_count:train_count + test_count, 0:col_length - 1]
-        Yt = X1[train_count:train_count + test_count, col_length - 
1:col_length].flatten()
+        PREPROCESS_package = self.sds.source(self.preprocess_src_path, 
"preprocess", print_imported_methods=True)
 
+        X = PREPROCESS_package.get_X(X1, train_count)
+        Y = PREPROCESS_package.get_Y(X1, train_count)
+        #We lose the column count information after using the Preprocess 
Package. This triggers an error on multilogregpredict. Otherwise its working
+        Xt = self.sds.from_numpy(np.array(PREPROCESS_package.get_Xt(X1, 
train_count).compute()))

Review comment:
       If I understand correctly:
   
   After you turn anything into a matrix in the system, like X and Y, you no 
longer know how many rows and columns it has.
   This is expected: materializing the row and column counts in the Python 
API would require processing, and we only evaluate after compute().
   Once you have the result back from compute(), the NumPy array has the 
correct number of rows and columns, but the dimensions of intermediates are 
not known.
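
A toy sketch of the behaviour described above, in plain Python. The `LazyMatrix` class and its `rows` method are hypothetical illustrations, not part of the SystemDS Python API; the point is only that a lazily built intermediate has no known shape until it is computed.

```python
# Toy model of lazy evaluation. LazyMatrix is a hypothetical class,
# not the SystemDS API.
class LazyMatrix:
    def __init__(self, thunk):
        self._thunk = thunk   # deferred computation that produces the rows
        self.shape = None     # unknown until compute() runs

    def rows(self, start, stop):
        # Builds a new lazy node; nothing is evaluated here.
        return LazyMatrix(lambda: self.compute()[start:stop])

    def compute(self):
        # Evaluate the deferred computation and record the dimensions.
        result = self._thunk()
        n_cols = len(result[0]) if result else 0
        self.shape = (len(result), n_cols)
        return result


data = LazyMatrix(lambda: [[1, 2, 3], [4, 5, 6], [7, 8, 9]])
train = data.rows(0, 2)

shape_before = train.shape   # dimensions not known for the intermediate
values = train.compute()     # evaluation happens here
shape_after = train.shape    # dimensions known once the result is back
```

In this sketch, any code that needs the column count of `train` has to wait for `compute()`, which mirrors why the column information is missing on intermediates.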

##########
File path: src/main/python/tests/examples/tutorials/test_adult.py
##########
@@ -387,6 +387,11 @@ def test_level2(self):
         
################################################################################################################
         X1, M1 = X1.transform_encode(spec=jspec)
 
+        # better alternative for encoding
+        # X1, M = F1.transform_encode(spec=jspec)
+        # X2 = F2.transform_apply(spec=jspec, meta=M)
+        # testX2 = X2.compute(True)

Review comment:
       Two things here.
   
   1. Yes, if the training and test data contain a different number of 
labels, transform_apply does not work, and there is currently no way around 
this. Do you really have 50k different classes in one of the features? I think 
you might be using the wrong encoding scheme for some of the columns.
   2. Frames should be supported, but they are very new, so there are bound to 
be bugs. A function definition should specify frame if its input type is a 
frame; otherwise you should not call that function with frames. Could you tell 
me which function you are trying to call? Then I can try to fix it if there is 
a bug.
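
To make point 1 concrete, here is a minimal plain-Python sketch of recode-style encoding. The helpers `fit_recode` and `apply_recode` are hypothetical stand-ins, not the SystemDS implementation of transform_encode or transform_apply, and the sample labels are illustrative. Returning `None` for an unseen label is an assumption of this sketch; how SystemDS handles that case is not shown here.

```python
def fit_recode(values):
    """Assign 1-based integer codes to distinct values in first-seen order."""
    meta = {}
    for v in values:
        if v not in meta:
            meta[v] = len(meta) + 1
    return meta


def apply_recode(values, meta):
    """Encode with existing metadata; values missing from meta map to None."""
    return [meta.get(v) for v in values]


# Illustrative labels: the test split spells its labels differently.
train_labels = ["<=50K", ">50K", "<=50K"]
test_labels = ["<=50K.", ">50K.", "<=50K."]

meta = fit_recode(train_labels)                  # built from training data only
encoded_train = apply_recode(train_labels, meta)
encoded_test = apply_recode(test_labels, meta)   # no test label is in meta
```

Metadata fitted on one split cannot encode labels that appear only in the other split, so the label sets have to agree before the metadata is reused.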

##########
File path: src/main/python/tests/examples/tutorials/test_adult.py
##########
@@ -387,6 +387,11 @@ def test_level2(self):
         
################################################################################################################
         X1, M1 = X1.transform_encode(spec=jspec)
 
+        # better alternative for encoding
+        # X1, M = F1.transform_encode(spec=jspec)
+        # X2 = F2.transform_apply(spec=jspec, meta=M)
+        # testX2 = X2.compute(True)

Review comment:
       1. Okay, now I understand the problem: a classic "someone made an 
error when making the dataset".
   2. Since you have this issue, I just extended frames to support the 
replace operation. Simply use
   
   replace(target=X, pattern="<=50K.", replacement="<=50K")
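
The effect of such a replace can be sketched in plain Python. The helper `replace_in_rows` is hypothetical and only mimics the idea on a list of string rows; it is not the SystemDS operation. The second call, for ">50K.", is an assumption that the other label carries the same trailing period.

```python
def replace_in_rows(rows, pattern, replacement):
    """Return a copy of rows with pattern replaced in every string cell."""
    return [[cell.replace(pattern, replacement) for cell in row] for row in rows]


# Illustrative rows: an age column and a label column with trailing periods.
test_rows = [["39", "<=50K."], ["50", ">50K."]]

cleaned = replace_in_rows(test_rows, "<=50K.", "<=50K")
cleaned = replace_in_rows(cleaned, ">50K.", ">50K")   # assumed second label
```

After this cleanup the test labels match the training labels, so metadata from encoding the training data can be reused.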
   

##########
File path: src/main/python/tests/examples/tutorials/test_adult.py
##########
@@ -387,6 +387,11 @@ def test_level2(self):
         
################################################################################################################
         X1, M1 = X1.transform_encode(spec=jspec)
 
+        # better alternative for encoding
+        # X1, M = F1.transform_encode(spec=jspec)
+        # X2 = F2.transform_apply(spec=jspec, meta=M)
+        # testX2 = X2.compute(True)

Review comment:
       Well, I guess I did not add it to the Python API... will do.
   

##########
File path: src/main/python/tests/examples/tutorials/test_adult.py
##########
@@ -387,6 +387,11 @@ def test_level2(self):
         
################################################################################################################
         X1, M1 = X1.transform_encode(spec=jspec)
 
+        # better alternative for encoding
+        # X1, M = F1.transform_encode(spec=jspec)
+        # X2 = F2.transform_apply(spec=jspec, meta=M)
+        # testX2 = X2.compute(True)

Review comment:
       It should be there now. If you have a matrix or a frame, simply call 
.replace(pattern, replacement) on it.



