AndrewZhaoLuo commented on code in PR #11479:
URL: https://github.com/apache/tvm/pull/11479#discussion_r887226091


##########
python/tvm/testing/autotvm.py:
##########
@@ -62,9 +62,12 @@ def matmul(N, L, M, dtype):
 
     # schedule according to config
     yo, yi = cfg["tile_y"].apply(s, C, y)
-    xo, xi = cfg["tile_x"].apply(s, C, x)
+    if cfg["tile_x"].size[-1] > 1:
+        xo, xi = cfg["tile_x"].apply(s, C, x)

Review Comment:
   Can you give some context on this in the code? Is this to hit the previously failing case?



##########
python/tvm/autotvm/tuner/xgboost_cost_model.py:
##########
@@ -243,18 +243,27 @@ def fit_log(self, records, plan_size, min_seed_records=500):
         else:
             raise RuntimeError("Invalid feature type: " + self.fea_type)
         result = pool.map_with_error_catching(feature_extract_func, data)
+        result = list(result)  # store results so we can iterate through them twice
 
-        # filter out feature with different shapes
-        fea_len = len(self._get_feature([0])[0])
+        # get maximum feature length
+        fea_len = -1
+        for res in result:
+            if res.status != StatusKind.COMPLETE:
+                continue
+            x, _ = res.value
+            fea_len = max(fea_len, x.shape[0])
 
         xs, ys = [], []
         for res in result:
             if res.status != StatusKind.COMPLETE:
                 continue
             x, y = res.value
-            if len(x) == fea_len:

Review Comment:
   Hmm, I think I previously glossed over the significance of this. Before, it would only use features with the same shape as the first entry to fit the model, so basically loops with the same number of axes. So this is actually quite a major change, is it not?
   
   The model is at the task level, so how did it handle configs with different-size features between training rounds?
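   To make sure I'm reading the diff right, here is a toy sketch of the behavioral change as I understand it (hypothetical arrays, not the actual TVM feature vectors):
   
   ```python
   import numpy as np
   
   # Features extracted for different configs of one task; lengths can differ
   # when schedules have different loop structures.
   feats = [np.ones(4), np.ones(6), np.ones(4)]
   
   # Old behavior: keep only features whose length matches the first entry,
   # silently dropping the rest.
   ref_len = len(feats[0])
   kept = [f for f in feats if len(f) == ref_len]  # the length-6 feature is dropped
   
   # New behavior: zero-pad every feature up to the maximum observed length,
   # so nothing is dropped.
   max_len = max(len(f) for f in feats)
   padded = [np.pad(f, (0, max_len - len(f))) for f in feats]
   ```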



##########
python/tvm/autotvm/tuner/xgboost_cost_model.py:
##########
@@ -329,15 +338,16 @@ def _get_feature(self, indexes):
             for i, fea in zip(need_extract, feas):
                fea_cache[i] = fea.value if fea.status == StatusKind.COMPLETE else None
 
-        feature_len = None
+        feature_len = -1
         for idx in indexes:
             if fea_cache[idx] is not None:
-                feature_len = fea_cache[idx].shape[-1]
-                break
+                feature_len = max(fea_cache[idx].shape[-1], feature_len)
 
         ret = np.empty((len(indexes), feature_len), dtype=np.float32)
         for i, ii in enumerate(indexes):
             t = fea_cache[ii]
+            if t.shape[0] < feature_len:
+                t = np.pad(t, (0, feature_len - t.shape[0]))

Review Comment:
   I'm wondering if you've given thought on whether 0 is the most appropriate 
padding value. I know at this stage we don't have good insight into the actual 
corresponding features, but I am worried that 0 for some features may have some 
meaning we don't want. 
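   For example, an explicit sentinel would at least keep padding distinguishable from real zero-valued features. A minimal sketch (the -1 sentinel here is arbitrary, just for illustration):
   
   ```python
   import numpy as np
   
   t = np.array([1.0, 2.0, 3.0], dtype=np.float32)
   feature_len = 5
   
   # Current diff: np.pad defaults to mode="constant" with value 0, which can
   # collide with genuinely zero feature values.
   zero_padded = np.pad(t, (0, feature_len - t.shape[0]))
   
   # Alternative: pad with an out-of-range sentinel so the model can tell
   # padded positions apart from real features.
   sentinel_padded = np.pad(t, (0, feature_len - t.shape[0]), constant_values=-1.0)
   ```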




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to