Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    @smurching I found some issues and have some thoughts on the columnar
feature format:
    
    - In your doc you say, "Specifically, we only need to store sufficient
stats for each bin of a single feature, as opposed to each bin of every
feature." However, the current implementation still allocates space for all
features during computation: see the `DTStatsAggregator` implementation. You
pass `featureSubset = None`, so `DTStatsAggregator` allocates space for every
feature. To match the stated design, you should pass `featureSubset =
Some(Array(currentFeatureIndex))`.
    
    - The current implementation still uses binnedFeatures. You said that in
the future it will be improved to sort feature values for continuous features
(for more precise tree training). If you want to consider every possible
threshold, you need to hold rawFeatures instead of binnedFeatures in the
columnar feature array, and within each split offset range you need to sort
every continuous feature. Is this what you plan to do in the future? Note that
it will increase the amount of computation.
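
    To make the extra work concrete, here is a small sketch (assumed names,
not the PR's code) of what considering every possible threshold implies: the
raw values inside a node's row range must be sorted, and candidate thresholds
are the midpoints between consecutive distinct values.

```scala
// Illustrative sketch: enumerate all candidate thresholds for one continuous
// feature column within a node's row range [from, until).
object ThresholdSketch {
  def candidateThresholds(raw: Array[Double], from: Int, until: Int): Array[Double] = {
    val sorted = raw.slice(from, until).sorted      // O(n log n) per node/feature
    sorted.toSeq.distinct.sliding(2).collect {
      case Seq(a, b) => (a + b) / 2.0               // midpoint between neighbors
    }.toArray
  }

  def main(args: Array[String]): Unit = {
    val col = Array(3.0, 1.0, 2.0, 2.0, 5.0)
    val ts = candidateThresholds(col, 0, col.length)
    println(ts.mkString(", ")) // 1.5, 2.5, 4.0
    assert(ts.sameElements(Array(1.5, 2.5, 4.0)))
  }
}
```

    With binning, the sort happens once up front; with raw values it must be
repeated inside every node's offset range, which is where the added cost
comes from.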
    
    - For the current implementation (using binnedFeatures), there is no need
to sort continuous features inside each split offset, so the `indices` array
is exactly the same for every feature. To save memory, these `indices` should
be shared; there is no need to create a separate indices array for each
feature. Even if you add the continuous-feature improvement mentioned above,
you only need separate `indices` arrays for continuous features; the
categorical features can still share a single `indices` array.
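
    The sharing idea can be sketched like this (toy types, not the PR's
classes): all categorical columns reference one shared indices array, and
only a sorted continuous column gets its own permutation.

```scala
// Toy sketch: categorical columns share one indices array; a per-node-sorted
// continuous column needs its own permutation of the same indices.
object SharedIndicesSketch {
  final case class Column(values: Array[Int], indices: Array[Int])

  def main(args: Array[String]): Unit = {
    val shared = Array(0, 1, 2, 3)
    val cat1 = Column(Array(1, 0, 2, 1), shared)
    val cat2 = Column(Array(0, 0, 1, 1), shared)
    // Both categorical columns point at the same array object: no extra copy.
    assert(cat1.indices eq cat2.indices)

    // A continuous column sorted within the node gets its own indices array.
    val contValues  = Array(0.7, 0.1, 0.5, 0.9)
    val contIndices = shared.sortBy(i => contValues(i)) // fresh array
    assert(!(contIndices eq shared))
    println(contIndices.mkString(",")) // 1,2,0,3
  }
}
```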
    
    - About the locality advantage of the columnar format, I have some doubts.
In the current implementation you do not reorder the `label` and `weight`
arrays, so accessing a `label` or `weight` value requires going through
`indices`; when calculating `DTStat`, this breaks locality. (I'm not sure how
much this impacts performance, though.)
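
    The access pattern in question looks roughly like this (toy names, not
the PR's code): a sequential walk over a column turns into a random-access
gather on the unreordered `label` array.

```scala
// Toy sketch: label stays in original row order, so each column-order access
// dereferences through indices, i.e. a gather rather than a sequential scan.
object GatherSketch {
  def main(args: Array[String]): Unit = {
    val indices = Array(3, 0, 2, 1)         // column order after node splits
    val label   = Array(1.0, 0.0, 1.0, 0.0) // still in original row order
    val gathered = indices.map(i => label(i)) // label(3), label(0), ...
    println(gathered.mkString(","))
    assert(gathered.sameElements(Array(0.0, 1.0, 1.0, 0.0)))
  }
}
```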
    
    - About the overhead of the columnar format: when reordering (after we
get a new split, we need to move the left sub-tree samples to the front), you
need to reorder each column and, at the same time, update the `indices`
array. If we used a row format instead, such as
    `Array[(features, label, weight)]`, reordering would be much easier and
would not need indices.
    So I am wondering whether we can use a row format, and only when we need
the `DTStatsAggregator` computation, copy the required data from the row
format into a columnar array (we only need to copy the rows between the
sub-node offsets, and only the sampled features if feature subsampling is
used).
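
    A minimal sketch of this hybrid idea (hypothetical types and names): rows
are kept as tuples so a split is a simple partition with no indices, and only
the node's row range for the sampled features is copied into column arrays
for stats computation.

```scala
// Hypothetical sketch of the row-to-column copy proposed above.
object RowToColumnSketch {
  type Row = (Array[Int], Double, Double) // (binnedFeatures, label, weight)

  // Copy rows [from, until) into one column array per sampled feature.
  def toColumns(rows: Array[Row], from: Int, until: Int,
                sampledFeatures: Array[Int]): Array[Array[Int]] =
    sampledFeatures.map { f =>
      Array.tabulate(until - from)(i => rows(from + i)._1(f))
    }

  def main(args: Array[String]): Unit = {
    val rows: Array[Row] = Array(
      (Array(0, 5), 1.0, 1.0),
      (Array(1, 4), 0.0, 1.0),
      (Array(2, 3), 1.0, 1.0))
    // Suppose the left child's row range after a split is [0, 2); copy only
    // sampled feature 1 into a column for the stats pass.
    val cols = toColumns(rows, 0, 2, Array(1))
    println(cols(0).mkString(",")) // 5,4
    assert(cols(0).sameElements(Array(5, 4)))
  }
}
```

    The copy is bounded by (rows in node) x (sampled features), so it only
pays the columnar layout cost where the stats computation actually benefits
from it.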


