Github user VinceShieh commented on the issue:
https://github.com/apache/spark/pull/17894
Sorry for the late update!
We tested this PR against the current implementation on both dense and
sparse (0.95 sparsity) datasets:



The single-machine test was run with 100 samples at each feature-set
scale, and we see a performance gain (less training time) on both the dense
and the sparse dataset. In the distributed case we can also achieve good
performance with some fine tuning (num_cores, data partitions, etc.), but
this change inevitably puts more pressure on memory and will cause GC
problems if a worker node does not have enough memory. For the sparse
dataset on a distributed cluster we are still unable to get a good result,
so maybe we should bypass this change for the sparse case. Before making
such a change, though, I'd like to hear your thoughts on the current test
results; maybe we can make this a better PR with your input :)
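For context, here is a minimal sketch of the kind of timing comparison described above (dense vs. sparse training with a configurable partition count). The file paths, the choice of LogisticRegression, and the numbers are illustrative assumptions, not the actual benchmark setup used for these results:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.SparkSession

object TrainTimeBenchmark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("train-time-benchmark").getOrCreate()

    // Tune together with executor cores and memory; too few partitions on a
    // memory-hungry change can trigger the GC pressure mentioned above.
    val numPartitions = 200

    // Hypothetical LibSVM inputs; "sparse.libsvm" would hold ~0.95 sparsity features.
    val dense  = spark.read.format("libsvm").load("data/dense.libsvm")
      .repartition(numPartitions).cache()
    val sparse = spark.read.format("libsvm").load("data/sparse.libsvm")
      .repartition(numPartitions).cache()
    dense.count(); sparse.count()  // materialize the caches before timing

    val lr = new LogisticRegression().setMaxIter(20)

    // Simple wall-clock timer around each fit.
    def time[T](label: String)(f: => T): T = {
      val start = System.nanoTime()
      val result = f
      println(s"$label: ${(System.nanoTime() - start) / 1e9} s")
      result
    }

    time("dense training")(lr.fit(dense))
    time("sparse training")(lr.fit(sparse))

    spark.stop()
  }
}
```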
Thanks.