Github user ssimeonov commented on the pull request:
https://github.com/apache/spark/pull/12498#issuecomment-211949147
When the "hashing trick" is used in practice, it is important to do things
such as monitor, manage or randomize collisions. If that show there are
problems with the chosen hashing approach, then one would experiment with the
hashing function. All this suggests that a hashing function should be treated
as an object with a simple interface, perhaps as simple as `Function1[Any,
Int]`. Collision monitoring can then be performed with a decorator with an
accumulator. Collision management would be performed by varying the seed or
adding salt. Collision randomization would be performed by varying the
seed/salt with each run and/or running multiple models in production which are
identical expect for the different seed/salt used.
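To make that concrete, here is a minimal Scala sketch of the shape I have in mind. The names `SeededHasher` and `CollisionMonitor` are hypothetical, `MurmurHash3` stands in for whatever hash is actually chosen, and the local collision counter stands in for a Spark accumulator:

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical names for illustration; not an existing MLlib API.
// A seedable hashing function exposed through the Function1[Any, Int]
// interface. Varying the seed (or salting the input) is how collisions
// would be managed or randomized across runs.
class SeededHasher(seed: Int) extends (Any => Int) {
  def apply(term: Any): Int = MurmurHash3.stringHash(term.toString, seed)
}

// Decorator that folds the raw hash into numBuckets feature indices and
// counts observed collisions. In a real Spark job the counter would be
// an accumulator rather than a local var.
class CollisionMonitor(underlying: Any => Int, numBuckets: Int)
    extends (Any => Int) {
  private val seen = scala.collection.mutable.Map.empty[Int, Any]
  var collisions: Long = 0L

  def apply(term: Any): Int = {
    // Non-negative modulo, since the underlying hash may be negative.
    val bucket = ((underlying(term) % numBuckets) + numBuckets) % numBuckets
    seen.get(bucket) match {
      case Some(other) if other != term => collisions += 1 // distinct term, same bucket
      case None                         => seen(bucket) = term
      case _                            => // same term re-hashed; not a collision
    }
    bucket
  }
}

object Demo extends App {
  val hasher = new CollisionMonitor(new SeededHasher(seed = 42), numBuckets = 1 << 18)
  val idx = hasher("user_id=12345")
  println(s"bucket=$idx, collisions so far=${hasher.collisions}")
}
```

Because both classes are just `Function1[Any, Int]`, they compose freely: swapping the hash, changing the seed, or stacking monitoring on top requires no change in the code that consumes the function.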
The hashing trick is very important in ML and quite... tricky... to get
working well for the complex, high-dimensional spaces Spark is perfect for. An
implementation that does not treat the hashing function as a first-class object
would substantially hinder MLlib's capabilities in practice.