Github user ssimeonov commented on the pull request:

    https://github.com/apache/spark/pull/12498#issuecomment-211949147
  
    When the "hashing trick" is used in practice, it is important to do things 
such as monitor, manage or randomize collisions. If that show there are 
problems with the chosen hashing approach, then one would experiment with the 
hashing function. All this suggests that a hashing function should be treated 
as an object with a simple interface, perhaps as simple as `Function1[Any, 
Int]`. Collision monitoring can then be performed with a decorator with an 
accumulator. Collision management would be performed by varying the seed or 
adding salt. Collision randomization would be performed by varying the 
seed/salt with each run and/or running multiple models in production which are 
identical expect for the different seed/salt used.
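    A minimal sketch of what this could look like, assuming hypothetical
names (`SeededHasher`, `CollisionMonitor`) and a plain local map standing in
for what would be a Spark accumulator in a distributed setting:

    ```scala
    import scala.util.hashing.MurmurHash3

    // A seeded hashing function as a first-class object: just a Function1[Any, Int].
    // Varying `seed` is the knob for collision management/randomization.
    class SeededHasher(seed: Int) extends (Any => Int) {
      def apply(term: Any): Int = MurmurHash3.stringHash(term.toString, seed)
    }

    // Collision-monitoring decorator: wraps any Any => Int and counts how often
    // a term lands in a bucket already claimed by a different term. In a real
    // Spark job the counter would be an accumulator, not a local var.
    class CollisionMonitor(underlying: Any => Int, numBuckets: Int)
        extends (Any => Int) {
      private val firstSeen = scala.collection.mutable.Map.empty[Int, Any]
      var collisions: Long = 0L

      def apply(term: Any): Int = {
        val bucket = ((underlying(term) % numBuckets) + numBuckets) % numBuckets
        firstSeen.get(bucket) match {
          case Some(prev) if prev != term => collisions += 1  // a genuine collision
          case None                       => firstSeen(bucket) = term
          case _                          => // same term re-hashed; not a collision
        }
        bucket
      }
    }

    // Usage: hash a few terms and inspect the collision count. Re-running with
    // a different seed is the randomization strategy described above.
    val monitored = new CollisionMonitor(new SeededHasher(seed = 42), numBuckets = 1 << 10)
    Seq("spark", "mllib", "hashing", "trick").foreach(monitored)
    println(s"collisions observed: ${monitored.collisions}")
    ```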
    
    The hashing trick is very important in ML and quite... tricky... to get
working well for complex, high-dimensional spaces, which Spark is perfect for.
An implementation that does not treat the hashing function as a first-class
object would substantially hinder MLlib's capabilities in practice.

