Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/15148
  
    At a high level I like the idea here and the work that's gone into a 
unified interface. A few comments:
    
    #### Data types
    I'm not that keen on mixing up the input data types between `Vector`, 
`Array[Double]` and (later) `Array[Boolean]`. I think we should stick with 
`Vector` throughout.
    
    For `MinHash` what is the thinking behind `Array[Double]` rather than 
`Vector`?
    
    I can see that for the binary case (i.e. Hamming distance) `Array[Boolean]` is attractive for type safety, but I still think a `Vector` interface is more natural.
    
    In both cases the input could be sparse, right? So forcing arrays as input 
can have some space implications. `Vector` also neatly allows supporting dense 
and sparse cases.
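
    For example, a minimal sketch of encoding a binary set as a sparse `Vector` (using the existing `org.apache.spark.ml.linalg.Vectors` factory; the set contents here are made up):

    ```scala
    import org.apache.spark.ml.linalg.Vectors

    // A binary "set" {2, 5, 9} over a universe of 10 elements. Only the
    // non-zero indices are stored, so large sparse inputs stay compact,
    // whereas an Array[Boolean] is always O(universe size).
    val asSparse = Vectors.sparse(10, Array(2, 5, 9), Array(1.0, 1.0, 1.0))

    // The same input as a dense vector, for small or dense cases:
    val asDense = Vectors.dense(0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0)
    ```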
    
    #### NN search
    It seems to me that, while technically this is a "transformer" to a low-dimensional representation (so `transform` outputs the lower-dimensional vectors), the main use case is either approximate NN (i.e. top-k) or the similarity join. (Correct me if I'm wrong, but generally the low-dimensional vectors are not used as inputs to some downstream model, as is the case for PCA / SVD etc, but rather for the approximate similarity search.)
    
    For `approxNearestNeighbors`, a common use case is recommendations: efficiently computing top-k recommendations across an entire dataset. This can't easily be achieved with a self `approxSimilarityJoin`, because we usually want up to `k` recommended items (or most similar items) per input, and there is no obvious way to choose a similarity threshold that yields `k` results; it's data dependent.
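
    To make the bottleneck concrete, a rough sketch (the single-key `approxNearestNeighbors(dataset, key, k)` call is my reading of the API in this PR; `model`, `itemsDF` and the query vectors are all illustrative):

    ```scala
    import org.apache.spark.ml.linalg.{Vector, Vectors}

    // Illustrative query keys; in the recommendation case these would be
    // all user (or item) factor vectors, i.e. potentially millions.
    val queryVectors: Seq[Vector] = Seq(
      Vectors.dense(0.1, 0.2, 0.3),
      Vectors.dense(0.4, 0.5, 0.6)
    )

    // Hypothetical driver-side loop: one ANN call per key. This is the
    // only way to get top-k per query with a single-key API, and it
    // clearly won't scale to millions of keys.
    val topKPerQuery = queryVectors.map { key =>
      model.approxNearestNeighbors(itemsDF, key, 10)
    }
    ```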
    
    So I do think we need an efficient way to do `approxNearestNeighbors` over a DataFrame of inputs rather than only one key at a time. I'd like to see this applied to top-k prediction with `ALSModel`, as that would enable efficient prediction (and make cross-validation on ranking metrics feasible). The current approach, applied to, say, computing the top-k most similar items for each of 1 million items, would not, I think, scale. Perhaps either the ANN approach can be extended to multiple inputs, or the similarity join can be extended to also handle `k` neighbors per item rather than a similarity threshold (a rough sketch of the latter follows).
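
    For the "`k` neighbors per item" variant, the join output could at least be trimmed with a window function; a minimal sketch (assuming the join output has columns `idA`, `idB` and `distCol`, which are illustrative names):

    ```scala
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, row_number}

    // Keep only the k nearest neighbours per left-hand item, rather than
    // relying on a data-dependent distance threshold.
    def topKPerItem(joined: DataFrame, k: Int): DataFrame = {
      val byDistance = Window.partitionBy("idA").orderBy(col("distCol"))
      joined
        .withColumn("rank", row_number().over(byDistance))
        .filter(col("rank") <= k)
        .drop("rank")
    }
    ```

    Of course this only trims the join output after the fact; the join itself still needs a threshold to generate candidates, so a native `k`-based join would be more efficient.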
    
    I'd be interested to hear about your other use cases - is it mainly the similarity join, or really doing ANN on only one item at a time?

