Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/15148
At a high level I like the idea here and the work that's gone into a
unified interface. A few comments:
#### Data types
I'm not that keen on mixing up the input data types between `Vector`,
`Array[Double]` and (later) `Array[Boolean]`. I think we should stick with
`Vector` throughout.
For `MinHash`, what is the thinking behind `Array[Double]` rather than
`Vector`?
I can see that for the binary case (i.e. Hamming distance) `Array[Boolean]` is
attractive from a type-safety standpoint, but I still think a `Vector`
interface is more natural.
In both cases the input could be sparse, right? So forcing arrays as input
can have space implications, whereas `Vector` neatly supports both the dense
and sparse cases.
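For illustration, here is a minimal sketch of how a single `Vector` interface covers both representations (the `hashFunction` below is hypothetical, just to show the shape of the API, not the implementation in this PR):

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}

// Hypothetical hash function taking a Vector, so callers never have to
// choose between Array[Double] and Array[Boolean] representations.
def hashFunction(v: Vector): Vector = {
  // model-specific hashing would go here; identity is just a placeholder
  v
}

// Dense input, e.g. for random-projection style hashing
val dense = Vectors.dense(1.0, 0.0, 3.5)

// Sparse binary input, e.g. a large set-membership vector for MinHash:
// only the non-zero (set) indices need to be stored
val sparseBinary = Vectors.sparse(1000000, Array(3, 17, 4096), Array(1.0, 1.0, 1.0))

hashFunction(dense)
hashFunction(sparseBinary)
```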
#### NN search
It seems to me that, while technically this is a "transformer" to a
low-dimensional representation (so `transform` outputs the lower-dimensional
hash vectors), the main use case is either approximate NN search (i.e. top-k)
or the similarity join. (Correct me if I'm wrong, but generally the
low-dimensional vectors are not fed into some downstream model, as is the case
for PCA / SVD etc., but rather used for the approximate similarity search.)
For `approxNearestNeighbors`, a common use case is in recommendations, to
efficiently support top-k recommendations across an entire dataset. This can't
easily be achieved with a self `approxSimilarityJoin`, because usually we want
up to `k` recommended (or most similar) items, and how do we select the
similarity threshold to achieve that? It's data dependent.
So I do think we need an efficient way to do `approxNearestNeighbors` over
a DataFrame of inputs rather than only one key at a time. I'd like to see this
applied to top-k prediction with `ALSModel`, as that would enable efficient
prediction (and make cross-validation on ranking metrics feasible). The current
approach, applied to, say, computing the top-k most similar items for each of
1 million items, would not, I think, be scalable. Perhaps either the ANN
approach can be extended to multiple inputs, or the similarity join can be
extended to also return `k` neighbors per item rather than a similarity
threshold (see the sketch below).
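As a rough illustration of the latter, assuming the join produces one row per candidate pair with columns `idA`, `idB` and `distCol` (these names are hypothetical, not the actual output schema of this PR), the top-k per item could be selected with a window function over the join output:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Sketch only: keeps the k nearest candidates for each left-hand item
// from the (approximate) pairwise join output.
def topKPerItem(joined: DataFrame, k: Int): DataFrame = {
  val byItem = Window.partitionBy(col("idA")).orderBy(col("distCol"))
  joined
    .withColumn("rank", row_number().over(byItem))
    .where(col("rank") <= k)
    .drop("rank")
}
```

Of course, doing this as a post-processing step still materialises all candidate pairs first; ideally the join itself would prune to `k` neighbors per item.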
I'd be interested to hear your other use cases - is it mainly similarity
join, or really doing ANN on only 1 item?