GitHub user sethah commented on the issue:
https://github.com/apache/spark/pull/15831
I see this patch was created as a result of the PR that separated the
ml/mllib linalg packages, to avoid some inefficiencies in conversion. However,
it is also a partial step toward feature parity. Typically, we would port full
algorithms all at once, instead of just porting the transformer functionality
as is done here, but I understand that this is not just about parity. I would
suggest one of the following:
1. Port over full feature functionality. This increases the scope, and
therefore the algos should probably be separated out into individual PRs.
2. Keep the scope the same, but avoid copying code.
For an example of option 2, for `ChiSqSelector`, we can implement new
static methods in the `mllib.ChiSqSelectorModel`:
````scala
private[spark] def compressDense(
    selectedFeatures: Array[Int],
    values: Array[Double]): Array[Double] = {
  selectedFeatures.map(i => values(i))
}

private[spark] def compressSparse(
    compressedSize: Int,
    selectedFeatures: Array[Int],
    indices: Array[Int],
    values: Array[Double]): (Array[Int], Array[Double]) = {
  ...
}
````
Then, in the actual model classes, we can just do something like:
````scala
private def compress(features: Vector): Vector = {
  features match {
    case SparseVector(_, indices, values) =>
      val newSize = selectedFeatures.length
      val (newIndices, newValues) =
        ChiSqSelectorModel.compressSparse(newSize, selectedFeatures, indices, values)
      Vectors.sparse(newSize, newIndices, newValues)
    case DenseVector(values) =>
      Vectors.dense(ChiSqSelectorModel.compressDense(selectedFeatures, values))
  }
}
````
This approach would allow us to avoid copying a lot of code until we do
full feature ports. What are others' opinions? I lean towards the second option
since it keeps the scope reasonable.
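
The body of `compressSparse` is elided above. As a rough illustration only, one
way such a helper could work is a two-pointer merge over the sparse indices and
the selected features. This sketch assumes both `selectedFeatures` and
`indices` are sorted in ascending order, and `CompressSketch` is a hypothetical
wrapper object used only to keep the snippet self-contained; it is not the
actual code from the patch.

````scala
import scala.collection.mutable.ArrayBuilder

// Hypothetical stand-alone object, for illustration only. In the proposal
// the helper would live alongside mllib's ChiSqSelectorModel.
object CompressSketch {

  // Assumption: `selectedFeatures` and `indices` are sorted ascending.
  // Returns the indices (positions within `selectedFeatures`) and values of
  // the non-zero entries that survive the selection.
  def compressSparse(
      compressedSize: Int,
      selectedFeatures: Array[Int],
      indices: Array[Int],
      values: Array[Double]): (Array[Int], Array[Double]) = {
    val newIndices = new ArrayBuilder.ofInt
    val newValues = new ArrayBuilder.ofDouble
    // At most min(compressedSize, indices.length) entries can be kept.
    val hint = math.min(compressedSize, indices.length)
    newIndices.sizeHint(hint)
    newValues.sizeHint(hint)

    var i = 0 // position in the sparse vector's `indices`
    var j = 0 // position in `selectedFeatures`
    while (i < indices.length && j < selectedFeatures.length) {
      if (indices(i) == selectedFeatures(j)) {
        // Selected feature with a non-zero value: keep it at new index j.
        newIndices += j
        newValues += values(i)
        i += 1
        j += 1
      } else if (indices(i) > selectedFeatures(j)) {
        // Selected feature is zero in this vector: skip it.
        j += 1
      } else {
        // Non-zero entry belongs to a feature that was not selected: drop it.
        i += 1
      }
    }
    (newIndices.result(), newValues.result())
  }
}
````

For example, with `selectedFeatures = Array(1, 3, 4)` and a sparse vector whose
non-zero entries are at indices `Array(0, 3, 4)`, the entry at index 0 is
dropped and the remaining two entries are renumbered to positions 1 and 2.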
cc @dbtsai @yanboliang