Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/4460#issuecomment-73599916
@srowen I'm thinking of connecting `FeatureAttributes` and SQL's `Metadata`
through factory methods but not via constructors.
~~~
val attributes = FeatureAttributes.fromMetadata(metadata)
val metadata = attributes.toMetadata
~~~
And we can build `FeatureAttributes` using a builder. This would make it
easier for us to write unit tests and manipulate feature attributes.
By `feature dimension`, I mean the vector size for a vector-typed feature
column.
I see that you made feature names fundamental. It might be hard to provide
every feature a name, especially after a series of feature transformations.
Essentially, we want to maintain a bi-map between feature indices and names,
while some names might be missing. It would be better if we can use the
feature index as the feature identifier.
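To make the bi-map idea concrete, here is a minimal sketch, assuming names are stored as `Option[String]` per index. `FeatureNameIndex`, `nameOf`, and `indexOf` are hypothetical names used only for illustration, not part of Spark or of this PR.

```scala
// Hypothetical helper (not a Spark API): a two-way lookup between feature
// indices and feature names. The index is the identifier; a name is optional.
final class FeatureNameIndex(names: IndexedSeq[Option[String]]) {
  // Reverse map, built only from the features that actually have a name.
  private val nameToIndex: Map[String, Int] =
    names.zipWithIndex.collect { case (Some(n), i) => n -> i }.toMap

  // Assumption: a name, when present, identifies exactly one feature.
  require(nameToIndex.size == names.count(_.isDefined), "feature names must be unique")

  def size: Int = names.length

  // Throws IndexOutOfBoundsException for an invalid index.
  def nameOf(index: Int): Option[String] = names(index)

  // None when no feature carries this name.
  def indexOf(name: String): Option[Int] = nameToIndex.get(name)
}

object FeatureNameIndexDemo {
  def main(args: Array[String]): Unit = {
    val idx = new FeatureNameIndex(Vector(Some("age"), None, Some("height")))
    println(idx.indexOf("height"))
    println(idx.nameOf(1))
  }
}
```

With this shape, every feature is addressable by index even after transformations drop or never assign names, and name lookups simply return `None` for unnamed features.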
Given a `FeatureAttributes` instance, the following methods seem to be
necessary:
1. `size(): Int`: 1 for a scalar column, and vectorSize for a vector column
2. `producer: String`: logs who produced those attributes
3. `getFeatureAttribute(index: Int): Attribute`: gets a feature's attribute
from its index
4. `getFeatureIndex(name: String)`: gets a feature's index from its name
5. `categoricalFeatureIndices(): Array[Int]`: returns the indices of the
categorical features
The constructor will look like
~~~
class FeatureAttributes(val attributes: Array[Attribute], val producer: String)
~~~
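As a rough sketch of how the methods listed above might sit on top of that constructor: the `Attribute` stand-in below, its `isCategorical` field, and the `Option[Int]` return type of `getFeatureIndex` are all assumptions made for illustration, since none of them is fixed yet.

```scala
// Minimal stand-in for the proposed Attribute type; both fields are assumptions.
case class Attribute(name: Option[String], isCategorical: Boolean)

class FeatureAttributes(val attributes: Array[Attribute], val producer: String) {
  // Internal map from feature name to feature index; unnamed features are skipped.
  private val nameToIndex: Map[String, Int] =
    attributes.zipWithIndex.collect { case (Attribute(Some(n), _), i) => n -> i }.toMap

  // 1 for a scalar column, the vector size for a vector column.
  def size: Int = attributes.length

  // Gets a feature's attribute from its index.
  def getFeatureAttribute(index: Int): Attribute = attributes(index)

  // Assumption: Option[Int], so that a missing name is not an error.
  def getFeatureIndex(name: String): Option[Int] = nameToIndex.get(name)

  // Indices of the categorical features, in ascending order.
  def categoricalFeatureIndices: Array[Int] =
    attributes.zipWithIndex.collect { case (a, i) if a.isCategorical => i }
}

object FeatureAttributesDemo {
  def main(args: Array[String]): Unit = {
    val attrs = Array(
      Attribute(Some("age"), isCategorical = false),
      Attribute(None, isCategorical = true),
      Attribute(Some("color"), isCategorical = true))
    val fa = new FeatureAttributes(attrs, producer = "demo")
    println(fa.getFeatureIndex("color"))
    println(fa.categoricalFeatureIndices.mkString(","))
  }
}
```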
And we maintain a map from feature names to feature indices internally. For
each `Attribute`, we can put the following fields:
1. `featureType: Int`: continuous, categorical, etc
2. `name: String`
Then we can have `ContinuousAttribute` and `CategoricalAttribute` as two
subclasses of `Attribute`, each implementing its own methods, including
`toMetadata`/`fromMetadata`. For a continuous feature, we might want to
leave slots for min, max, support, etc. For a categorical attribute, we
want to at least have `categories: Array[String]` and `numCategories`.
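A possible shape for the two subclasses and the metadata round trip, sketched under loud assumptions: a plain `Map[String, Any]` (aliased as `Meta`) stands in for SQL's `Metadata` so the example runs without Spark, the type tag is a string rather than the `featureType: Int` mentioned above, and the key names are invented for illustration.

```scala
object AttributeSketch {
  // Stand-in for SQL's Metadata (assumption), so the sketch needs no Spark dependency.
  type Meta = Map[String, Any]

  // Keeps only the entries whose value is present.
  private def fields(pairs: (String, Option[Any])*): Meta =
    pairs.collect { case (k, Some(v)) => k -> v }.toMap

  sealed trait Attribute {
    def name: Option[String]
    def toMetadata: Meta
  }

  // Slots for min and max are optional; further slots such as support are omitted here.
  final case class ContinuousAttribute(
      name: Option[String] = None,
      min: Option[Double] = None,
      max: Option[Double] = None) extends Attribute {
    def toMetadata: Meta =
      fields("type" -> Some("continuous"), "name" -> name, "min" -> min, "max" -> max)
  }

  final class CategoricalAttribute(
      val name: Option[String],
      val categories: Array[String]) extends Attribute {
    def numCategories: Int = categories.length
    def toMetadata: Meta =
      fields("type" -> Some("categorical"), "name" -> name, "categories" -> Some(categories))
  }

  // Rebuilds an attribute from its metadata; anything not tagged
  // "categorical" is treated as continuous in this sketch.
  def fromMetadata(m: Meta): Attribute = {
    val name = m.get("name").collect { case s: String => s }
    m.get("type") match {
      case Some("categorical") =>
        val categories = m.get("categories") match {
          case Some(a: Array[String]) => a
          case _ => Array.empty[String]
        }
        new CategoricalAttribute(name, categories)
      case _ =>
        def double(key: String): Option[Double] =
          m.get(key).collect { case d: Double => d }
        ContinuousAttribute(name, double("min"), double("max"))
    }
  }
}
```

In the real design the `Meta` alias would be replaced by the actual `Metadata` type, and the encoding of the feature type would follow whatever `featureType` convention is settled on.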
This is basically my brain dump ... we definitely need to revise them when
we get into the details.