[
https://issues.apache.org/jira/browse/SPARK-20081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15967196#comment-15967196
]
Yan Facai (颜发才) edited comment on SPARK-20081 at 4/13/17 6:48 AM:
------------------------------------------------------------------
[~creinig] Christian, RandomForestClassifier uses numClasses to calculate the
memory space it needs.
As far as I know, numClasses is inferred by `getNumClasses` of Classifier, and
there is currently no explicit setter such as `setNumClasses`. I don't know
whether one is planned.
Moreover, NominalAttribute is private[ml], so it seems that we cannot modify the
metadata from outside.
However, you can use `StringIndexer` to transform your label column, and
StringIndexer will construct the correct metadata (nomAttr.getNumValues) for you.
For its usage, see:
http://spark.apache.org/docs/latest/ml-features.html#stringindexer
The solution is a little tricky. What do you think?
ping [~josephkb]
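The StringIndexer workaround described above could look roughly like the following. This is only a minimal sketch, not a tested fix for this issue: the toy data, the column names `rawLabel`/`label`/`features`, and the object name `ManyLabelsSketch` are all made up for illustration, and it assumes a Spark 2.x build with spark-mllib on the classpath.

```scala
import org.apache.spark.ml.attribute.{Attribute, NominalAttribute}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// Hypothetical sketch: index a label column with more than 100 distinct values,
// then train a RandomForestClassifier on the indexed column.
object ManyLabelsSketch {

  // Returns (number of label values recorded in the column metadata,
  //          numClasses reported by the fitted model).
  def run(spark: SparkSession): (Option[Int], Int) = {
    // Made-up toy data: 143 distinct string labels, 10 rows per label.
    val rows = (0 until 1430).map { i =>
      (s"class_${i % 143}", Vectors.dense((i % 143).toDouble, (i % 7).toDouble))
    }
    val raw = spark.createDataFrame(rows).toDF("rawLabel", "features")

    // StringIndexer maps each distinct label to an index and attaches
    // nominal-attribute metadata to the output column.
    val indexed = new StringIndexer()
      .setInputCol("rawLabel")
      .setOutputCol("label")
      .fit(raw)
      .transform(raw)

    // Read back what the metadata says about the number of label values.
    val numValues = Attribute.fromStructField(indexed.schema("label")) match {
      case nom: NominalAttribute => nom.getNumValues
      case _                     => None
    }

    // Train on the indexed label column; the number of classes is then taken
    // from the metadata rather than inferred from the label values.
    val model = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setNumTrees(5)
      .fit(indexed)

    (numValues, model.numClasses)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ManyLabelsSketch")
      .master("local[*]")
      .getOrCreate()
    val (numValues, numClasses) = run(spark)
    println(s"label metadata numValues = $numValues, model numClasses = $numClasses")
    spark.stop()
  }
}
```

If the label column is already numeric, StringIndexer should still work on it, since it treats the values as strings; the resulting indices are ordered by label frequency, so they will generally not match the original numeric values.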
was (Author: facai):
[~creinig] Christian, RandomForestClassifier uses numClasses to calculate the
memory space it needs.
As far as I know, numClasses is inferred by `getNumClasses` of Classifier, and
there is currently no explicit setter such as `setNumClasses`. I don't know
whether one is planned.
In fact, you can use `StringIndexer` to transform your label column, and
StringIndexer will construct the correct metadata (nomAttr.getNumValues) for you.
For its usage, see:
http://spark.apache.org/docs/latest/ml-features.html#stringindexer
The solution is a little tricky. What do you think?
ping [~josephkb]
> RandomForestClassifier doesn't seem to support more than 100 labels
> -------------------------------------------------------------------
>
> Key: SPARK-20081
> URL: https://issues.apache.org/jira/browse/SPARK-20081
> Project: Spark
> Issue Type: Bug
> Components: ML, MLlib
> Affects Versions: 2.1.0
> Environment: Java
> Reporter: Christian Reiniger
>
> When feeding data with more than 100 labels into RandomForestClassifier#fit()
> (from Java code), I get the following error message:
> {code}
> Classifier inferred 143 from label values in column
> rfc_df0e968db9df__labelCol, but this exceeded the max numClasses (100)
> allowed to be inferred from values.
> To avoid this error for labels with > 100 classes, specify numClasses
> explicitly in the metadata; this can be done by applying StringIndexer to the
> label column.
> {code}
> Setting "numClasses" in the metadata for the label column doesn't make a
> difference. Looking at the code, this is not surprising, since
> MetadataUtils.getNumClasses() ignores this setting:
> {code:language=scala}
> def getNumClasses(labelSchema: StructField): Option[Int] = {
>   Attribute.fromStructField(labelSchema) match {
>     case binAttr: BinaryAttribute => Some(2)
>     case nomAttr: NominalAttribute => nomAttr.getNumValues
>     case _: NumericAttribute | UnresolvedAttribute => None
>   }
> }
> {code}
> The alternative would be to pass a proper "maxNumClasses" parameter to the
> classifier, so that Classifier#getNumClasses() allows a larger number of
> auto-detected labels. However, RandomForestClassifier#train() calls
> #getNumClasses without the "maxNumClasses" parameter, causing it to use the
> default of 100:
> {code:language=scala}
> override protected def train(dataset: Dataset[_]): RandomForestClassificationModel = {
>   val categoricalFeatures: Map[Int, Int] =
>     MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol)))
>   val numClasses: Int = getNumClasses(dataset)
>   // ...
> {code}
> My Scala skills are pretty sketchy, so please correct me if I misinterpreted
> something. But as it stands right now, there seems to be no way to learn from
> data with more than 100 labels via RandomForestClassifier.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)