[
https://issues.apache.org/jira/browse/SPARK-30347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
zhengruifeng resolved SPARK-30347.
----------------------------------
Fix Version/s: 3.0.0
Resolution: Fixed
Issue resolved by pull request 27003
[https://github.com/apache/spark/pull/27003]
> LibSVMDataSource attach AttributeGroup
> --------------------------------------
>
> Key: SPARK-30347
> URL: https://issues.apache.org/jira/browse/SPARK-30347
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 3.0.0
> Reporter: zhengruifeng
> Assignee: zhengruifeng
> Priority: Minor
> Fix For: 3.0.0
>
>
> LibSVMDataSource attaches special metadata to the features column to indicate numFeatures:
> {code:java}
> scala> val data = spark.read.format("libsvm").load("/data0/Dev/Opensource/spark/data/mllib/sample_multiclass_classification_data.txt")
> 19/12/24 18:40:09 WARN LibSVMFileFormat: 'numFeatures' option not specified, determining the number of features by going though the input. If you know the number in advance, please specify it via 'numFeatures' option to avoid the extra scan.
> data: org.apache.spark.sql.DataFrame = [label: double, features: vector]
>
> scala> data.schema("features").metadata
> res0: org.apache.spark.sql.types.Metadata = {"numFeatures":4}
> {code}
> However, the ML implementations all try to obtain the vector size via {{AttributeGroup}},
> which cannot use this metadata:
> {code:java}
> scala> import org.apache.spark.ml.attribute._
> import org.apache.spark.ml.attribute._
>
> scala> AttributeGroup.fromStructField(data.schema("features")).size
> res1: Int = -1
> {code}
>
>
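For readers on a build without this fix, the gap described above can be bridged by hand. The following is only a rough sketch of a user-side workaround, not the change made in pull request 27003: it reads the "numFeatures" value shown in the transcript and re-attaches it as an AttributeGroup. The relative file path, the app name, and the variable names are assumptions chosen for illustration.

```scala
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// In spark-shell, getOrCreate() returns the session that already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("libsvm-attr-sketch") // hypothetical app name
  .getOrCreate()

// Assumed path, relative to a Spark source checkout.
val path = "data/mllib/sample_multiclass_classification_data.txt"
val data = spark.read.format("libsvm").load(path)

// The data source records the vector size under the "numFeatures" key.
val existing = data.schema("features").metadata
val numFeatures = existing.getLong("numFeatures").toInt

// Build an AttributeGroup of that size and merge it into the existing
// column metadata, so the "numFeatures" entry is kept as well.
val group = new AttributeGroup("features", numFeatures)
val withAttrs = data.select(
  col("label"),
  col("features").as("features", group.toMetadata(existing))
)

// The size is now visible through the API that the ML implementations use.
val size = AttributeGroup.fromStructField(withAttrs.schema("features")).size
```

Using select with an aliased column carrying explicit metadata keeps the sketch on public API only. With the metadata attached this way, size should match numFeatures rather than -1.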