[
https://issues.apache.org/jira/browse/SPARK-48463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17854153#comment-17854153
]
Chhavi Bansal edited comment on SPARK-48463 at 6/11/24 6:32 PM:
----------------------------------------------------------------
[~weichenxu123] I tried using
{code:java}
new StringIndexer().setInputCol("location.longitude").setOutputCol("longitutdee")
{code}
without flattening the dataset, but it fails before reaching the *getSelectedCols()* function, inside
{code:java}
validateAndTransformSchema$2(StringIndexer.scala:128){code}
itself. Did it work for you?
> MLLib function unable to handle nested data
> -------------------------------------------
>
> Key: SPARK-48463
> URL: https://issues.apache.org/jira/browse/SPARK-48463
> Project: Spark
> Issue Type: Bug
> Components: ML, MLlib
> Affects Versions: 3.5.1
> Reporter: Chhavi Bansal
> Priority: Major
> Labels: ML, MLPipelines, mllib, nested
>
> I am trying to use a feature transformer on nested data after flattening it, but it fails.
>
> {code:java}
> import org.apache.spark.sql.{Column, Row}
> import org.apache.spark.sql.functions.col
> import org.apache.spark.sql.types.{IntegerType, StructType}
>
> val structureData = Seq(
>   Row(Row(10, 12), 1000),
>   Row(Row(12, 14), 4300),
>   Row(Row(37, 891), 1400),
>   Row(Row(8902, 12), 4000),
>   Row(Row(12, 89), 1000)
> )
> val structureSchema = new StructType()
>   .add("location", new StructType()
>     .add("longitude", IntegerType)
>     .add("latitude", IntegerType))
>   .add("salary", IntegerType)
> val df = spark.createDataFrame(spark.sparkContext.parallelize(structureData), structureSchema)
>
> def flattenSchema(schema: StructType, prefix: String = null, prefixSelect: String = null): Array[Column] = {
>   schema.fields.flatMap { f =>
>     val colName = if (prefix == null) f.name else prefix + "." + f.name
>     val colnameSelect = if (prefix == null) f.name else prefixSelect + "." + f.name
>     f.dataType match {
>       case st: StructType => flattenSchema(st, colName, colnameSelect)
>       case _ => Array(col(colName).as(colnameSelect))
>     }
>   }
> }
>
> val flattenColumns = flattenSchema(df.schema)
> val flattenedDf = df.select(flattenColumns: _*){code}
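> A point worth making explicit: because each leaf is aliased as "<parent>.<field>", the flattened DataFrame's *top-level* column names literally contain dots. The naming rule can be sketched in plain Scala, no Spark needed (the `Node`/`leafNames` names are mine, purely illustrative):

```scala
// Minimal stand-in for a nested schema: a node with an optional list of children.
case class Node(name: String, children: Seq[Node] = Nil)

// Mirrors the aliasing rule of flattenSchema above: a leaf's full name is
// "<prefix>.<name>"; struct nodes recurse with their own name as the prefix.
def leafNames(n: Node, prefix: String = null): Seq[String] = {
  val full = if (prefix == null) n.name else prefix + "." + n.name
  if (n.children.isEmpty) Seq(full)
  else n.children.flatMap(c => leafNames(c, full))
}

// The schema from the reproduction: location.{longitude, latitude} and salary.
val schema = Seq(
  Node("location", Seq(Node("longitude"), Node("latitude"))),
  Node("salary"))

println(schema.flatMap(n => leafNames(n)))
// List(location.longitude, location.latitude, salary)
```

> Those dotted names are exactly what the resolver later mistakes for struct field accesses.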
> Now apply the StringIndexer to the dot-notation column name:
>
> {code:java}
> import org.apache.spark.ml.Pipeline
> import org.apache.spark.ml.feature.StringIndexer
>
> val si = new StringIndexer().setInputCol("location.longitude").setOutputCol("longitutdee")
> val pipeline = new Pipeline().setStages(Array(si))
> pipeline.fit(flattenedDf).transform(flattenedDf).show(){code}
> The above code fails with:
> {code:java}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot resolve column name "location.longitude" among (location.longitude, location.latitude, salary); did you mean to quote the `location.longitude` column?
>     at org.apache.spark.sql.errors.QueryCompilationErrors$.cannotResolveColumnNameAmongFieldsError(QueryCompilationErrors.scala:2261)
>     at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$resolveException(Dataset.scala:258)
>     at org.apache.spark.sql.Dataset.$anonfun$resolve$1(Dataset.scala:250)
>     ..... {code}
> This is the same failure as when selecting dot-notation columns from a Spark DataFrame, which is solved by quoting the column name in BACKTICKS: *`column.name`.*
> [https://stackoverflow.com/a/51430335/11688337]
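> For plain Dataset.select, that backtick rule is mechanical enough to capture in a tiny helper. This is a hypothetical sketch (`quoted` is my own name, not a Spark API), and it assumes the column name itself contains no backticks:

```scala
// Wrap a dotted name in backticks so the SQL resolver treats it as one
// top-level column rather than a struct field access (location.longitude
// vs. `location.longitude`). Dot-free names pass through unchanged.
def quoted(name: String): String =
  if (name.contains(".")) "`" + name + "`" else name

println(quoted("location.longitude")) // `location.longitude`
println(quoted("salary"))             // salary
```

> As the rest of the ticket shows, though, this only helps for select/col resolution; StringIndexer's own schema check rejects the quoted form.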
>
> *So next,*
> I used backticks while defining the StringIndexer:
> {code:java}
> val si = new StringIndexer().setInputCol("`location.longitude`").setOutputCol("longitutdee")
> {code}
> In this case *it again fails* (for a different reason), inside the StringIndexer code itself:
> {code:java}
> Exception in thread "main" org.apache.spark.SparkException: Input column `location.longitude` does not exist.
>     at org.apache.spark.ml.feature.StringIndexerBase.$anonfun$validateAndTransformSchema$2(StringIndexer.scala:128)
>     at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:244)
>     at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>     at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
> {code}
>
> This blocks me from using feature transformation functions on nested columns.
> Any help in solving this problem would be highly appreciated.
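> One workaround I can sketch (my own assumption, not something the ticket confirms works end to end): alias the dots away during flattening, e.g. replace them with underscores, so ML stages only ever see dot-free top-level names:

```scala
// Hypothetical name sanitizer: turn "location.longitude" into
// "location_longitude" before it becomes a DataFrame column name.
// Assumes underscores don't already create collisions in the schema.
def sanitize(name: String): String = name.replace('.', '_')

println(sanitize("location.longitude")) // location_longitude
```

> With flattenSchema changed to alias leaves via sanitize(colnameSelect), StringIndexer could then be configured with setInputCol("location_longitude") and no quoting issue arises, at the cost of losing the dotted naming convention.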
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]