Pavlo Borshchenko created SPARK-33401:
-----------------------------------------

             Summary: Vector type column is not possible to create using spark SQL
                 Key: SPARK-33401
                 URL: https://issues.apache.org/jira/browse/SPARK-33401
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.0.1
            Reporter: Pavlo Borshchenko


 

Create a table with a vector type column:
{code:java}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.VectorUDT
import org.apache.spark.mllib.linalg.Vectors
case class Test(features: Vector) 
Seq(Test(Vectors.dense(Array(1d, 2d, 3d)))).toDF()
 .write
 .mode("overwrite")
 .saveAsTable("pborshchenko.test_vector_spark_0911_1")
{code}
 

Show the create table statement for this created table:
{code:java}
spark.sql("SHOW CREATE TABLE pborshchenko.test_vector_spark_0911_1"){code}
Got:
{code:java}
CREATE TABLE `pborshchenko`.`test_vector_spark_0911_1` (
 `features` STRUCT<`type`: TINYINT, `size`: INT, `indices`: ARRAY<INT>, `values`: ARRAY<DOUBLE>>)
USING parquet{code}
Create the same table from Spark SQL, this time with the suffix 2 in the table name:
{code:java}
spark.sql("CREATE TABLE `pborshchenko`.`test_vector_spark_0911_2` (\n`features` STRUCT<`type`: TINYINT, `size`: INT, `indices`: ARRAY<INT>, `values`: ARRAY<DOUBLE>>)\nUSING parquet"){code}
Try to insert new values into the table created from SQL:

 
{code:java}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.VectorUDT
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.SaveMode
case class Test(features: Vector)
Seq(Test(Vectors.dense(Array(1d, 2d, 3d)))).toDF()
 .write
 .mode(SaveMode.Append)
 .insertInto("pborshchenko.test_vector_spark_0911_2")
{code}
 

Got:
 
{code:java}
AnalysisException: Cannot write incompatible data to table '`pborshchenko`.`test_vector_spark_0911_2`':
- Cannot write 'features': struct<type:tinyint,size:int,indices:array<int>,values:array<double>> is incompatible with struct<type:tinyint,size:int,indices:array<int>,values:array<double>>;
- Cannot write 'features': struct<type:tinyint,size:int,indices:array<int>,values:array<double>> is incompatible with struct<type:tinyint,size:int,indices:array<int>,values:array<double>>;
 at org.apache.spark.sql.catalyst.analysis.TableOutputResolver$.resolveOutputColumns(TableOutputResolver.scala:72)
 at org.apache.spark.sql.execution.datasources.PreprocessTableInsertion.org$apache$spark$sql$execution$datasources$PreprocessTableInsertion$$preprocess(rules.scala:467)
 at org.apache.spark.sql.execution.datasources.PreprocessTableInsertion$$anonfun$apply$3.applyOrElse(rules.scala:494)
 at org.apache.spark.sql.execution.datasources.PreprocessTableInsertion$$anonfun$apply$3.applyOrElse(rules.scala:486)
 at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$2(AnalysisHelper.scala:112)
{code}
 
The reason is that the table created from Spark SQL has a column of type STRUCT, 
not vector, even though this struct is the correct representation of the vector type.
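
To make the mismatch visible, the following is a minimal sketch, assuming a spark-shell session on Spark 3.0.x (where `spark` is predefined). The names `udtType` and `structType` are illustrative only and do not appear in the report. It compares the column type of the DataFrame with the struct that the vector type is stored as. If the two print the same catalog string but do not compare equal, that would be consistent with the error message above showing two identical-looking types.
{code:java}
import org.apache.spark.mllib.linalg.{Vector, VectorUDT, Vectors}
import spark.implicits._

case class Test(features: Vector)

val df = Seq(Test(Vectors.dense(Array(1d, 2d, 3d)))).toDF()

// Type of the column as the DataFrame sees it (expected to be the vector UDT).
val udtType = df.schema("features").dataType

// Storage-level struct that the vector UDT maps to.
val structType = new VectorUDT().sqlType

// Concrete classes of the two types.
println(udtType.getClass.getName)
println(structType.getClass.getName)

// Textual form of each type, as used in catalog and error output.
println(udtType.catalogString)
println(structType.catalogString)

// Whether Spark treats the two as the same data type.
println(udtType == structType)
{code}
A table declared in SQL with the STRUCT definition carries only the second type, so a write check that compares data types rather than their printed form can reject the vector column.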

AC: It should be possible to create a table with a vector type column using 
Spark SQL and then write to it without any errors.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)