[ https://issues.apache.org/jira/browse/SPARK-10467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Maciej Szymkiewicz updated SPARK-10467:
---------------------------------------
    Description: 
If we take a row from a data frame and try to extract a vector element by index, it is converted to a tuple:

{code}
from pyspark.ml.feature import HashingTF

df = sqlContext.createDataFrame([(["foo", "bar"], )], ("keys", ))
transformer = HashingTF(inputCol="keys", outputCol="vec", numFeatures=5)
transformed = transformer.transform(df)
row = transformed.first()

row.vec  # As expected
## SparseVector(5, {4: 2.0})

row[1]  # Returns a tuple
## (0, 5, [4], [2.0])
{code}

The problem cannot be reproduced if we create and access a Row directly:

{code}
from pyspark.mllib.linalg import Vectors
from pyspark.sql.types import Row

row = Row(vec=Vectors.sparse(3, [(0, 1)]))

row.vec
## SparseVector(3, {0: 1.0})

row[0]
## SparseVector(3, {0: 1.0})
{code}

but it reappears if we put the same Row in a data frame:

{code}
df = sqlContext.createDataFrame([row], ("vec", ))

df.first()[0]
## (0, 3, [0], [1.0])
{code}

  was:
{code}
from pyspark.ml.feature import HashingTF

df = sqlContext.createDataFrame([(["foo", "bar"], )], ("keys", ))
transformer = HashingTF(inputCol="keys", outputCol="vec", numFeatures=5)
transformed = transformer.transform(df)
row = transformed.first()

row.vec  # As expected
## SparseVector(5, {4: 2.0})

row[1]  # Returns a tuple
## (0, 5, [4], [2.0])
{code}

The problem cannot be reproduced if we create a Row directly:

{code}
from pyspark.mllib.linalg import Vectors
from pyspark.sql.types import Row

row = Row(vec=Vectors.sparse(3, [(0, 1)]))

row.vec
## SparseVector(3, {0: 1.0})

row[0]
## SparseVector(3, {0: 1.0})
{code}


> Vector is converted to tuple when extracted from Row using __getitem__
> ----------------------------------------------------------------------
>
>                 Key: SPARK-10467
>                 URL: https://issues.apache.org/jira/browse/SPARK-10467
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, PySpark, SQL
>    Affects Versions: 1.4.1
>            Reporter: Maciej Szymkiewicz
>            Priority: Minor
>
> If we take a row from a data frame and try to extract a vector element by index, it is converted to a tuple:
> {code}
> from pyspark.ml.feature import HashingTF
> df = sqlContext.createDataFrame([(["foo", "bar"], )], ("keys", ))
> transformer = HashingTF(inputCol="keys", outputCol="vec", numFeatures=5)
> transformed = transformer.transform(df)
> row = transformed.first()
> row.vec  # As expected
> ## SparseVector(5, {4: 2.0})
> row[1]  # Returns a tuple
> ## (0, 5, [4], [2.0])
> {code}
> The problem cannot be reproduced if we create and access a Row directly:
> {code}
> from pyspark.mllib.linalg import Vectors
> from pyspark.sql.types import Row
> row = Row(vec=Vectors.sparse(3, [(0, 1)]))
> row.vec
> ## SparseVector(3, {0: 1.0})
> row[0]
> ## SparseVector(3, {0: 1.0})
> {code}
> but it reappears if we put the same Row in a data frame:
> {code}
> df = sqlContext.createDataFrame([row], ("vec", ))
> df.first()[0]
> ## (0, 3, [0], [1.0])
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
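A note on the tuples shown in the report: {{(0, 5, [4], [2.0])}} has the shape of a UDT-serialized sparse vector (a type tag, the size, the index list, and the value list), which suggests that {{\_\_getitem\_\_}} is handing back the raw serialized datum while attribute access deserializes it. Below is a minimal pure-Python sketch of that round trip; it is illustrative only (the function names are hypothetical, not Spark's actual {{VectorUDT}} code), and the tag convention of {{0}} for sparse is an assumption inferred from the tuples in the report:

```python
# Simplified, illustrative model of a UDT-style tuple encoding for vectors.
# NOT Spark's actual VectorUDT implementation; the 0-tag for sparse vectors
# is an assumption inferred from the tuples quoted in the bug report.

def serialize_sparse(size, indices, values):
    """Encode a sparse vector as (tag, size, indices, values)."""
    return (0, size, list(indices), list(values))

def deserialize(datum):
    """Decode the tuple back into a human-readable sparse representation."""
    if datum[0] == 0:  # tag 0: sparse vector
        _, size, indices, values = datum
        return ("SparseVector", size, dict(zip(indices, values)))
    raise ValueError("unsupported tag: %r" % (datum[0],))

# The encoded form is what row[1] appears to leak in the report ...
encoded = serialize_sparse(5, [4], [2.0])
print(encoded)              # (0, 5, [4], [2.0])
# ... while row.vec appears to return the deserialized object.
print(deserialize(encoded))
```

Under this model, the bug is simply that index access skips the deserialization step that attribute access performs.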