[ https://issues.apache.org/jira/browse/PHOENIX-6559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17419403#comment-17419403 ]
ASF GitHub Bot commented on PHOENIX-6559:
-----------------------------------------
stoty closed pull request #63:
URL: https://github.com/apache/phoenix-connectors/pull/63
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
> spark connector access to SmallintArray / UnsignedSmallintArray columns
> -----------------------------------------------------------------------
>
> Key: PHOENIX-6559
> URL: https://issues.apache.org/jira/browse/PHOENIX-6559
> Project: Phoenix
> Issue Type: Bug
> Components: connectors, spark-connector
> Affects Versions: connectors-6.0.0
> Reporter: Alvaro Fernandez
> Assignee: Alvaro Fernandez
> Priority: Major
> Fix For: connectors-6.0.0
>
> Attachments: PHOENIX-6559.master.v1.patch
>
>
> We have some tables defined with SMALLINT ARRAY[] columns that cannot be
> read correctly through the Spark connector. The connector appears to infer
> the Spark data type incorrectly as an array of integers,
> ArrayType(IntegerType), instead of ArrayType(ShortType).
> An example table:
> {code:java}
> CREATE TABLE IF NOT EXISTS AEIDEV.ARRAY_TABLE (
>     ID BIGINT NOT NULL PRIMARY KEY,
>     COL1 SMALLINT ARRAY[]
> );
> UPSERT INTO AEIDEV.ARRAY_TABLE VALUES (1, ARRAY[-32678,-9876,-234,-1]);
> UPSERT INTO AEIDEV.ARRAY_TABLE VALUES (2, ARRAY[0,8,9,10]);
> UPSERT INTO AEIDEV.ARRAY_TABLE VALUES (3, ARRAY[123,1234,12345,32767]);
> {code}
> Reading the table from Spark returns wrong values:
>
> {code:java}
> scala> val df = spark.sqlContext.read.format("org.apache.phoenix.spark").option("table","AEIDEV.ARRAY_TABLE").option("zkUrl","ithdp1101.cern.ch:2181").load
> df: org.apache.spark.sql.DataFrame = [ID: bigint, COL1: array<int>]
>
> scala> df.show
> +---+--------------------+
> | ID|                COL1|
> +---+--------------------+
> |  1|[-647200678, -234...|
> |  2|[524288, 655369, ...|
> |  3|[80871547, 214743...|
> +---+--------------------+
>
> scala> df.collect
> res3: Array[org.apache.spark.sql.Row] = Array([1,WrappedArray(-647200678, -234, 0, 0)], [2,WrappedArray(524288, 655369, 0, 0)], [3,WrappedArray(80871547, 2147430457, 0, 0)])
> {code}
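>
> The corrupted values hint at the mechanism (this is an observation from the
> output above, not a claim about the connector internals): each wrong 32-bit
> value is exactly two adjacent 16-bit elements packed into one int, with the
> trailing array slots zero-filled. A quick REPL check (packPair is a helper
> name made up for this illustration):
>
> {code:java}
> // Hypothesis check only: pack two adjacent shorts into one int,
> // second element in the high 16 bits, first in the low 16 bits.
> def packPair(lo: Short, hi: Short): Int = (hi << 16) | (lo & 0xFFFF)
>
> packPair(-32678, -9876) // -647200678, first corrupt value of row 1
> packPair(-234, -1)      // -234, second corrupt value of row 1
> packPair(0, 8)          // 524288, first corrupt value of row 2
> packPair(9, 10)         // 655369, second corrupt value of row 2
> {code}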
> We have identified the problem in the SparkSchemaUtil class and applied the
> tiny patch attached to this report. With the patch, the data type is
> inferred correctly and the results are correct:
>
> {code:java}
> scala> val df = spark.sqlContext.read.format("org.apache.phoenix.spark").option("table","AEIDEV.ARRAY_TABLE").option("zkUrl","ithdp1101.cern.ch:2181").load
> df: org.apache.spark.sql.DataFrame = [ID: bigint, COL1: array<smallint>]
>
> scala> df.show
> +---+--------------------+
> | ID|                COL1|
> +---+--------------------+
> |  1|[-32678, -9876, -...|
> |  2|       [0, 8, 9, 10]|
> |  3|[123, 1234, 12345...|
> +---+--------------------+
>
> scala> df.collect
> res1: Array[org.apache.spark.sql.Row] = Array([1,WrappedArray(-32678, -9876, -234, -1)], [2,WrappedArray(0, 8, 9, 10)], [3,WrappedArray(123, 1234, 12345, 32767)])
> {code}
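>
> For reference, a minimal sketch of the kind of change involved (simplified;
> the attached PHOENIX-6559.master.v1.patch is authoritative, and the helper
> name catalystArrayType is invented for this sketch). The point is that the
> SMALLINT array types need their own case mapping to ArrayType(ShortType)
> rather than being lumped in with the INTEGER array types:
>
> {code:java}
> import org.apache.phoenix.schema.types._
> import org.apache.spark.sql.types._
>
> // Sketch only: map Phoenix array types to Catalyst array types.
> def catalystArrayType(pType: PDataType[_]): DataType = pType match {
>   // The fix: SMALLINT arrays get ShortType elements...
>   case _: PSmallintArray | _: PUnsignedSmallintArray =>
>     ArrayType(ShortType, containsNull = true)
>   // ...instead of falling through to the INTEGER array mapping.
>   case _: PIntegerArray | _: PUnsignedIntArray =>
>     ArrayType(IntegerType, containsNull = true)
>   case _ => ??? // other Phoenix types elided from this sketch
> }
> {code}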
>
>
> We can provide more information and submit a merge request if needed.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)