[ 
https://issues.apache.org/jira/browse/PHOENIX-7377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

rejeb ben rejeb updated PHOENIX-7377:
-------------------------------------
    Description: 
The upgrade of the Spark connector to the DataSource V2 API made a major change 
to the way the schema is inferred.

In previous versions of the connector, columns in a non-default column family 
were mapped to "columnName" in the DataFrame. Now they are mapped to 
"columnFamily.columnName".

There are no unit tests that cover this case; all existing tests use tables 
with the default column family "0".

The change was made in this [pull 
request|https://github.com/apache/phoenix/pull/423] (the project has since 
moved to another git repository):
 * The previous version used `ColumnInfo.getDisplayName` to define the name 
of the column in the DataFrame.
 * The new class SparkSchemaUtil uses `ColumnInfo.getColumnName`, which 
returns the column name as `columnFamilyName.columnName`.
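As an illustration of the difference (plain Python mimicking the behavior described above, not the connector's actual code; the function names are invented for this sketch):

```python
# Illustration only: mimics how the two ColumnInfo methods would name
# a DataFrame column for a column in a non-default column family.

DEFAULT_FAMILY = "0"  # Phoenix's default column family


def display_name(family: str, column: str) -> str:
    # Old behavior (getDisplayName-like): the family prefix is dropped,
    # so the DataFrame column is just the column name.
    return column


def qualified_name(family: str, column: str) -> str:
    # New behavior (getColumnName-like): a non-default family is kept,
    # producing "family.column" in the DataFrame schema.
    if family == DEFAULT_FAMILY:
        return column
    return f"{family}.{column}"


print(display_name("CF1", "COL1"))    # COL1
print(qualified_name("CF1", "COL1"))  # CF1.COL1
print(qualified_name("0", "COL1"))    # COL1
```

A job that previously selected `df.select("COL1")` now has to select `df.select("CF1.COL1")`, which is why existing jobs break.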

The pull request is related to PHOENIX-5059; the change is not documented.

This change breaks jobs that read from tables with a non-default column family.

The spark3 connector has the same issue, since the code was duplicated from 
the spark2 module to the spark3 module.

Since the V1 API was also modified to use the same method to resolve the 
schema, it has the same behavior, which it should not: those classes are now 
deprecated and should not contain a breaking change.

 

*Resolution proposal:*

The best way to fix the issue is to add a flag that re-enables the original 
schema mapping.
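A minimal sketch of what such a flag could look like (everything here is hypothetical; the flag name and function are invented for illustration and are not actual connector options):

```python
# Hypothetical sketch of the proposed flag: when the legacy mapping is
# enabled, fall back to the pre-DataSource-V2 behavior and never include
# the column family prefix in the DataFrame column name.

DEFAULT_FAMILY = "0"  # Phoenix's default column family


def dataframe_column_name(family: str, column: str,
                          use_legacy_mapping: bool = False) -> str:
    # Legacy mapping (or default family): plain column name.
    if use_legacy_mapping or family == DEFAULT_FAMILY:
        return column
    # Current V2 behavior: family-qualified name.
    return f"{family}.{column}"


print(dataframe_column_name("CF1", "COL1"))                           # CF1.COL1
print(dataframe_column_name("CF1", "COL1", use_legacy_mapping=True))  # COL1
```

The flag would default to the current behavior, so existing V2 users are unaffected, while broken jobs can opt back into the old mapping.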

The issue is limited to the Spark connector, and its resolution will not have 
side effects on other phoenix-connectors such as phoenix5-hive.

  was:
Upgrade of the spark connector to use Datasource v2 api made a major change in 
the way the schema is inferred.

In previous versions of the connector


> phoenix5-spark dataframe issue with schema inference
> ----------------------------------------------------
>
>                 Key: PHOENIX-7377
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-7377
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: rejeb ben rejeb
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
