[ https://issues.apache.org/jira/browse/PHOENIX-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15996911#comment-15996911 ]

Josh Mahonin commented on PHOENIX-3814:
---------------------------------------

I've managed to spend a bit of time looking at this, and it does seem that the 
solution to PHOENIX-3721 should solve this issue as well, though I'm fine with 
keeping this one open until that one is closed.

Re: SaveMode behaviour, these are good starting points to look at:

First point of contact from Spark into Phoenix when saving a DataFrame:
https://github.com/apache/phoenix/blob/5b099014446865c12779f3882fd8b407496717ea/phoenix-spark/src/main/scala/org/apache/phoenix/spark/DefaultSource.scala#L40-L47
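For illustration, the entry point's SaveMode handling can be sketched like this (a minimal, collection-free sketch with assumed names, not copied from the source; phoenix-spark has historically accepted only SaveMode.Overwrite for saves and rejected everything else up front):

```scala
// Illustrative stand-in for Spark's SaveMode enum, so the sketch is
// self-contained without a Spark dependency.
sealed trait Mode
case object Append extends Mode
case object Overwrite extends Mode
case object ErrorIfExists extends Mode
case object Ignore extends Mode

// Sketch of the gate at the top of DefaultSource.createRelation:
// anything other than Overwrite is rejected before any save happens.
def verifyMode(mode: Mode): Unit =
  if (mode != Overwrite)
    throw new IllegalArgumentException(
      "SaveMode other than SaveMode.Overwrite is not supported")
```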

The code that unwraps the DataFrame into an RDD, then uses the Phoenix 
MapReduce-style classes to effect a distributed save() through Spark:
https://github.com/apache/phoenix/blob/5b099014446865c12779f3882fd8b407496717ea/phoenix-spark/src/main/scala/org/apache/phoenix/spark/DataFrameFunctions.scala#L31-L65
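The per-row unwrapping step can be sketched roughly as follows (a collection-based sketch with illustrative names: the real code maps Spark Rows to Phoenix writable records per partition; here a plain Map stands in for a Row):

```scala
// Sketch of the DataFrame-to-record step in saveToPhoenix: take the
// column names from the schema, then turn each row into a column-ordered
// record of values, ready to hand to the MapReduce-style output format.
def toRecords(rows: Seq[Map[String, Any]],
              columns: Seq[String]): Seq[Seq[Any]] =
  rows.map(row => columns.map(col => row(col)))
```

The column ordering matters because each record ultimately becomes the parameter list of an UPSERT statement.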

Relevant unit tests:
https://github.com/apache/phoenix/blob/5b099014446865c12779f3882fd8b407496717ea/phoenix-spark/src/it/scala/org/apache/phoenix/spark/PhoenixSparkIT.scala#L336-L342

According to PHOENIX-2745, the current 'Overwrite' behaves more like 'Append' 
should. I'm not entirely sure that's accurate, since the definitions in the 
Spark docs say this:

{quote}
Append mode means that when saving a DataFrame to a data source, if data/table 
already exists, contents of the DataFrame are expected to be appended to 
existing data.
Overwrite mode means that when saving a DataFrame to a data source, if 
data/table already exists, existing data is expected to be overwritten by the 
contents of the DataFrame.
{quote}
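The two contracts quoted above can be modelled on an in-memory "table" of rows (purely illustrative; a real data source operates on persisted data):

```scala
// Append: old rows are kept, the DataFrame's rows are added to them.
def appendMode(existing: Seq[String], incoming: Seq[String]): Seq[String] =
  existing ++ incoming

// Overwrite: old rows are replaced entirely by the DataFrame's contents.
def overwriteMode(existing: Seq[String], incoming: Seq[String]): Seq[String] =
  incoming
```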

Neither definition implies dropping or recreating the table. Also, under the 
hood, each row of the DataFrame is turned into a Phoenix UPSERT statement, 
which by definition does both:
bq. Upsert: Inserts if not present and updates otherwise the value in the table
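That dual behaviour can be sketched with an in-memory map keyed by primary key (illustrative only): a single operation inserts when the key is absent and updates when it is present, which is why UPSERT straddles both SaveMode definitions.

```scala
// UPSERT semantics on a map: insert-if-absent, update-if-present,
// in one operation. The real thing operates on a Phoenix table, of course.
def upsert(table: Map[Int, String], key: Int, value: String): Map[Int, String] =
  table + (key -> value)
```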

However, if the current behaviour is counter to how other data sources 
operate, I'm open to suggestions; let's continue this conversation over on 
PHOENIX-2745.

> Unable to connect to Phoenix via Spark
> --------------------------------------
>
>                 Key: PHOENIX-3814
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-3814
>             Project: Phoenix
>          Issue Type: Bug
>    Affects Versions: 4.10.0
>         Environment: Ubuntu 16.04.1, Apache Spark 2.1.0, Hbase 1.2.5, Phoenix 
> 4.10.0
>            Reporter: Wajid Khattak
>
> Please see 
> http://stackoverflow.com/questions/43640864/apache-phoenix-for-spark-not-working



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
