[
https://issues.apache.org/jira/browse/PHOENIX-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15996911#comment-15996911
]
Josh Mahonin commented on PHOENIX-3814:
---------------------------------------
I've managed to spend a bit of time looking at this, and it does seem the
solution to PHOENIX-3721 should solve this issue as well, though I'm fine with
keeping this one open until PHOENIX-3721 is closed.
Re: SaveMode behaviour, these are good starting points to look at:
First point of contact from Spark into Phoenix when saving a DataFrame:
https://github.com/apache/phoenix/blob/5b099014446865c12779f3882fd8b407496717ea/phoenix-spark/src/main/scala/org/apache/phoenix/spark/DefaultSource.scala#L40-L47
The code that unwraps the DataFrame into an RDD, then uses the Phoenix
MapReduce-style classes to effect a distributed save() through Spark:
https://github.com/apache/phoenix/blob/5b099014446865c12779f3882fd8b407496717ea/phoenix-spark/src/main/scala/org/apache/phoenix/spark/DataFrameFunctions.scala#L31-L65
Relevant unit tests:
https://github.com/apache/phoenix/blob/5b099014446865c12779f3882fd8b407496717ea/phoenix-spark/src/it/scala/org/apache/phoenix/spark/PhoenixSparkIT.scala#L336-L342
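The DefaultSource link above is where the SaveMode check happens when Spark hands the DataFrame off to Phoenix. As a rough, pure-Scala sketch of that dispatch (SaveModeLike, tableExists, and the string results are illustrative stand-ins here, not the actual Phoenix or Spark types):

```scala
// Illustrative stand-in for Spark's SaveMode enum, not the real class.
sealed trait SaveModeLike
case object Append extends SaveModeLike
case object Overwrite extends SaveModeLike
case object ErrorIfExists extends SaveModeLike

// Hypothetical dispatch: fail fast on ErrorIfExists when the table is
// already there; otherwise funnel into the single upsert-based save path.
def dispatch(mode: SaveModeLike, tableExists: Boolean): String = mode match {
  case ErrorIfExists if tableExists => "fail"
  case _                            => "saveToPhoenix"
}
```

The key structural point is that Append and Overwrite end up on the same upsert-based save path, which is why their observable behaviour is so similar.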
According to PHOENIX-2745, the current 'Overwrite' acts more like 'Append'
should. I'm not entirely sure that's accurate, since the definitions in the
Spark docs say this:
{quote}
Append mode means that when saving a DataFrame to a data source, if data/table
already exists, contents of the DataFrame are expected to be appended to
existing data.
Overwrite mode means that when saving a DataFrame to a data source, if
data/table already exists, existing data is expected to be overwritten by the
contents of the DataFrame.
{quote}
Neither definition implies dropping or recreating the table. Also, under the
hood, each row of the DataFrame is turned into a Phoenix UPSERT statement,
which by definition both inserts and updates:
bq. Upsert: Inserts if not present and updates otherwise the value in the table
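To make that concrete, here is a minimal pure-Scala sketch (the Map-based table, upsertAll, and trueOverwrite are illustrative stand-ins, not Phoenix code) of why an upsert-per-row 'Overwrite' merges into the existing data rather than replacing it:

```scala
// An in-memory "table" keyed by primary key; each DataFrame row becomes
// an insert-or-update, mirroring one UPSERT statement per row.
def upsertAll(table: Map[Int, String], rows: Seq[(Int, String)]): Map[Int, String] =
  rows.foldLeft(table) { case (t, (pk, v)) => t.updated(pk, v) }

// A drop/recreate-style overwrite would replace the table's contents entirely.
def trueOverwrite(rows: Seq[(Int, String)]): Map[Int, String] = rows.toMap

val existing = Map(1 -> "a", 2 -> "b")
val df       = Seq(2 -> "B", 3 -> "C")

upsertAll(existing, df)  // Map(1 -> "a", 2 -> "B", 3 -> "C"): row 1 survives
trueOverwrite(df)        // Map(2 -> "B", 3 -> "C"): row 1 is gone
```

The pre-existing row with key 1 survives the upsert-based save, which is why 'Overwrite' here does not behave like a truncate-then-insert.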
However, if the current behaviour runs counter to how other data sources
operate, I'm open to suggestions. I suggest continuing this conversation over
on PHOENIX-2745.
> Unable to connect to Phoenix via Spark
> --------------------------------------
>
> Key: PHOENIX-3814
> URL: https://issues.apache.org/jira/browse/PHOENIX-3814
> Project: Phoenix
> Issue Type: Bug
> Affects Versions: 4.10.0
> Environment: Ubuntu 16.04.1, Apache Spark 2.1.0, Hbase 1.2.5, Phoenix
> 4.10.0
> Reporter: Wajid Khattak
>
> Please see
> http://stackoverflow.com/questions/43640864/apache-phoenix-for-spark-not-working
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)