stoty commented on code in PR #92:
URL: https://github.com/apache/phoenix-connectors/pull/92#discussion_r1084392227

##########
phoenix5-spark3/README.md:
##########
@@ -15,18 +15,59 @@ See the License for the specific language governing permissions and
 limitations under the License.
 -->

-phoenix-spark extends Phoenix's MapReduce support to allow Spark to load Phoenix tables as DataFrames,
-and enables persisting DataFrames back to Phoenix.
+# Phoenix5-Spark3 Connector

-## Configuring Spark to use the connector
+The phoenix5-spark3 plugin extends Phoenix's MapReduce support to allow Spark

Review Comment:
   I realize that we use "Plugin" on the website, but we should standardize on "Connector".

##########
phoenix5-spark3/README.md:
##########
@@ -15,18 +15,59 @@ See the License for the specific language governing permissions and
+In contrast, the phoenix-spark integration is able to leverage the underlying
+ splits provided by Phoenix in order to retrieve and save data across multiple
+ workers. All that’s required is a database URL and a table name.

Review Comment:
   use "select statement" instead of "table name"?

##########
phoenix5-spark3/README.md:
##########
@@ -15,18 +15,59 @@ See the License for the specific language governing permissions and
 limitations under the License.
 -->

-phoenix-spark extends Phoenix's MapReduce support to allow Spark to load Phoenix tables as DataFrames,
-and enables persisting DataFrames back to Phoenix.
+# Phoenix5-Spark3 Connector

-## Configuring Spark to use the connector
+The phoenix5-spark3 plugin extends Phoenix's MapReduce support to allow Spark
+ to load Phoenix tables as DataFrames,
+ and enables persisting DataFrames back to Phoenix.

-Use the shaded connector JAR `phoenix5-spark3-shaded-6.0.0-SNAPSHOT.jar` .
-Apart from the shaded connector JAR, you also need to add the hbase mapredcp libraries and the hbase configuration directory to the classpath. The final classpath should be something like
+## Pre-Requisites

-`/etc/hbase/conf:$(hbase mapredcp):phoenix5-spark3-shaded-6.0.0-SNAPSHOT.jar`
+* Phoenix 5.1.2+
+* Spark 3.0.3+

-(add the exact paths as appropiate to your system)
-Both the `spark.driver.extraClassPath` and `spark.executor.extraClassPath` properties need to be set the above classpath. You may add them spark-defaults.conf, or specify them on the spark-shell or spark-submit command line.
+## Why not JDBC
+
+Although Spark supports connecting directly to JDBC databases,
+ It’s only able to parallelize queries by partioning on a numeric column.
+ It also requires a known lower bound,
+ upper bound and partition count in order to create split queries.
+
+In contrast, the phoenix-spark integration is able to leverage the underlying
+ splits provided by Phoenix in order to retrieve and save data across multiple
+ workers. All that’s required is a database URL and a table name.
+ Optional SELECT columns can be given,
+ as well as pushdown predicates for efficient filtering.
+
+The choice of which method to use to access
+ Phoenix comes down to each specific use case.
+
+## Setup
+
+To setup connector add `phoenix5-spark3-shaded` JAR as

Review Comment:
   In most cases, you don't want to add the connector to the Maven/compile classpath; it tends to cause conflicts when upgrading. We should move this to the end of the section and add the caveat that it is only needed for the deprecated usages.
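
For illustration, the runtime-classpath alternative the reviewer is alluding to (and which the old README text quoted above describes) might look roughly like the sketch below in `spark-defaults.conf`; the paths, the `hbase mapredcp` output, and the shaded JAR location are placeholders, not values taken from the PR:

```properties
# Sketch only: substitute the real output of `hbase mapredcp` and the real JAR path for your system.
spark.driver.extraClassPath=/etc/hbase/conf:<output of hbase mapredcp>:/path/to/phoenix5-spark3-shaded-6.0.0-SNAPSHOT.jar
spark.executor.extraClassPath=/etc/hbase/conf:<output of hbase mapredcp>:/path/to/phoenix5-spark3-shaded-6.0.0-SNAPSHOT.jar
```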
##########
phoenix5-spark3/README.md:
##########
@@ -39,7 +80,9 @@ UPSERT INTO TABLE1 (ID, COL1) VALUES (2, 'test_row_2');
 ```

 ### Load as a DataFrame using the DataSourceV2 API
+
 Scala example:
+

Review Comment:
   I know you didn't touch that part, but do we still need the SparkContext import?

##########
phoenix5-spark3/README.md:
##########
@@ -39,7 +80,9 @@ UPSERT INTO TABLE1 (ID, COL1) VALUES (2, 'test_row_2');
 ### Load as a DataFrame using the DataSourceV2 API
+
 Scala example:
+

Review Comment:
   Maybe add comments to make it obvious that you need to use a real ZK quorum, like `//replace "phoenix-server:2181" with the real ZK quorum`
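
To illustrate both suggestions, a minimal sketch of the Scala example could drop the SparkContext import entirely and carry the quorum comment inline. The option names `"table"` and `"zkUrl"` are assumed from the phoenix-spark documentation rather than taken from this diff:

```scala
// Minimal sketch only; assumes the shaded connector JAR and the HBase
// configuration directory are already on the Spark classpath.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("phoenix-read").getOrCreate()

val df = spark.read
  .format("phoenix")
  .option("table", "TABLE1")
  .option("zkUrl", "phoenix-server:2181") // replace "phoenix-server:2181" with the real ZK quorum
  .load()

df.show()
```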
##########
phoenix5-spark3/README.md:
##########
@@ -15,18 +15,59 @@ See the License for the specific language governing permissions and
+ workers. All that’s required is a database URL and a table name.
+ Optional SELECT columns can be given,
+ as well as pushdown predicates for efficient filtering.

Review Comment:
   This sounds like you need to specify the pushdown predicates. Can you rephrase so that it's apparent that pushdown is automatic?
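
As a concrete illustration of the automatic behaviour the reviewer is describing, a hedged sketch (reusing the `TABLE1` columns from the hunk above and the hypothetical `df` from the earlier read example) might be:

```scala
// No pushdown configuration is specified anywhere: the connector is expected to
// translate this column pruning and filter into the Phoenix query on its own.
val filtered = df
  .select("ID", "COL1")
  .filter(df("ID") > 1L)

filtered.explain(true) // the pushed-down filter should be visible in the physical plan
filtered.show()
```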
##########
phoenix5-spark3/README.md:
##########
@@ -15,18 +15,59 @@ See the License for the specific language governing permissions and
+The choice of which method to use to access
+ Phoenix comes down to each specific use case.

Review Comment:
   nit: This is super important, and we should have much more on this (though not necessarily in this ticket).

##########
phoenix5-spark3/README.md:
##########
@@ -15,18 +15,59 @@ See the License for the specific language governing permissions and
+The choice of which method to use to access
+ Phoenix comes down to each specific use case.
+
+## Setup

Review Comment:
   Nit: this assumes that Phoenix and HBase/Spark are both present and configured on the same nodes. Maybe worth mentioning it?

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@phoenix.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org