stoty commented on code in PR #92:
URL: https://github.com/apache/phoenix-connectors/pull/92#discussion_r1084392227

##########
phoenix5-spark3/README.md:
##########
@@ -15,18 +15,59 @@ See the License for the specific language governing permissions and
 limitations under the License.
 -->

-phoenix-spark extends Phoenix's MapReduce support to allow Spark to load Phoenix tables as DataFrames,
-and enables persisting DataFrames back to Phoenix.
+# Phoenix5-Spark3 Connector

-## Configuring Spark to use the connector
+The phoenix5-spark3 plugin extends Phoenix's MapReduce support to allow Spark

Review Comment:
   I realize that we use "Plugin" on the website, but we should standardize on "Connector".

##########
phoenix5-spark3/README.md:
##########
@@ -15,18 +15,59 @@ See the License for the specific language governing permissions and
+In contrast, the phoenix-spark integration is able to leverage the underlying
+ splits provided by Phoenix in order to retrieve and save data across multiple
+ workers. All that’s required is a database URL and a table name.

Review Comment:
   use "select statement" instead of "table name"?

##########
phoenix5-spark3/README.md:
##########
@@ -15,18 +15,59 @@ See the License for the specific language governing permissions and
 limitations under the License.
 -->

-phoenix-spark extends Phoenix's MapReduce support to allow Spark to load Phoenix tables as DataFrames,
-and enables persisting DataFrames back to Phoenix.
+# Phoenix5-Spark3 Connector

-## Configuring Spark to use the connector
+The phoenix5-spark3 plugin extends Phoenix's MapReduce support to allow Spark
+ to load Phoenix tables as DataFrames,
+ and enables persisting DataFrames back to Phoenix.

-Use the shaded connector JAR `phoenix5-spark3-shaded-6.0.0-SNAPSHOT.jar` .
-Apart from the shaded connector JAR, you also need to add the hbase mapredcp libraries and the hbase configuration directory to the classpath. The final classpath should be something like
+## Pre-Requisites

-`/etc/hbase/conf:$(hbase mapredcp):phoenix5-spark3-shaded-6.0.0-SNAPSHOT.jar`
+* Phoenix 5.1.2+
+* Spark 3.0.3+

-(add the exact paths as appropiate to your system)
-Both the `spark.driver.extraClassPath` and `spark.executor.extraClassPath` properties need to be set the above classpath. You may add them spark-defaults.conf, or specify them on the spark-shell or spark-submit command line.
+## Why not JDBC
+
+Although Spark supports connecting directly to JDBC databases,
+ It’s only able to parallelize queries by partioning on a numeric column.
+ It also requires a known lower bound,
+ upper bound and partition count in order to create split queries.
+
+In contrast, the phoenix-spark integration is able to leverage the underlying
+ splits provided by Phoenix in order to retrieve and save data across multiple
+ workers. All that’s required is a database URL and a table name.
+ Optional SELECT columns can be given,
+ as well as pushdown predicates for efficient filtering.
+
+The choice of which method to use to access
+ Phoenix comes down to each specific use case.
+
+## Setup
+
+To setup connector add `phoenix5-spark3-shaded` JAR as

Review Comment:
   In most cases, you don't want to add the connector to the Maven/compile classpath; it tends to cause conflicts when upgrading. We should move this to the end of the section and add the caveat that it is only needed for the deprecated usages.
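
For illustration, the runtime-classpath alternative the reviewer is alluding to (and which the old README text quoted above describes) might look roughly like the sketch below in `spark-defaults.conf`; the paths, the `hbase mapredcp` output, and the shaded JAR location are placeholders, not values taken from the PR:

```properties
# Sketch only: substitute the real output of `hbase mapredcp` and the real JAR path for your system.
spark.driver.extraClassPath=/etc/hbase/conf:<output of hbase mapredcp>:/path/to/phoenix5-spark3-shaded-6.0.0-SNAPSHOT.jar
spark.executor.extraClassPath=/etc/hbase/conf:<output of hbase mapredcp>:/path/to/phoenix5-spark3-shaded-6.0.0-SNAPSHOT.jar
```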
##########
phoenix5-spark3/README.md:
##########
@@ -39,7 +80,9 @@ UPSERT INTO TABLE1 (ID, COL1) VALUES (2, 'test_row_2');
 ```

 ### Load as a DataFrame using the DataSourceV2 API
+
 Scala example:
+

Review Comment:
   I know you didn't touch that part, but do we still need the SparkContext import?

##########
phoenix5-spark3/README.md:
##########
@@ -39,7 +80,9 @@ UPSERT INTO TABLE1 (ID, COL1) VALUES (2, 'test_row_2');
 ### Load as a DataFrame using the DataSourceV2 API
+
 Scala example:
+

Review Comment:
   Maybe add comments to make it obvious that you need to use a real ZK quorum, like `//replace "phoenix-server:2181" with the real ZK quorum`
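
To illustrate both suggestions, a minimal sketch of the Scala example could drop the SparkContext import entirely and carry the quorum comment inline. The option names `"table"` and `"zkUrl"` are assumed from the phoenix-spark documentation rather than taken from this diff:

```scala
// Minimal sketch only; assumes the shaded connector JAR and the HBase
// configuration directory are already on the Spark classpath.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("phoenix-read").getOrCreate()

val df = spark.read
  .format("phoenix")
  .option("table", "TABLE1")
  .option("zkUrl", "phoenix-server:2181") // replace "phoenix-server:2181" with the real ZK quorum
  .load()

df.show()
```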
##########
phoenix5-spark3/README.md:
##########
@@ -15,18 +15,59 @@ See the License for the specific language governing permissions and
+ workers. All that’s required is a database URL and a table name.
+ Optional SELECT columns can be given,
+ as well as pushdown predicates for efficient filtering.

Review Comment:
   This sounds like you need to specify the pushdown predicates. Can you rephrase so that it's apparent that pushdown is automatic?
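
As a concrete illustration of the automatic behaviour the reviewer is describing, a hedged sketch (reusing the `TABLE1` columns from the hunk above and the hypothetical `df` from the earlier read example) might be:

```scala
// No pushdown configuration is specified anywhere: the connector is expected to
// translate this column pruning and filter into the Phoenix query on its own.
val filtered = df
  .select("ID", "COL1")
  .filter(df("ID") > 1L)

filtered.explain(true) // the pushed-down filter should be visible in the physical plan
filtered.show()
```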
##########
phoenix5-spark3/README.md:
##########
@@ -15,18 +15,59 @@ See the License for the specific language governing permissions and
+The choice of which method to use to access
+ Phoenix comes down to each specific use case.

Review Comment:
   nit: This is super important, and we should have much more on this (though not necessarily in this ticket).

##########
phoenix5-spark3/README.md:
##########
@@ -15,18 +15,59 @@ See the License for the specific language governing permissions and
+The choice of which method to use to access
+ Phoenix comes down to each specific use case.
+
+## Setup

Review Comment:
   Nit: this assumes that Phoenix and HBase/Spark are both present and configured on the same nodes. Maybe worth mentioning it?

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@phoenix.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org