Re: Using pyspark connect PQS with sqlContext

2024-01-01 Thread Istvan Toth
Yes, if you add the thin client JAR to Spark, then you should be able to
use it like any other generic JDBC data source.
The ZK port is not relevant to the thin client.
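For example, a minimal sketch (the host name is a placeholder; PQS listens
on port 8765 with PROTOBUF serialization by default, adjust as needed):

df = sqlContext.read \
  .format("jdbc") \
  .option("driver", "org.apache.phoenix.queryserver.client.Driver") \
  .option("url", "jdbc:phoenix:thin:url=http://pqs-host:8765;serialization=PROTOBUF") \
  .option("dbtable", "TABLE1") \
  .load()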

Istvan

On Mon, Dec 25, 2023 at 10:06 AM Cong Luo  wrote:

>
> Thanks, Istvan.
>
> Is it possible to connect to PQS the same way we would use a pyspark
> context to connect to MySQL? Some of the scenarios: 1. the ZK port cannot
> be provided. 2. there is no need to read Phoenix in parallel, just read it
> as an input to the pyspark context.
>
> On 2023/12/23 21:10:24 Istvan Toth wrote:
> > You can't.
> > Thin client can only be used as a generic JDBC data source in Spark.
> >
> > The point of the connector is improving performance by spreading out the
> > query with the Spark/MR integration, but the thin client only talks to
> the
> > pqs server, and cannot access the cluster otherwise.
> >
> > Istvan
> >
> >
> > On Fri, Dec 22, 2023 at 4:58 AM luoc  wrote:
> >
> > > Hi all,
> > >
> > > How can I connect to PQS with sqlContext using pyspark?
> > >
> > > # fat client
> > > df = sqlContext.read \
> > >   .format("org.apache.phoenix.spark") \
> > >   .option("table", "TABLE1") \
> > >   .option("zkUrl", "localhost:2181") \
> > >   .load()
> > >
> > > How to do this using the thin client?
> > >
> >
> >
>


-- 
*István Tóth* | Sr. Staff Software Engineer
*Email*: st...@cloudera.com
cloudera.com


[jira] [Commented] (OMID-254) Upgrade to phoenix-thirdparty 2.1.0

2024-01-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OMID-254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17801677#comment-17801677
 ] 

ASF GitHub Bot commented on OMID-254:
-

stoty commented on PR #151:
URL: https://github.com/apache/phoenix-omid/pull/151#issuecomment-1873673603

   Have you run the full Phoenix test suite with both Phoenix and Omid built 
with the new thirdparty, @NihalJain?




> Upgrade to phoenix-thirdparty 2.1.0
> ---
>
> Key: OMID-254
> URL: https://issues.apache.org/jira/browse/OMID-254
> Project: Phoenix Omid
>  Issue Type: Sub-task
>Reporter: Nihal Jain
>Assignee: Nihal Jain
>Priority: Major
>
> Phoenix-thirdparty has been released, see 
> [https://www.mail-archive.com/user@phoenix.apache.org/msg08204.html]
> {quote}The recent release has upgraded Guava to version 32.1.3-jre from the 
> previous 31.0.1-android version. Initially, the 4.x branch maintained 
> compatibility with Java 7, necessitating the use of the Android variant of 
> Guava. However, with the end-of-life (EOL) status of the 4.x branch, the move 
> to the standard JRE version of Guava signifies a shift in compatibility 
> standards
> {quote}
> It's time we bump up.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (PHOENIX-7163) Do Not Dependency Manage commons-configuration2 Version

2024-01-01 Thread Istvan Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-7163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Istvan Toth reassigned PHOENIX-7163:


Assignee: Istvan Toth

> Do Not Dependency Manage commons-configuration2 Version
> ---
>
> Key: PHOENIX-7163
> URL: https://issues.apache.org/jira/browse/PHOENIX-7163
> Project: Phoenix
>  Issue Type: Bug
>  Components: core
>Affects Versions: 5.2.0, 5.1.4
>Reporter: Istvan Toth
>Assignee: Istvan Toth
>Priority: Major
>
> We are using commons-configuration2 for the Hadoop metrics code, because 
> that Hadoop API is badly broken.
> Because of this, I have added dependency management for that dependency.
> We are setting an old version, which is known to have CVEs.
> Remove the dependency management so that we can pick up any possible future 
> fixes from Hadoop instead.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PHOENIX-7163) Do Not Dependency Manage commons-configuration2 Version

2024-01-01 Thread Istvan Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-7163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Istvan Toth updated PHOENIX-7163:
-
Summary: Do Not Dependency Manage commons-configuration2 Version  (was: Do 
not dependency manage commons-configuration2 version)

> Do Not Dependency Manage commons-configuration2 Version
> ---
>
> Key: PHOENIX-7163
> URL: https://issues.apache.org/jira/browse/PHOENIX-7163
> Project: Phoenix
>  Issue Type: Bug
>  Components: core
>Affects Versions: 5.2.0, 5.1.4
>Reporter: Istvan Toth
>Priority: Major
>
> We are using commons-configuration2 for the Hadoop metrics code, because 
> that Hadoop API is badly broken.
> Because of this, I have added dependency management for that dependency.
> We are setting an old version, which is known to have CVEs.
> Remove the dependency management so that we can pick up any possible future 
> fixes from Hadoop instead.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (PHOENIX-7163) Do not dependency manage commons-configuration2 version

2024-01-01 Thread Istvan Toth (Jira)
Istvan Toth created PHOENIX-7163:


 Summary: Do not dependency manage commons-configuration2 version
 Key: PHOENIX-7163
 URL: https://issues.apache.org/jira/browse/PHOENIX-7163
 Project: Phoenix
  Issue Type: Bug
  Components: core
Affects Versions: 5.2.0, 5.1.4
Reporter: Istvan Toth


We are using commons-configuration2 for the Hadoop metrics code, because that 
Hadoop API is badly broken.

Because of this, I have added dependency management for that dependency.

We are setting an old version, which is known to have CVEs.

Remove the dependency management so that we can pick up any possible future 
fixes from Hadoop instead.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OMID-254) Upgrade to phoenix-thirdparty 2.1.0

2024-01-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OMID-254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17801686#comment-17801686
 ] 

ASF GitHub Bot commented on OMID-254:
-

stoty commented on PR #151:
URL: https://github.com/apache/phoenix-omid/pull/151#issuecomment-1873692027

   No need; if you have tested with thirdparty-2.1 in both Omid and Phoenix at 
the same time, that's fine.
   




> Upgrade to phoenix-thirdparty 2.1.0
> ---
>
> Key: OMID-254
> URL: https://issues.apache.org/jira/browse/OMID-254
> Project: Phoenix Omid
>  Issue Type: Sub-task
>Reporter: Nihal Jain
>Assignee: Nihal Jain
>Priority: Major
>
> Phoenix-thirdparty has been released, see 
> [https://www.mail-archive.com/user@phoenix.apache.org/msg08204.html]
> {quote}The recent release has upgraded Guava to version 32.1.3-jre from the 
> previous 31.0.1-android version. Initially, the 4.x branch maintained 
> compatibility with Java 7, necessitating the use of the Android variant of 
> Guava. However, with the end-of-life (EOL) status of the 4.x branch, the move 
> to the standard JRE version of Guava signifies a shift in compatibility 
> standards
> {quote}
> It's time we bump up.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (OMID-254) Upgrade to phoenix-thirdparty 2.1.0

2024-01-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/OMID-254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17801681#comment-17801681
 ] 

ASF GitHub Bot commented on OMID-254:
-

NihalJain commented on PR #151:
URL: https://github.com/apache/phoenix-omid/pull/151#issuecomment-1873681962

   > Have you run the full Phoenix test suite with both Phoenix and Omid built 
with the new thirdparty, @NihalJain?
   
   Hi @stoty, I have run the following for Omid:
   
   > Built code locally and ran tests:
   > 
   > ```
   > mvn clean install -Dhbase.version=2.5.6-hadoop3 -DskipTests
   > mvn verify -Dsurefire.rerunFailingTestsCount=5 
-Dhbase.version=2.5.6-hadoop3
   > ```
   
   I also ran tests for Phoenix with 
https://github.com/apache/phoenix-thirdparty/pull/8#issuecomment-1832165125




> Upgrade to phoenix-thirdparty 2.1.0
> ---
>
> Key: OMID-254
> URL: https://issues.apache.org/jira/browse/OMID-254
> Project: Phoenix Omid
>  Issue Type: Sub-task
>Reporter: Nihal Jain
>Assignee: Nihal Jain
>Priority: Major
>
> Phoenix-thirdparty has been released, see 
> [https://www.mail-archive.com/user@phoenix.apache.org/msg08204.html]
> {quote}The recent release has upgraded Guava to version 32.1.3-jre from the 
> previous 31.0.1-android version. Initially, the 4.x branch maintained 
> compatibility with Java 7, necessitating the use of the Android variant of 
> Guava. However, with the end-of-life (EOL) status of the 4.x branch, the move 
> to the standard JRE version of Guava signifies a shift in compatibility 
> standards
> {quote}
> It's time we bump up.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Reopened] (PHOENIX-7001) Change Data Capture leveraging Max Lookback and Uncovered Indexes

2024-01-01 Thread Hari Krishna Dara (Jira)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-7001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Krishna Dara reopened PHOENIX-7001:


The wrong item was resolved.

> Change Data Capture leveraging Max Lookback and Uncovered Indexes
> -
>
> Key: PHOENIX-7001
> URL: https://issues.apache.org/jira/browse/PHOENIX-7001
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Kadir Ozdemir
>Priority: Major
>
> The use cases for a Change Data Capture (CDC) feature are centered around 
> capturing changes to a given table (or updatable view) as these changes 
> happen in near real-time. A CDC application can retrieve changes in real-time 
> or with some delay, or even retrieve the same set of changes multiple times. 
> This means the CDC use case can be generalized as time range queries where 
> the time range is typically short, such as the last x minutes or hours, or 
> expressed as a specific time range in the last n days, where n is typically 
> less than 7.
> A change is an update in a row. That is, a change is either updating one or 
> more columns of a table for a given row or deleting a row. It is desirable to 
> provide these changes in the order of their arrival. One can visualize the 
> delivery of these changes through a stream from a Phoenix table to the 
> application, initiated by the application, similar to the delivery of 
> any other Phoenix query results. The difference is that a regular query 
> result includes at most one result row for each row satisfying the query and 
> the deleted rows are not visible to the query result while the CDC 
> stream/result can include multiple result rows for each row and the result 
> includes deleted rows. Some use cases need to also get the pre and/or post 
> image of the row along with a change on the row. 
> The design proposed here leverages Phoenix Max Lookback and Uncovered (Global 
> or Local) Indexes. The max lookback feature retains recent changes to a 
> table, that is, the changes that have been done in the last x days typically. 
> This means that the max lookback feature already captures the changes to a 
> given table. Currently, the max lookback age is configurable at the cluster 
> level. We need to extend this capability to be able to configure the max 
> lookback age at the table level so that each table can have a different max 
> lookback age based on its CDC application requirements.
> To deliver the changes in the order of their arrival, we need a time based 
> index. This index should be uncovered as the changes are already retained in 
> the table by the max lookback feature. The arrival time can be defined as the 
> mutation timestamp generated by the server, or a user-specified timestamp (or 
> any other long integer) column. An uncovered index would allow us to access 
> the changes efficiently and in order. Changes to an index table are 
> also preserved by the max lookback feature.
> A CDC feature can be composed of the following components:
>  * {*}CDCUncoveredIndexRegionScanner{*}: This is a server side scanner on an 
> uncovered index used for CDC. This can inherit UncoveredIndexRegionScanner. 
> It goes through index table rows using a raw scan to identify data table rows 
> and retrieves these rows using a raw scan. Using the time range, it forms a 
> JSON blob to represent changes to the row including pre and/or post row 
> images.
>  * {*}CDC Query Compiler{*}: This is a client side component. It prepares the 
> scan object based on the given CDC query statement. 
>  * {*}CDC DDL Compiler{*}: This is a client side component. It creates the 
> time based uncovered (global/local) index based on the given CDC DDL 
> statement and a virtual table of CDC type. CDC will be a new table type. 
> A CDC DDL syntax to create CDC on a (data) table can be as follows: 
> Create CDC <CDC name> on <table name> (PHOENIX_ROW_TIMESTAMP() | 
> <timestamp column>) INCLUDE (pre | post | latest | all) TTL = <time to live in 
> seconds> INDEX = <global | local> SALT_BUCKETS=<n>
> The above CDC DDL creates a virtual CDC table and an uncovered index. The CDC 
> table PK columns start with the timestamp or user defined column and continue 
> with the data table PK columns. The CDC table includes one non-PK column 
> which is a JSON column. The change is expressed in this JSON column in 
> multiple ways based on the CDC DDL or query statement. The change can be 
> expressed as just the mutation for the change, the latest image of the row, 
> the pre image of the row (the image before the change), the post image, or 
> any combination of these. The CDC table is not a physical table on disk. It 
> is just a virtual table to be used in a CDC query. Phoenix stores just the 
> metadata for this virtual table. 
> A CDC query can be as follows: 
> Select * from <CDC table name> where PHOENIX_ROW_TIMESTAMP() >= TO_DATE( …) 
> AND PHOENIX_ROW_TIMESTAMP() < TO_DATE( …)

[jira] [Resolved] (PHOENIX-7001) Change Data Capture leveraging Max Lookback and Uncovered Indexes

2024-01-01 Thread Hari Krishna Dara (Jira)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-7001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Krishna Dara resolved PHOENIX-7001.

Resolution: Fixed

Change merged into the feature branch.

> Change Data Capture leveraging Max Lookback and Uncovered Indexes
> -
>
> Key: PHOENIX-7001
> URL: https://issues.apache.org/jira/browse/PHOENIX-7001
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Kadir Ozdemir
>Priority: Major
>
> The use cases for a Change Data Capture (CDC) feature are centered around 
> capturing changes to a given table (or updatable view) as these changes 
> happen in near real-time. A CDC application can retrieve changes in real-time 
> or with some delay, or even retrieve the same set of changes multiple times. 
> This means the CDC use case can be generalized as time range queries where 
> the time range is typically short, such as the last x minutes or hours, or 
> expressed as a specific time range in the last n days, where n is typically 
> less than 7.
> A change is an update in a row. That is, a change is either updating one or 
> more columns of a table for a given row or deleting a row. It is desirable to 
> provide these changes in the order of their arrival. One can visualize the 
> delivery of these changes through a stream from a Phoenix table to the 
> application, initiated by the application, similar to the delivery of 
> any other Phoenix query results. The difference is that a regular query 
> result includes at most one result row for each row satisfying the query and 
> the deleted rows are not visible to the query result while the CDC 
> stream/result can include multiple result rows for each row and the result 
> includes deleted rows. Some use cases need to also get the pre and/or post 
> image of the row along with a change on the row. 
> The design proposed here leverages Phoenix Max Lookback and Uncovered (Global 
> or Local) Indexes. The max lookback feature retains recent changes to a 
> table, that is, the changes that have been done in the last x days typically. 
> This means that the max lookback feature already captures the changes to a 
> given table. Currently, the max lookback age is configurable at the cluster 
> level. We need to extend this capability to be able to configure the max 
> lookback age at the table level so that each table can have a different max 
> lookback age based on its CDC application requirements.
> To deliver the changes in the order of their arrival, we need a time based 
> index. This index should be uncovered as the changes are already retained in 
> the table by the max lookback feature. The arrival time can be defined as the 
> mutation timestamp generated by the server, or a user-specified timestamp (or 
> any other long integer) column. An uncovered index would allow us to access 
> the changes efficiently and in order. Changes to an index table are 
> also preserved by the max lookback feature.
> A CDC feature can be composed of the following components:
>  * {*}CDCUncoveredIndexRegionScanner{*}: This is a server side scanner on an 
> uncovered index used for CDC. This can inherit UncoveredIndexRegionScanner. 
> It goes through index table rows using a raw scan to identify data table rows 
> and retrieves these rows using a raw scan. Using the time range, it forms a 
> JSON blob to represent changes to the row including pre and/or post row 
> images.
>  * {*}CDC Query Compiler{*}: This is a client side component. It prepares the 
> scan object based on the given CDC query statement. 
>  * {*}CDC DDL Compiler{*}: This is a client side component. It creates the 
> time based uncovered (global/local) index based on the given CDC DDL 
> statement and a virtual table of CDC type. CDC will be a new table type. 
> A CDC DDL syntax to create CDC on a (data) table can be as follows: 
> Create CDC <CDC name> on <table name> (PHOENIX_ROW_TIMESTAMP() | 
> <timestamp column>) INCLUDE (pre | post | latest | all) TTL = <time to live in 
> seconds> INDEX = <global | local> SALT_BUCKETS=<n>
> The above CDC DDL creates a virtual CDC table and an uncovered index. The CDC 
> table PK columns start with the timestamp or user defined column and continue 
> with the data table PK columns. The CDC table includes one non-PK column 
> which is a JSON column. The change is expressed in this JSON column in 
> multiple ways based on the CDC DDL or query statement. The change can be 
> expressed as just the mutation for the change, the latest image of the row, 
> the pre image of the row (the image before the change), the post image, or 
> any combination of these. The CDC table is not a physical table on disk. It 
> is just a virtual table to be used in a CDC query. Phoenix stores just the 
> metadata for this virtual table. 
> A CDC query can be as follows:
> Select * from <CDC table name> where PHOENIX_ROW_TIMESTAMP() >= TO_DATE( …) 
> AND PHOENIX_ROW_TIMESTAMP() < TO_DATE( …)

[jira] [Resolved] (PHOENIX-7014) CDC query compiler and optimizer

2024-01-01 Thread Hari Krishna Dara (Jira)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-7014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Krishna Dara resolved PHOENIX-7014.

Resolution: Fixed

PR: [https://github.com/apache/phoenix/pull/1766]

Merged into the feature branch.

> CDC query compiler and optimizer
> 
>
> Key: PHOENIX-7014
> URL: https://issues.apache.org/jira/browse/PHOENIX-7014
> Project: Phoenix
>  Issue Type: Sub-task
>Reporter: Viraj Jasani
>Assignee: Hari Krishna Dara
>Priority: Major
>
> For the CDC table type, the query optimizer should be able to query the 
> uncovered global index table together with the data table associated with the 
> given CDC table.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (PHOENIX-7013) CDC DQL Select query parser

2024-01-01 Thread Hari Krishna Dara (Jira)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-7013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Krishna Dara reassigned PHOENIX-7013:
--

Assignee: Hari Krishna Dara

> CDC DQL Select query parser
> ---
>
> Key: PHOENIX-7013
> URL: https://issues.apache.org/jira/browse/PHOENIX-7013
> Project: Phoenix
>  Issue Type: Sub-task
>Reporter: Viraj Jasani
>Assignee: Hari Krishna Dara
>Priority: Major
>
> The purpose of this sub-task is to provide DQL query capability for the CDC 
> (Change Data Capture) feature.
> The SELECT query parser can identify the given CDC table based on the table 
> type defined in SYSTEM.CATALOG and it should be able to parse qualifiers (pre 
> | post | latest | all) from the query.
> CDC DQL query sample:
>  
> {code:java}
> Select * from <CDC table name> where PHOENIX_ROW_TIMESTAMP() >= TO_DATE( …) 
> AND PHOENIX_ROW_TIMESTAMP() < TO_DATE( …)
> {code}
> This query would return the rows of the CDC table. The above select query can 
> be hinted at by using a new CDC hint to return just the actual change, pre, 
> post, or latest image of the row, or a combination of them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PHOENIX-7013) CDC DQL Select query parser

2024-01-01 Thread Hari Krishna Dara (Jira)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-7013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Krishna Dara resolved PHOENIX-7013.

Resolution: Fixed

PR: [https://github.com/apache/phoenix/pull/1766]

Change has been merged into the feature branch.

> CDC DQL Select query parser
> ---
>
> Key: PHOENIX-7013
> URL: https://issues.apache.org/jira/browse/PHOENIX-7013
> Project: Phoenix
>  Issue Type: Sub-task
>Reporter: Viraj Jasani
>Assignee: Hari Krishna Dara
>Priority: Major
>
> The purpose of this sub-task is to provide DQL query capability for the CDC 
> (Change Data Capture) feature.
> The SELECT query parser can identify the given CDC table based on the table 
> type defined in SYSTEM.CATALOG and it should be able to parse qualifiers (pre 
> | post | latest | all) from the query.
> CDC DQL query sample:
>  
> {code:java}
> Select * from <CDC table name> where PHOENIX_ROW_TIMESTAMP() >= TO_DATE( …) 
> AND PHOENIX_ROW_TIMESTAMP() < TO_DATE( …)
> {code}
> This query would return the rows of the CDC table. The above select query can 
> be hinted at by using a new CDC hint to return just the actual change, pre, 
> post, or latest image of the row, or a combination of them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)