Re: Optimize the type conversion of spark array function and map function in calcite

2024-04-15 Thread Cancai Cai
  Thank you for your suggestion. Spark SQL is open source.

Best wishes,
Cancai Cai

Mihai Budiu wrote on Tue, Apr 16, 2024 at 00:36:

> Is the spark SQL implementation open-source?
> If it is, the algorithm they use may be inferred from the code.
>
> Mihai
> 
> From: Cancai Cai 
> Sent: Monday, April 15, 2024 8:10 AM
> To: dev@calcite.apache.org 
> Subject: Optimize the type conversion of spark array function and map
> function in calcite
>
> Hi, calcite community,
>
> Recently, I have been testing the map- and array-related functions of Spark
> in Calcite. I found that in some cases, Spark's type conversion behavior
> differs from what we expect.
>
> For example
>
> scala>  val df = spark.sql("select map_contains_key(map(1, 'a', 2, 'b'),
> 2.0)")
> val df: org.apache.spark.sql.DataFrame = [map_contains_key(map(1, a,
> 2, b), 2.0): boolean]
>
> scala> df.show()
> +--------------------------------------+
> |map_contains_key(map(1, a, 2, b), 2.0)|
> +--------------------------------------+
> |                                  true|
> +--------------------------------------+
>
> Mihai Budiu pointed out that Spark may internally be doing something similar to:
>
> map_contains_key(map((Double)1, 'a', (Double)2, 'b'), 2.0)
>
> We can't say that Spark is wrong; we should adapt to this behavior. So I
> considered adding an adjustTypeForMapContainsKey method to perform an
> explicit conversion, but the situation is unlikely to be limited to
> map_contains_key; we cannot guarantee that map_concat or other related
> functions have no similar problems. Therefore, we should identify the
> common characteristics these functions share in type conversion and
> encapsulate the handling in a single unified method, instead of adding a
> similar adjust method to each function.
>
> I thought I should do this in three steps.
>
> ① Test the various cases of the map and array functions in Spark, and raise
> a Jira issue whenever Calcite's behavior is inconsistent with Spark's.
>
> ② Summarize the common characteristics of these functions and find out
> whether there is any relationship between them.
>
> ③ For the common characteristics, encapsulate the type conversion in a
> single shared method.
>
> These are my personal thoughts. I feel that this approach would make the
> Calcite code easier to maintain.
>
> Finally, thank you for reading.
>
> Best wishes,
>
> Cancai Cai
>


Re: [jira] [Created] (CALCITE-6363) Introduce a rule to derive more filters from inner join condition

2024-04-15 Thread Julian Hyde
James,

Thanks for chiming in. I added your comments to the jira case.

Julian


> On Apr 15, 2024, at 12:58 PM, James Starr  wrote:
> 
> The keyword I think you want is transitive filter pushdown.  The reduce
> expression rule handles some of the trivial cases outlined as examples.
> Also, you will need to simplify the pushed down filters after they are
> extracted to prevent infinite loops.
> 
> Ideally, for the equivalenceMap, an arbitrary subtree that only
> references a single side of the join could be used.
> 
> Example 1:
> SELECT *
> FROM t1, t2
> WHERE subString(t1.zip, 0, 6) = subString(t2.zip, 0, 6)
>  AND subString(t1.zip, 0, 6) IN ()
> 
> On Sat, Apr 13, 2024 at 1:24 AM ruanhui (Jira)  wrote:
> 
>> ruanhui created CALCITE-6363:
>> 
>> 
>> Summary: Introduce a rule to derive more filters from inner
>> join condition
>> Key: CALCITE-6363
>> URL: https://issues.apache.org/jira/browse/CALCITE-6363
>> Project: Calcite
>>  Issue Type: New Feature
>>  Components: core
>>Reporter: ruanhui
>> 
>> 
>> Sometimes we can infer more predicates from an inner join. For example, in
>> the query
>> SELECT * FROM ta INNER JOIN tb ON ta.x = tb.y WHERE ta.x > 10
>> we can infer the condition tb.y > 10 and push it down to the table tb.
>> In this way, it is possible to reduce the amount of data involved in the
>> join.
>> 
>> To achieve this, here is my idea.
>> The core data structure is two Multimaps:
>> predicateMap : maps an inputRef to its predicates, such as: $1
>> -> [$1 > 10, $1 < 20, $1 = $2]
>> equivalenceMap : maps an inputRef to its equivalent values or
>> inputRefs, such as: $1 -> [$2, 1]
>> 
>> The filter derivation is divided into 4 steps:
>> 1. construct the predicate map and equivalence map by traversing all
>> conjunctions in the condition
>> 2. search the maps and rewrite predicates with equivalent inputRefs or
>> literals
>> 2.1 find all inputRefs that are equivalent to the current inputRef, and
>> then rewrite all predicates involving those equivalent inputRefs using the
>> current inputRef; for example, if we have inputRef $1 = equivInputRef $2,
>> then we can rewrite {$2 = 10} to {$1 = 10}.
>> 2.2 find all predicates involving the current inputRef. If any predicate
>> refers to another inputRef, rewrite the predicate with the literal/constant
>> equivalent to that inputRef; for example, if we have {$1 > $2} and
>> {$2 = 10} then we can infer the new condition {$1 > 10}.
>> 2.3 derive new predicates based on the equivalence relations in the
>> equivalenceMap
>> 3. compose all original predicates and derived predicates
>> 4. simplify expressions, e.g. by range merging: {$1 > 10 AND $1 > 20}
>> => {$1 > 20}, {$1 > $2 AND $1 > $2} => {$1 > $2}
>> 
>> Anyone interested in this, please feel free to comment on this issue.
>> 
>> 
>> 
>> --
>> This message was sent by Atlassian Jira
>> (v8.20.10#820010)
>> 



Re: Supporting ASOF JOIN

2024-04-15 Thread Julian Hyde
I would regard this as two separate but related things: a new SQL syntax for 
joins, and a new relational operator. It is definitely worth keeping them 
separate; the operator will not map 1-1 to the syntax, may require its inputs to 
be sorted, and of course we would want queries to be able to use the 
operator even if they didn’t use the syntax.

The relational operator can have physical implementations in various calling 
conventions. Or even flags extending existing algorithms (e.g. add a 
‘keepAtMostOneOnLeft’ flag to EnumerableMergeJoin).

Regarding whether to represent the operator as a subclass of Join or just a 
subclass of BiRel: I recommend making it a subclass of Join, but we have to 
take care that rewrite rules and metadata rules designed to apply to regular 
joins do not accidentally apply to these joins. We’ve already done that with 
semi-join, so it shouldn’t be too hard to follow those breadcrumbs.
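
To make this concrete, here is a rough, hypothetical sketch of such an operator.
This is not existing Calcite code; the constructor and copy() signatures follow
the Join base class of recent Calcite versions and may need adjusting, and a real
version would also need to validate the match condition:

import com.google.common.collect.ImmutableList;
import com.google.common.collect.ImmutableSet;
import org.apache.calcite.plan.RelOptCluster;
import org.apache.calcite.plan.RelTraitSet;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.core.Join;
import org.apache.calcite.rel.core.JoinRelType;
import org.apache.calcite.rex.RexNode;

/** Hypothetical ASOF join: a Join whose result keeps, for each left row,
 * at most one right row, the "last" one according to matchCondition.
 * Being a Join subclass means join rules apply only where we opt in. */
public class AsOfJoin extends Join {
  /** Inequality over the "time" columns that picks the last matching right row. */
  public final RexNode matchCondition;

  protected AsOfJoin(RelOptCluster cluster, RelTraitSet traitSet,
      RelNode left, RelNode right, RexNode condition,
      RexNode matchCondition, JoinRelType joinType) {
    super(cluster, traitSet, ImmutableList.of(), left, right, condition,
        ImmutableSet.of(), joinType);
    this.matchCondition = matchCondition;
  }

  @Override public AsOfJoin copy(RelTraitSet traitSet, RexNode conditionExpr,
      RelNode left, RelNode right, JoinRelType joinType, boolean semiJoinDone) {
    return new AsOfJoin(getCluster(), traitSet, left, right, conditionExpr,
        matchCondition, joinType);
  }
}

Rules would then match AsOfJoin only where we explicitly register them for it,
which is exactly the opt-in behavior we want.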

I recently read “The Complete Story of Joins (in HyPer)” [1], which contains some 
other interesting and useful join variants: dependent join and mark join. We 
should consider adding these as relational operators, in the same way that we 
add the asof-join.

Julian

[1] 
http://btw2017.informatik.uni-stuttgart.de/slidesandpapers/F1-10-37/paper_web.pdf

> On Apr 15, 2024, at 2:19 PM, Mihai Budiu  wrote:
> 
> Hello,
> 
> Seems that this new kind of JOIN named AS OF is very useful for processing 
> time-series data. Here is some example documentation from Snowflake: 
> https://docs.snowflake.com/en/sql-reference/constructs/asof-join
> 
> The semantics are similar to a traditional join, but the result always 
> contains each record from the left side at most once, paired with the last 
> matching record on the right side (where "time" is any value that can be 
> compared for inequality). This can be expressed in SQL, but it looks very 
> cumbersome, using a JOIN, a GROUP BY, and then an aggregation to keep the 
> last value.
> 
> I haven't seen anything like that in Calcite, although Calcite does seem to 
> have support for all sorts of temporal and stream notions.
> 
> If one were to implement it, what would be the right way to do it? A subclass 
> of Join? A new type of BiRel RelNode?
> 
> Thank you,
> Mihai



Re: logical-to-logical rewrites

2024-04-15 Thread Mihai Budiu
Maybe the hooks can help you debug what's going on?
https://stackoverflow.com/questions/60091348/is-there-any-way-to-view-the-physical-sqls-executed-by-calcite-jdbc
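
As a sketch of what that looks like in code (hook names from
org.apache.calcite.runtime.Hook in recent Calcite versions; for the JDBC adapter
the QUERY_PLAN payload is the generated SQL string, other adapters may pass other
objects):

import java.util.function.Consumer;
import org.apache.calcite.runtime.Hook;

public class QueryPlanHookDemo {
  public static void main(String[] args) {
    // Thread-local hook; removed automatically when the try block exits.
    try (Hook.Closeable ignored =
             Hook.QUERY_PLAN.addThread((Consumer<Object>) plan ->
                 System.out.println("Query sent to backend: " + plan))) {
      // ... execute the Calcite JDBC statement here; the hook fires while it runs.
    }
  }
}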

Mihai

From: Baca Radim 
Sent: Sunday, April 7, 2024 2:58 AM
To: dev@calcite.apache.org 
Subject: logical-to-logical rewrites

Hi everyone!

I'm testing the Calcite heuristic optimizer and would like to prepare
some code samples to demonstrate logical-to-logical rewrites in Calcite.
However, none of the rewrites are applied to my SQL, so I am probably doing
something wrong. Here is the complete test project:
https://github.com/RadimBaca/Calcite_csv


I'll start with a query

SELECT `D`.`ID`, COUNT(*)
FROM (SELECT *
      FROM `PERSON`) AS `P`
INNER JOIN `DEPARTMENT` AS `D` ON `P`.`DEPARTMENT` = `D`.`ID`
WHERE `D`.`CITY` = 'Ostrava'
GROUP BY `D`.`ID`, 23

and the plan

LogicalProject(ID=[$0], EXPR$1=[$2])
  LogicalAggregate(group=[{0, 1}], EXPR$1=[COUNT()])
    LogicalProject(ID=[$4], $f1=[23])
      LogicalFilter(condition=[=($6, 'Ostrava')])
        LogicalJoin(condition=[=($3, $4)], joinType=[inner])
          LogicalProject(ID=[$0], Name=[$1], Salary=[$2], DeparmentID=[$3])
            LogicalTableScan(table=[[Person]])
          LogicalTableScan(table=[[Department]])


I'm using the following list of rules:

planner.addRule(CoreRules.PROJECT_TO_CALC);
planner.addRule(CoreRules.FILTER_TO_CALC);
planner.addRule(CoreRules.PROJECT_JOIN_TRANSPOSE);
planner.addRule(CoreRules.FILTER_REDUCE_EXPRESSIONS);
planner.addRule(CoreRules.CALC_REDUCE_EXPRESSIONS);
planner.addRule(CoreRules.PROJECT_REMOVE);
planner.addRule(CoreRules.FILTER_AGGREGATE_TRANSPOSE);
planner.addRule(CoreRules.AGGREGATE_PROJECT_PULL_UP_CONSTANTS);


But the logical plan does not change. I expect it to remove the
unnecessary projection LogicalProject(ID=[$0], Name=[$1], Salary=[$2],
DeparmentID=[$3]) and maybe also the constant 23 in the aggregation. I
would also like to do an Aggregation pushdown; however, I don't know
which rule corresponds to it.


I would be glad for any help.


Thanks,

Radim


Supporting ASOF JOIN

2024-04-15 Thread Mihai Budiu
Hello,

Seems that this new kind of JOIN named AS OF is very useful for processing 
time-series data. Here is some example documentation from Snowflake: 
https://docs.snowflake.com/en/sql-reference/constructs/asof-join

The semantics are similar to a traditional join, but the result always contains 
each record from the left side at most once, paired with the last matching record 
on the right side (where "time" is any value that can be compared for inequality). 
This can be expressed in SQL, but it looks very cumbersome, using a JOIN, a 
GROUP BY, and then an aggregation to keep the last value.

I haven't seen anything like that in Calcite, although Calcite does seem to 
have support for all sorts of temporal and stream notions.

If one were to implement it, what would be the right way to do it? A subclass 
of Join? A new type of BiRel RelNode?

Thank you,
Mihai


Re: [jira] [Created] (CALCITE-6363) Introduce a rule to derive more filters from inner join condition

2024-04-15 Thread James Starr
The keyword I think you want is transitive filter pushdown.  The reduce
expression rule handles some of the trivial cases outlined as examples.
Also, you will need to simplify the pushed down filters after they are
extracted to prevent infinite loops.

Ideally, for the equivalenceMap, an arbitrary subtree that only
references a single side of the join could be used.

Example 1:
SELECT *
FROM t1, t2
WHERE subString(t1.zip, 0, 6) = subString(t2.zip, 0, 6)
  AND subString(t1.zip, 0, 6) IN ()
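
If you want to experiment with the trivial cases, the existing transitive-predicate
and reduce-expression rules can be combined in a HepPlanner. A minimal sketch, with
rule names taken from CoreRules in recent Calcite versions; treat it as a starting
point rather than the full derivation proposed in the issue:

import org.apache.calcite.plan.hep.HepPlanner;
import org.apache.calcite.plan.hep.HepProgram;
import org.apache.calcite.plan.hep.HepProgramBuilder;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.rules.CoreRules;

public final class TransitiveFilterDemo {
  /** Derives predicates across the join condition (ta.x = tb.y AND ta.x > 10
   * should yield tb.y > 10 on the tb side) and then simplifies them so the
   * planner does not loop. */
  public static RelNode deriveJoinFilters(RelNode root) {
    HepProgram program = new HepProgramBuilder()
        .addRuleInstance(CoreRules.JOIN_PUSH_TRANSITIVE_PREDICATES)
        .addRuleInstance(CoreRules.FILTER_REDUCE_EXPRESSIONS)
        .build();
    HepPlanner planner = new HepPlanner(program);
    planner.setRoot(root);
    return planner.findBestExp();
  }
}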

On Sat, Apr 13, 2024 at 1:24 AM ruanhui (Jira)  wrote:

> ruanhui created CALCITE-6363:
> 
>
>  Summary: Introduce a rule to derive more filters from inner
> join condition
>  Key: CALCITE-6363
>  URL: https://issues.apache.org/jira/browse/CALCITE-6363
>  Project: Calcite
>   Issue Type: New Feature
>   Components: core
> Reporter: ruanhui
>
>
> Sometimes we can infer more predicates from an inner join. For example, in
> the query
> SELECT * FROM ta INNER JOIN tb ON ta.x = tb.y WHERE ta.x > 10
> we can infer the condition tb.y > 10 and push it down to the table tb.
> In this way, it is possible to reduce the amount of data involved in the
> join.
>
> To achieve this, here is my idea.
> The core data structure is two Multimaps:
> predicateMap : maps an inputRef to its predicates, such as: $1
> -> [$1 > 10, $1 < 20, $1 = $2]
> equivalenceMap : maps an inputRef to its equivalent values or
> inputRefs, such as: $1 -> [$2, 1]
>
> The filter derivation is divided into 4 steps:
> 1. construct the predicate map and equivalence map by traversing all
> conjunctions in the condition
> 2. search the maps and rewrite predicates with equivalent inputRefs or
> literals
> 2.1 find all inputRefs that are equivalent to the current inputRef, and
> then rewrite all predicates involving those equivalent inputRefs using the
> current inputRef; for example, if we have inputRef $1 = equivInputRef $2,
> then we can rewrite {$2 = 10} to {$1 = 10}.
> 2.2 find all predicates involving the current inputRef. If any predicate
> refers to another inputRef, rewrite the predicate with the literal/constant
> equivalent to that inputRef; for example, if we have {$1 > $2} and
> {$2 = 10} then we can infer the new condition {$1 > 10}.
> 2.3 derive new predicates based on the equivalence relations in the
> equivalenceMap
> 3. compose all original predicates and derived predicates
> 4. simplify expressions, e.g. by range merging: {$1 > 10 AND $1 > 20}
> => {$1 > 20}, {$1 > $2 AND $1 > $2} => {$1 > $2}
>
> Anyone interested in this, please feel free to comment on this issue.
>
>
>
> --
> This message was sent by Atlassian Jira
> (v8.20.10#820010)
>


Re: Optimize the type conversion of spark array function and map function in calcite

2024-04-15 Thread Mihai Budiu
Is the spark SQL implementation open-source?
If it is, the algorithm they use may be inferred from the code.

Mihai

From: Cancai Cai 
Sent: Monday, April 15, 2024 8:10 AM
To: dev@calcite.apache.org 
Subject: Optimize the type conversion of spark array function and map function 
in calcite

Hi, calcite community,

Recently, I have been testing the map- and array-related functions of Spark
in Calcite. I found that in some cases, Spark's type conversion behavior
differs from what we expect.

For example

scala>  val df = spark.sql("select map_contains_key(map(1, 'a', 2, 'b'), 2.0)")
val df: org.apache.spark.sql.DataFrame = [map_contains_key(map(1, a,
2, b), 2.0): boolean]

scala> df.show()
+--------------------------------------+
|map_contains_key(map(1, a, 2, b), 2.0)|
+--------------------------------------+
|                                  true|
+--------------------------------------+

Mihai Budiu pointed out that Spark may internally be doing something similar to:

map_contains_key(map((Double)1, 'a', (Double)2, 'b'), 2.0)

We can't say that Spark is wrong; we should adapt to this behavior. So I
considered adding an adjustTypeForMapContainsKey method to perform an
explicit conversion, but the situation is unlikely to be limited to
map_contains_key; we cannot guarantee that map_concat or other related
functions have no similar problems. Therefore, we should identify the
common characteristics these functions share in type conversion and
encapsulate the handling in a single unified method, instead of adding a
similar adjust method to each function.

I thought I should do this in three steps.

① Test the various cases of the map and array functions in Spark, and raise
a Jira issue whenever Calcite's behavior is inconsistent with Spark's.

② Summarize the common characteristics of these functions and find out
whether there is any relationship between them.

③ For the common characteristics, encapsulate the type conversion in a
single shared method.

These are my personal thoughts. I feel that this approach would make the
Calcite code easier to maintain.

Finally, thank you for reading.

Best wishes,

Cancai Cai


[jira] [Created] (CALCITE-6367) Add timezone support for FORMAT clause in CAST (enabled in BigQuery)

2024-04-15 Thread Jerin John (Jira)
Jerin John created CALCITE-6367:
---

 Summary: Add timezone support for FORMAT clause in CAST (enabled 
in BigQuery)
 Key: CALCITE-6367
 URL: https://issues.apache.org/jira/browse/CALCITE-6367
 Project: Calcite
  Issue Type: Bug
Reporter: Jerin John


This issue is a follow-up to CALCITE-6269, which fixes some of Calcite's existing 
format-element implementation to align it with BQ functionality. Two major pieces 
that might require a bit more rework are support for the TZH/TZM elements and for 
time zone areas, as described below:
 * [Parsing timestamp 
literals|https://cloud.google.com/bigquery/docs/reference/standard-sql/format-elements#format_tz_as_string]
 with timezones as used by BQ does not seem to be supported yet (format element 
TZR is unimplemented; BQ has TZH, TZM for hour and minute offsets),
e.g. {{cast('2020.06.03 00:00:53+00' as timestamp format '.MM.DD 
HH:MI:SSTZH')}}

 * BQ format [timezone as string 
|https://cloud.google.com/bigquery/docs/reference/standard-sql/format-elements#format_tz_as_string]
 can take an additional argument {{AT TIME ZONE 'Asia/Kolkata'}}, which 
would require additional parser changes and a time zone parameter to be plumbed 
into the cast operator call.

One important thing to consider is that the {{SimpleDateFormat}} class, which 
currently stores the datetime object, may not fully support the timezone features 
described above and might warrant a broader refactoring of this code.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Optimize the type conversion of spark array function and map function in calcite

2024-04-15 Thread Cancai Cai
Hi, calcite community,

Recently, I have been testing the map- and array-related functions of Spark
in Calcite. I found that in some cases, Spark's type conversion behavior
differs from what we expect.

For example

scala>  val df = spark.sql("select map_contains_key(map(1, 'a', 2, 'b'), 2.0)")
val df: org.apache.spark.sql.DataFrame = [map_contains_key(map(1, a,
2, b), 2.0): boolean]

scala> df.show()
+--------------------------------------+
|map_contains_key(map(1, a, 2, b), 2.0)|
+--------------------------------------+
|                                  true|
+--------------------------------------+

Mihai Budiu pointed out that Spark may internally be doing something similar to:

map_contains_key(map((Double)1, 'a', (Double)2, 'b'), 2.0)

We can't say that Spark is wrong; we should adapt to this behavior. So I
considered adding an adjustTypeForMapContainsKey method to perform an
explicit conversion, but the situation is unlikely to be limited to
map_contains_key; we cannot guarantee that map_concat or other related
functions have no similar problems. Therefore, we should identify the
common characteristics these functions share in type conversion and
encapsulate the handling in a single unified method, instead of adding a
similar adjust method to each function.
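
As a rough illustration of what such a unified adjustment could look like on the
Calcite side, here is a minimal sketch. adjustKeyOperand is a hypothetical helper,
not existing Calcite code; it leans on RelDataTypeFactory.leastRestrictive and
RexBuilder.makeCast, and a complete version would also need to widen the map's own
key type when the common type differs from it:

import java.util.Arrays;
import org.apache.calcite.rel.type.RelDataType;
import org.apache.calcite.rel.type.RelDataTypeFactory;
import org.apache.calcite.rex.RexBuilder;
import org.apache.calcite.rex.RexNode;

public final class MapFunctionTypeAdjuster {
  /** Casts the lookup key of a map function call (e.g. map_contains_key) to the
   * least restrictive type shared with the map's key type, so that
   * map_contains_key(map(1, 'a', 2, 'b'), 2.0) compares DOUBLE with DOUBLE,
   * mirroring what Spark appears to do. */
  public static RexNode adjustKeyOperand(RexBuilder rexBuilder,
      RelDataTypeFactory typeFactory, RexNode mapOperand, RexNode keyOperand) {
    RelDataType mapKeyType = mapOperand.getType().getKeyType();
    if (mapKeyType == null) {
      return keyOperand;   // not a map type; nothing to do
    }
    RelDataType commonType = typeFactory.leastRestrictive(
        Arrays.asList(mapKeyType, keyOperand.getType()));
    if (commonType == null || commonType.equals(keyOperand.getType())) {
      return keyOperand;   // no adjustment possible or necessary
    }
    return rexBuilder.makeCast(commonType, keyOperand);
  }
}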

I thought I should do this in three steps.

① Test the various cases of the map and array functions in Spark, and raise
a Jira issue whenever Calcite's behavior is inconsistent with Spark's.

② Summarize the common characteristics of these functions and find out
whether there is any relationship between them.

③ For the common characteristics, encapsulate the type conversion in a
single shared method.

These are my personal thoughts. I feel that this approach would make the
Calcite code easier to maintain.

Finally, thank you for reading.

Best wishes,

Cancai Cai


[jira] [Created] (CALCITE-6366) Code generated by EnumUtils#convert should throw an exception if the target type is overflowed

2024-04-15 Thread Ruben Q L (Jira)
Ruben Q L created CALCITE-6366:
--

 Summary: Code generated by EnumUtils#convert should throw an 
exception if the target type is overflowed
 Key: CALCITE-6366
 URL: https://issues.apache.org/jira/browse/CALCITE-6366
 Project: Calcite
  Issue Type: Improvement
  Components: core
Reporter: Ruben Q L


Code generated by EnumUtils#convert should throw an exception if the value 
overflows the target type (consider using Expressions#convertChecked).
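
For illustration, a minimal sketch of the difference using the linq4j expression
API (method names as in recent Calcite versions; the exact code that EnumUtils
generates is of course up to EnumUtils):

{code}
import org.apache.calcite.linq4j.tree.Expression;
import org.apache.calcite.linq4j.tree.Expressions;

public final class ConvertCheckedDemo {
  public static void main(String[] args) {
    // Unchecked conversion: generates a plain Java cast, so (byte) 300
    // silently wraps to 44 instead of signalling the overflow.
    Expression unchecked =
        Expressions.convert_(Expressions.constant(300), byte.class);
    // Checked conversion: generates a cast that is expected to raise an
    // exception when the value does not fit the target type.
    Expression checked =
        Expressions.convertChecked(Expressions.constant(300), byte.class);
    System.out.println(unchecked + " vs " + checked);
  }
}
{code}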



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (CALCITE-6365) Support for RETURNING clause of JSON_QUERY

2024-04-15 Thread Dawid Wysakowicz (Jira)
Dawid Wysakowicz created CALCITE-6365:
-

 Summary: Support for RETURNING clause of JSON_QUERY
 Key: CALCITE-6365
 URL: https://issues.apache.org/jira/browse/CALCITE-6365
 Project: Calcite
  Issue Type: New Feature
Reporter: Dawid Wysakowicz


The SQL standard says {{JSON_QUERY}} should support a {{RETURNING}} clause similar 
to that of {{JSON_VALUE}}. Calcite already supports the clause for JSON_VALUE, but 
not for JSON_QUERY.

{code}
<JSON query> ::=
  JSON_QUERY <left paren>
  <JSON API common syntax>
  [ <JSON output clause> ]
  [ <JSON query wrapper behavior> WRAPPER ]
  [ <JSON query quotes behavior> QUOTES [ ON SCALAR STRING ] ]
  [ <JSON query empty behavior> ON EMPTY ]
  [ <JSON query error behavior> ON ERROR ]
  <right paren>

<JSON output clause> ::=
  RETURNING <data type>
  [ FORMAT <JSON representation> ]
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (CALCITE-6364) HttpClient SPNEGO support is deprecated

2024-04-15 Thread Istvan Toth (Jira)
Istvan Toth created CALCITE-6364:


 Summary: HttpClient SPNEGO support is deprecated
 Key: CALCITE-6364
 URL: https://issues.apache.org/jira/browse/CALCITE-6364
 Project: Calcite
  Issue Type: Bug
  Components: avatica
Reporter: Istvan Toth


The Avatica Java client depends on Apache HttpClient's Kerberos/SPNEGO 
implementation.
According to HTTPCLIENT-1625 that implementation is not secure, and is 
deprecated in newer versions.

Unfortunately, HTTPCLIENT-1625 is very scant on details, and since the reason 
given for deprecation is the lack of time to fix it, it is likely not a trivial 
fix.

Unfortunately, Avatica depends heavily on HttpClient, and replacing it would be a 
big job.

While Avatica in theory has a configurable HTTP client implementation, the only 
non-HttpClient implementation is more of a POC, and does not support ANY 
authentication methods.

I can see these options:

1. Find another HTTP client library, and use it in Avatica.
2. Copy the SPNEGO auth code from HttpClient, and fix it in Avatica.
3. Fix the SPNEGO auth code in HttpClient.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)