Re: Optimize the type conversion of spark array function and map function in calcite
Thank you for your suggestion. Spark SQL is open source.

Best wishes,
Cancai Cai

Mihai Budiu wrote on Tue, Apr 16, 2024 at 00:36:
> Is the Spark SQL implementation open-source?
> If it is, the algorithm they use may be inferred from the code.
>
> Mihai
>
> From: Cancai Cai
> Sent: Monday, April 15, 2024 8:10 AM
> To: dev@calcite.apache.org
> Subject: Optimize the type conversion of spark array function and map
> function in calcite
>
> Hi, Calcite community,
>
> Recently I have been testing Spark's map- and array-related functions in
> Calcite, and I found that in some cases Spark's type conversion differs
> from what we would expect.
>
> For example:
>
> scala> val df = spark.sql("select map_contains_key(map(1, 'a', 2, 'b'), 2.0)")
> val df: org.apache.spark.sql.DataFrame = [map_contains_key(map(1, a, 2, b), 2.0): boolean]
>
> scala> df.show()
> +--------------------------------------+
> |map_contains_key(map(1, a, 2, b), 2.0)|
> +--------------------------------------+
> |                                  true|
> +--------------------------------------+
>
> Mihai Budiu pointed out that Spark may internally do something like:
>
> map_contains_key(map((Double)1, 'a', (Double)2, 'b'), 2.0)
>
> We can't say that Spark is wrong; we should adapt to this behavior. I
> considered adding an adjustTypeForMapContainsKey method to perform an
> explicit conversion, but this situation probably does not exist only in
> map_contains_key; we cannot guarantee that map_concat and other related
> functions have no similar problems. Therefore, we should discover what
> common characteristics these functions have in type conversion and
> encapsulate them in a unified method, instead of adding a similar adjust
> method to each function.
>
> I think this should be done in three steps:
>
> ① Test the various cases of the map and array functions in Spark, and
> raise a Jira issue wherever Calcite's behavior is inconsistent with
> Spark's.
>
> ② Summarize the common characteristics of these functions and find out
> whether they are related.
>
> ③ Encapsulate the type conversion for those common characteristics in a
> single method.
>
> The above are my personal thoughts. I feel this would make the Calcite
> code easier to maintain.
>
> Finally, thank you for reading.
>
> Best wishes,
>
> Cancai Cai
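To make the "unified method" idea concrete, here is a minimal self-contained Java sketch. It is not Calcite code: SimpleType and leastRestrictive are hypothetical stand-ins for Calcite's type system (roughly RelDataTypeFactory.leastRestrictive), and mapContainsKey only illustrates the shape a shared coercion helper could take, coercing the map's key type and the probe argument to a common type before comparing, so the same helper could serve map_contains_key and similar functions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MapKeyCoercion {
    // Hypothetical miniature type system for the sketch.
    enum SimpleType { INTEGER, DOUBLE, STRING }

    // Stand-in for RelDataTypeFactory.leastRestrictive: INTEGER vs DOUBLE
    // widens to DOUBLE; incompatible pairs have no common type.
    static SimpleType leastRestrictive(SimpleType a, SimpleType b) {
        if (a == b) return a;
        if ((a == SimpleType.INTEGER && b == SimpleType.DOUBLE)
                || (a == SimpleType.DOUBLE && b == SimpleType.INTEGER)) {
            return SimpleType.DOUBLE;
        }
        return null;
    }

    // Cast a value to the target type (only the numeric case matters here).
    static Object castTo(Object v, SimpleType target) {
        if (target == SimpleType.DOUBLE && v instanceof Number) {
            return ((Number) v).doubleValue();
        }
        return v;
    }

    // One shared adjustment instead of adjustTypeForMapContainsKey,
    // adjustTypeForX, ...: coerce both sides to the least restrictive type.
    static boolean mapContainsKey(Map<Object, Object> map, SimpleType keyType,
                                  Object arg, SimpleType argType) {
        SimpleType common = leastRestrictive(keyType, argType);
        if (common == null) return false;
        Map<Object, Object> coerced = new LinkedHashMap<>();
        for (Map.Entry<Object, Object> e : map.entrySet()) {
            coerced.put(castTo(e.getKey(), common), e.getValue());
        }
        return coerced.containsKey(castTo(arg, common));
    }

    public static void main(String[] args) {
        Map<Object, Object> m = new LinkedHashMap<>();
        m.put(1, "a");
        m.put(2, "b");
        // Mirrors map_contains_key(map(1,'a',2,'b'), 2.0) returning true.
        boolean r = mapContainsKey(m, SimpleType.INTEGER, 2.0, SimpleType.DOUBLE);
        if (!r) throw new AssertionError("expected true after key coercion");
        System.out.println(r);
    }
}
```

The design point is that the coercion decision lives in one place (leastRestrictive) rather than being re-implemented per function.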
Re: [jira] [Created] (CALCITE-6363) Introduce a rule to derive more filters from inner join condition
James, Thanks for chiming in. I added your comments to the Jira case.

Julian

> On Apr 15, 2024, at 12:58 PM, James Starr wrote:
>
> The keyword I think you want is "transitive filter pushdown". The reduce
> expression rule handles some of the trivial cases outlined as examples.
> Also, you will need to simplify the pushed-down filters after they are
> extracted, to prevent infinite loops.
>
> Ideally, for the equivalenceMap, an arbitrary subtree that only
> references a single side of the join could be used.
>
> Example 1:
> SELECT *
> FROM t1, t2
> WHERE subString(t1.zip, 0, 6) = subString(t2.zip, 0, 6)
>   AND subString(t1.zip, 0, 6) IN ()
>
> On Sat, Apr 13, 2024 at 1:24 AM ruanhui (Jira) wrote:
>
>> ruanhui created CALCITE-6363:
>>
>> Summary: Introduce a rule to derive more filters from inner join condition
>> Key: CALCITE-6363
>> URL: https://issues.apache.org/jira/browse/CALCITE-6363
>> Project: Calcite
>> Issue Type: New Feature
>> Components: core
>> Reporter: ruanhui
>>
>> Sometimes we can infer more predicates from an inner join. For example,
>> in the query
>>
>> SELECT * FROM ta INNER JOIN tb ON ta.x = tb.y WHERE ta.x > 10
>>
>> we can infer the condition tb.y > 10 and push it down to table tb. In
>> this way, it is possible to reduce the amount of data involved in the
>> join.
>>
>> To achieve this, here is my idea. The core data structure is two
>> Multimaps:
>>
>> predicateMap: a map from an inputRef to its predicates, such as
>> $1 -> [$1 > 10, $1 < 20, $1 = $2]
>> equivalenceMap: a map from an inputRef to its equivalent values or
>> inputRefs, such as $1 -> [$2, 1]
>>
>> The filter derivation is divided into 4 steps:
>> 1. construct the predicate map and equivalence map by traversing all
>> conjunctions in the condition
>> 2. search the maps and rewrite predicates with equivalent inputRefs or
>> literals
>> 2.1 find all inputRefs that are equivalent to the current inputRef, and
>> rewrite all predicates involving those equivalent inputRefs using the
>> current inputRef; for example, if we have inputRef $1 = equivInputRef $2,
>> then we can rewrite {$2 = 10} to {$1 = 10}.
>> 2.2 find all predicates involving the current inputRef. If any predicate
>> refers to another inputRef, rewrite the predicate with the
>> literal/constant equivalent to that inputRef; for example, if we have
>> {$1 > $2} and {$2 = 10}, then we can infer the new condition {$1 > 10}.
>> 2.3 derive new predicates based on the equivalence relation in the
>> equivalenceMap
>> 3. compose all original predicates and derived predicates
>> 4. simplify expressions, such as range merging: {$1 > 10 AND $1 > 20}
>> => {$1 > 20}, {$1 > $2 AND $1 > $2} => {$1 > $2}
>>
>> Anyone interested in this, please feel free to comment on this issue.
>>
>> --
>> This message was sent by Atlassian Jira
>> (v8.20.10#820010)
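The derivation steps above can be sketched in miniature. The toy Java model below is not Calcite code (Pred and derive are invented for illustration, and predicates are restricted to the form $ref <op> literal); it covers only step 2.1's rewrite of literal predicates onto equivalent inputRefs, and deliberately omits step 4's simplification, which, as noted, is needed to prevent infinite loops in a real rule.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TransitiveFilters {
    // A predicate of the form "$ref <op> literal", e.g. "$0 > 10".
    record Pred(int ref, String op, int literal) {
        @Override public String toString() {
            return "$" + ref + " " + op + " " + literal;
        }
    }

    // Given an equivalence map built from join conjunctions like $0 = $3,
    // copy every literal predicate onto each equivalent inputRef.
    static Set<Pred> derive(Map<Integer, Set<Integer>> equiv, List<Pred> preds) {
        Set<Pred> out = new LinkedHashSet<>(preds);
        for (Pred p : preds) {
            for (int eq : equiv.getOrDefault(p.ref(), Set.of())) {
                out.add(new Pred(eq, p.op(), p.literal()));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // ta.x = tb.y  ->  $0 ~ $3 ;  WHERE ta.x > 10  ->  $0 > 10
        Map<Integer, Set<Integer>> equiv = Map.of(0, Set.of(3), 3, Set.of(0));
        Set<Pred> derived = derive(equiv, List.of(new Pred(0, ">", 10)));
        // The derived predicate $3 > 10 can now be pushed down to tb.
        if (!derived.contains(new Pred(3, ">", 10))) {
            throw new AssertionError("expected derived predicate $3 > 10");
        }
        System.out.println(derived);
    }
}
```

A real implementation would work over RexNode conjunctions and would need the simplification pass before re-registering derived filters.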
Re: Supporting ASOF JOIN
I would regard this as two separate but related things: a new SQL syntax for joins, and a new relational operator. It is definitely worth keeping them separate; the operator will not map 1-1 to the syntax, may require its inputs to be sorted, and of course we would want queries to be able to use the operator even if they didn't use the syntax.

The relational operator can have physical implementations in various calling conventions, or even flags extending existing algorithms (e.g. add a 'keepAtMostOneOnLeft' flag to EnumerableMergeJoin).

Regarding whether to represent the operator as a subclass of Join or just a subclass of BiRel: I recommend making it a subclass of Join, but we have to take care that rewrite rules and metadata rules designed to apply to regular joins do not accidentally apply to these joins. We've already done that with semi-join, so it shouldn't be too hard to follow those breadcrumbs.

I recently read "The Complete Story of Joins (in HyPer)" [1], which contains some other interesting and useful join variants: dependent join and mark join. We should consider adding these as relational operators, in the same way that we add asof-join.

Julian

[1] http://btw2017.informatik.uni-stuttgart.de/slidesandpapers/F1-10-37/paper_web.pdf

> On Apr 15, 2024, at 2:19 PM, Mihai Budiu wrote:
>
> Hello,
>
> It seems that this new kind of JOIN, named ASOF, is very useful for
> processing time-series data. Here is some example documentation from
> Snowflake:
> https://docs.snowflake.com/en/sql-reference/constructs/asof-join
>
> The semantics are similar to a traditional join, but the result always
> contains at most one record from the left side, paired with the last
> matching record on the right side (where "time" is any value that can be
> compared for inequality). This can be expressed in SQL, but it looks very
> cumbersome, using a JOIN, a GROUP BY, and then an aggregation to keep the
> last value.
>
> I haven't seen anything like that in Calcite, although Calcite does seem
> to have support for all sorts of temporal and stream notions.
>
> If one were to implement it, what would be the right way to do it? A
> subclass of Join? A new type of BiRel RelNode?
>
> Thank you,
> Mihai
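For reference, the operator's core semantics (at most one right-side match per left row, taking the latest right row not after the left row's time) can be sketched as a single merge pass over inputs sorted by time, which is why a physical implementation may want sorted inputs. This is only an illustration, not a proposed Calcite implementation, and it assumes the match condition is right.time <= left.time.

```java
import java.util.ArrayList;
import java.util.List;

public class AsofJoin {
    record Row(long time, String value) {}

    // For each left row, pair it with the latest right row whose time is
    // <= the left row's time; unmatched left rows pair with "null"
    // (i.e. LEFT ASOF JOIN behavior). Both inputs must be sorted by time.
    static List<String> asofJoin(List<Row> left, List<Row> right) {
        List<String> out = new ArrayList<>();
        int r = -1; // index of the last right row with right.time <= left.time
        for (Row l : left) {
            while (r + 1 < right.size() && right.get(r + 1).time() <= l.time()) {
                r++;
            }
            out.add(l.value() + "/" + (r >= 0 ? right.get(r).value() : "null"));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Row> trades = List.of(new Row(3, "t1"), new Row(7, "t2"));
        List<Row> quotes = List.of(new Row(1, "q1"), new Row(5, "q2"), new Row(9, "q3"));
        List<String> joined = asofJoin(trades, quotes);
        // t1 (time 3) matches q1 (time 1); t2 (time 7) matches q2 (time 5).
        if (!joined.equals(List.of("t1/q1", "t2/q2"))) {
            throw new AssertionError(joined.toString());
        }
        System.out.println(joined);
    }
}
```

This single-pass shape is what an EnumerableMergeJoin-style flag could reuse, versus the JOIN + GROUP BY + aggregation formulation, which materializes all matches before discarding all but the last.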
Re: logical-to-logical rewrites
Maybe the hooks can help you debug what's going on?
https://stackoverflow.com/questions/60091348/is-there-any-way-to-view-the-physical-sqls-executed-by-calcite-jdbc

Mihai

From: Baca Radim
Sent: Sunday, April 7, 2024 2:58 AM
To: dev@calcite.apache.org
Subject: logical-to-logical rewrites

Hi everyone! I'm testing the Calcite heuristic optimizer and would like to prepare some code samples that demonstrate logical-to-logical rewrites in Calcite. However, none of the rewrites are applied to my SQL, so I must be doing something wrong. Here is the complete test project: https://github.com/RadimBaca/Calcite_csv

I'll start with the query

SELECT `D`.`ID`, COUNT(*)
FROM (SELECT * FROM `PERSON`) AS `P`
INNER JOIN `DEPARTMENT` AS `D` ON `P`.`DEPARTMENT` = `D`.`ID`
WHERE `D`.`CITY` = 'Ostrava'
GROUP BY `D`.`ID`, 23

and the plan

LogicalProject(ID=[$0], EXPR$1=[$2])
  LogicalAggregate(group=[{0, 1}], EXPR$1=[COUNT()])
    LogicalProject(ID=[$4], $f1=[23])
      LogicalFilter(condition=[=($6, 'Ostrava')])
        LogicalJoin(condition=[=($3, $4)], joinType=[inner])
          LogicalProject(ID=[$0], Name=[$1], Salary=[$2], DeparmentID=[$3])
            LogicalTableScan(table=[[Person]])
          LogicalTableScan(table=[[Department]])

I'm using the following list of rules:

planner.addRule(CoreRules.PROJECT_TO_CALC);
planner.addRule(CoreRules.FILTER_TO_CALC);
planner.addRule(CoreRules.PROJECT_JOIN_TRANSPOSE);
planner.addRule(CoreRules.FILTER_REDUCE_EXPRESSIONS);
planner.addRule(CoreRules.CALC_REDUCE_EXPRESSIONS);
planner.addRule(CoreRules.PROJECT_REMOVE);
planner.addRule(CoreRules.FILTER_AGGREGATE_TRANSPOSE);
planner.addRule(CoreRules.AGGREGATE_PROJECT_PULL_UP_CONSTANTS);

But the logical plan does not change. I expect it to remove the unnecessary projection LogicalProject(ID=[$0], Name=[$1], Salary=[$2], DeparmentID=[$3]) and maybe also the constant 23 in the aggregation. I would also like to do an aggregation pushdown; however, I don't know which rule corresponds to it. I would be glad for any help.

Thanks,
Radim
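For readers unfamiliar with the rule in question: what CoreRules.PROJECT_REMOVE is expected to do in the plan above can be shown on a toy plan model. The classes below (Scan, Project, removeTrivialProjects) are invented for illustration, with no Calcite dependency; the real rule operates on RelNodes, but the idea is the same: a Project that is an identity mapping over its input can be removed.

```java
import java.util.ArrayList;
import java.util.List;

public class ProjectRemove {
    // Minimal plan nodes: Scan(table, fields) and Project(input, field refs).
    interface Rel {}
    record Scan(String table, List<String> fields) implements Rel {}
    record Project(Rel input, List<Integer> refs) implements Rel {}

    static List<String> fields(Rel rel) {
        if (rel instanceof Scan s) return s.fields();
        Project p = (Project) rel;
        List<String> in = fields(p.input());
        List<String> out = new ArrayList<>();
        for (int i : p.refs()) out.add(in.get(i));
        return out;
    }

    // What PROJECT_REMOVE does conceptually: a Project whose refs are the
    // identity mapping $0, $1, ..., $n-1 over its input is removed.
    static Rel removeTrivialProjects(Rel rel) {
        if (rel instanceof Project p) {
            Rel input = removeTrivialProjects(p.input());
            List<Integer> identity = new ArrayList<>();
            for (int i = 0; i < fields(input).size(); i++) identity.add(i);
            return p.refs().equals(identity) ? input : new Project(input, p.refs());
        }
        return rel;
    }

    public static void main(String[] args) {
        // SELECT * FROM Person: an identity Project($0..$3) over the scan,
        // like LogicalProject(ID=[$0], Name=[$1], Salary=[$2], DeparmentID=[$3]).
        Rel scan = new Scan("Person", List.of("ID", "Name", "Salary", "DeparmentID"));
        Rel plan = new Project(scan, List.of(0, 1, 2, 3));
        Rel rewritten = removeTrivialProjects(plan);
        if (!(rewritten instanceof Scan)) throw new AssertionError("project not removed");
        System.out.println(rewritten);
    }
}
```

Since the projection in the reported plan does look like an identity, the question of why the rule does not fire (planner setup, trait/convention mismatch, or the Project not being detected as trivial) is exactly what the hooks in the linked answer can help observe.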
Supporting ASOF JOIN
Hello,

It seems that this new kind of JOIN, named ASOF, is very useful for processing time-series data. Here is some example documentation from Snowflake: https://docs.snowflake.com/en/sql-reference/constructs/asof-join

The semantics are similar to a traditional join, but the result always contains at most one record from the left side, paired with the last matching record on the right side (where "time" is any value that can be compared for inequality). This can be expressed in SQL, but it looks very cumbersome, using a JOIN, a GROUP BY, and then an aggregation to keep the last value.

I haven't seen anything like that in Calcite, although Calcite does seem to have support for all sorts of temporal and stream notions.

If one were to implement it, what would be the right way to do it? A subclass of Join? A new type of BiRel RelNode?

Thank you,
Mihai
Re: [jira] [Created] (CALCITE-6363) Introduce a rule to derive more filters from inner join condition
The keyword I think you want is "transitive filter pushdown". The reduce expression rule handles some of the trivial cases outlined as examples. Also, you will need to simplify the pushed-down filters after they are extracted, to prevent infinite loops.

Ideally, for the equivalenceMap, an arbitrary subtree that only references a single side of the join could be used.

Example 1:
SELECT *
FROM t1, t2
WHERE subString(t1.zip, 0, 6) = subString(t2.zip, 0, 6)
  AND subString(t1.zip, 0, 6) IN ()

On Sat, Apr 13, 2024 at 1:24 AM ruanhui (Jira) wrote:

> ruanhui created CALCITE-6363:
>
> Summary: Introduce a rule to derive more filters from inner join condition
> Key: CALCITE-6363
> URL: https://issues.apache.org/jira/browse/CALCITE-6363
> Project: Calcite
> Issue Type: New Feature
> Components: core
> Reporter: ruanhui
>
> Sometimes we can infer more predicates from an inner join. For example,
> in the query
>
> SELECT * FROM ta INNER JOIN tb ON ta.x = tb.y WHERE ta.x > 10
>
> we can infer the condition tb.y > 10 and push it down to table tb. In
> this way, it is possible to reduce the amount of data involved in the
> join.
>
> To achieve this, here is my idea. The core data structure is two
> Multimaps:
>
> predicateMap: a map from an inputRef to its predicates, such as
> $1 -> [$1 > 10, $1 < 20, $1 = $2]
> equivalenceMap: a map from an inputRef to its equivalent values or
> inputRefs, such as $1 -> [$2, 1]
>
> The filter derivation is divided into 4 steps:
> 1. construct the predicate map and equivalence map by traversing all
> conjunctions in the condition
> 2. search the maps and rewrite predicates with equivalent inputRefs or
> literals
> 2.1 find all inputRefs that are equivalent to the current inputRef, and
> rewrite all predicates involving those equivalent inputRefs using the
> current inputRef; for example, if we have inputRef $1 = equivInputRef $2,
> then we can rewrite {$2 = 10} to {$1 = 10}.
> 2.2 find all predicates involving the current inputRef. If any predicate
> refers to another inputRef, rewrite the predicate with the
> literal/constant equivalent to that inputRef; for example, if we have
> {$1 > $2} and {$2 = 10}, then we can infer the new condition {$1 > 10}.
> 2.3 derive new predicates based on the equivalence relation in the
> equivalenceMap
> 3. compose all original predicates and derived predicates
> 4. simplify expressions, such as range merging: {$1 > 10 AND $1 > 20}
> => {$1 > 20}, {$1 > $2 AND $1 > $2} => {$1 > $2}
>
> Anyone interested in this, please feel free to comment on this issue.
>
> --
> This message was sent by Atlassian Jira
> (v8.20.10#820010)
Re: Optimize the type conversion of spark array function and map function in calcite
Is the Spark SQL implementation open-source? If it is, the algorithm they use may be inferred from the code.

Mihai

From: Cancai Cai
Sent: Monday, April 15, 2024 8:10 AM
To: dev@calcite.apache.org
Subject: Optimize the type conversion of spark array function and map function in calcite

Hi, Calcite community,

Recently I have been testing Spark's map- and array-related functions in Calcite, and I found that in some cases Spark's type conversion differs from what we would expect.

For example:

scala> val df = spark.sql("select map_contains_key(map(1, 'a', 2, 'b'), 2.0)")
val df: org.apache.spark.sql.DataFrame = [map_contains_key(map(1, a, 2, b), 2.0): boolean]

scala> df.show()
+--------------------------------------+
|map_contains_key(map(1, a, 2, b), 2.0)|
+--------------------------------------+
|                                  true|
+--------------------------------------+

Mihai Budiu pointed out that Spark may internally do something like:

map_contains_key(map((Double)1, 'a', (Double)2, 'b'), 2.0)

We can't say that Spark is wrong; we should adapt to this behavior. I considered adding an adjustTypeForMapContainsKey method to perform an explicit conversion, but this situation probably does not exist only in map_contains_key; we cannot guarantee that map_concat and other related functions have no similar problems. Therefore, we should discover what common characteristics these functions have in type conversion and encapsulate them in a unified method, instead of adding a similar adjust method to each function.

I think this should be done in three steps:

① Test the various cases of the map and array functions in Spark, and raise a Jira issue wherever Calcite's behavior is inconsistent with Spark's.

② Summarize the common characteristics of these functions and find out whether they are related.

③ Encapsulate the type conversion for those common characteristics in a single method.

The above are my personal thoughts. I feel this would make the Calcite code easier to maintain.

Finally, thank you for reading.

Best wishes,

Cancai Cai
[jira] [Created] (CALCITE-6367) Add timezone support for FORMAT clause in CAST (enabled in BigQuery)
Jerin John created CALCITE-6367:

Summary: Add timezone support for FORMAT clause in CAST (enabled in BigQuery)
Key: CALCITE-6367
URL: https://issues.apache.org/jira/browse/CALCITE-6367
Project: Calcite
Issue Type: Bug
Reporter: Jerin John

This issue is a follow-up on CALCITE-6269, which fixes some of Calcite's existing format-element implementation to align with BigQuery functionality. Two major pieces that might require a bit more rework are support for the TZH/TZM elements and for time zone areas, as described below:

* [Parsing timestamp literals|https://cloud.google.com/bigquery/docs/reference/standard-sql/format-elements#format_tz_as_string] with time zones as used by BigQuery does not seem to be supported yet (format element TZR is unimplemented; BigQuery has TZH and TZM for hour and minute offsets), e.g. {{cast('2020.06.03 00:00:53+00' as timestamp format 'YYYY.MM.DD HH:MI:SSTZH')}}
* The BigQuery format [timezone as string|https://cloud.google.com/bigquery/docs/reference/standard-sql/format-elements#format_tz_as_string] can take an additional argument {{AT TIME ZONE 'Asia/Kolkata'}}, which would require additional parser changes and a time zone parameter to be plumbed into the cast operator call.

One important thing to consider is that the {{SimpleDateFormat}} class, which currently stores the datetime object, may not fully support the timezone features described above and might warrant a broader refactoring of this code.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
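As a small aside on the SimpleDateFormat concern raised above: java.time can already parse the hour-only offset in the example literal. The sketch below is only an illustration of the target parsing behavior; the yyyy.MM.dd HH:mm:ssx pattern is an assumed translation of the BigQuery format elements (with 'x' playing the role of TZH), not Calcite's actual mapping.

```java
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class TzFormatDemo {
    public static void main(String[] args) {
        // Pattern letter 'x' accepts an hour-only offset such as "+00",
        // roughly what BigQuery's TZH element denotes.
        DateTimeFormatter f = DateTimeFormatter.ofPattern("yyyy.MM.dd HH:mm:ssx");
        OffsetDateTime t = OffsetDateTime.parse("2020.06.03 00:00:53+00", f);
        if (!t.getOffset().equals(ZoneOffset.UTC)) {
            throw new AssertionError(t.toString());
        }
        System.out.println(t);
    }
}
```

If the refactoring happens, a java.time-based representation would also make the proposed AT TIME ZONE parameter easier to plumb through, since OffsetDateTime and ZonedDateTime carry the zone with the value.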
Optimize the type conversion of spark array function and map function in calcite
Hi, Calcite community,

Recently I have been testing Spark's map- and array-related functions in Calcite, and I found that in some cases Spark's type conversion differs from what we would expect.

For example:

scala> val df = spark.sql("select map_contains_key(map(1, 'a', 2, 'b'), 2.0)")
val df: org.apache.spark.sql.DataFrame = [map_contains_key(map(1, a, 2, b), 2.0): boolean]

scala> df.show()
+--------------------------------------+
|map_contains_key(map(1, a, 2, b), 2.0)|
+--------------------------------------+
|                                  true|
+--------------------------------------+

Mihai Budiu pointed out that Spark may internally do something like:

map_contains_key(map((Double)1, 'a', (Double)2, 'b'), 2.0)

We can't say that Spark is wrong; we should adapt to this behavior. I considered adding an adjustTypeForMapContainsKey method to perform an explicit conversion, but this situation probably does not exist only in map_contains_key; we cannot guarantee that map_concat and other related functions have no similar problems. Therefore, we should discover what common characteristics these functions have in type conversion and encapsulate them in a unified method, instead of adding a similar adjust method to each function.

I think this should be done in three steps:

① Test the various cases of the map and array functions in Spark, and raise a Jira issue wherever Calcite's behavior is inconsistent with Spark's.

② Summarize the common characteristics of these functions and find out whether they are related.

③ Encapsulate the type conversion for those common characteristics in a single method.

The above are my personal thoughts. I feel this would make the Calcite code easier to maintain.

Finally, thank you for reading.

Best wishes,

Cancai Cai
[jira] [Created] (CALCITE-6366) Code generated by EnumUtils#convert should throw an exception if the target type overflows
Ruben Q L created CALCITE-6366:

Summary: Code generated by EnumUtils#convert should throw an exception if the target type overflows
Key: CALCITE-6366
URL: https://issues.apache.org/jira/browse/CALCITE-6366
Project: Calcite
Issue Type: Improvement
Components: core
Reporter: Ruben Q L

Code generated by EnumUtils#convert should throw an exception if the target type overflows (consider using Expressions#convertChecked).

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
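The difference between the current unchecked conversion and a checked one can be shown in plain Java. Math.toIntExact stands in here for the overflow check that code generated via Expressions#convertChecked would perform; this is an assumption about the intended behavior, not the actual generated code.

```java
public class CheckedConvert {
    public static void main(String[] args) {
        long big = 4_294_967_296L; // 2^32, does not fit in an int
        // A plain narrowing cast (the current unchecked behavior)
        // silently wraps around:
        int silent = (int) big;
        if (silent != 0) throw new AssertionError("expected silent wrap to 0");
        // A checked conversion raises an error on overflow instead:
        boolean threw = false;
        try {
            Math.toIntExact(big);
        } catch (ArithmeticException e) {
            threw = true;
        }
        if (!threw) throw new AssertionError("expected overflow to throw");
        System.out.println("overflow detected");
    }
}
```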
[jira] [Created] (CALCITE-6365) Support for RETURNING clause of JSON_QUERY
Dawid Wysakowicz created CALCITE-6365:

Summary: Support for RETURNING clause of JSON_QUERY
Key: CALCITE-6365
URL: https://issues.apache.org/jira/browse/CALCITE-6365
Project: Calcite
Issue Type: New Feature
Reporter: Dawid Wysakowicz

The SQL standard says {{JSON_QUERY}} should support a {{RETURNING}} clause similar to {{JSON_VALUE}}. Calcite supports the clause for JSON_VALUE already, but not for JSON_QUERY.

{code}
<JSON query> ::=
  JSON_QUERY <JSON API common syntax>
  [ <JSON output clause> ]
  [ <JSON query wrapper behavior> WRAPPER ]
  [ <JSON query quotes behavior> QUOTES [ ON SCALAR STRING ] ]
  [ <JSON query empty behavior> ON EMPTY ]
  [ <JSON query error behavior> ON ERROR ]

<JSON output clause> ::= RETURNING <data type> [ FORMAT <JSON representation> ]
{code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Created] (CALCITE-6364) HttpClient SPNEGO support is deprecated
Istvan Toth created CALCITE-6364:

Summary: HttpClient SPNEGO support is deprecated
Key: CALCITE-6364
URL: https://issues.apache.org/jira/browse/CALCITE-6364
Project: Calcite
Issue Type: Bug
Components: avatica
Reporter: Istvan Toth

The Avatica Java client depends on Apache HttpClient's Kerberos/SPNEGO implementation. According to HTTPCLIENT-1625, that implementation is not secure and is deprecated in newer versions.

Unfortunately, HTTPCLIENT-1625 is very scant on details, and since the reason given for the deprecation is the lack of time to fix it, it is likely not a trivial fix.

Avatica depends heavily on HttpClient, and replacing it would be a big job. While Avatica in theory has a configurable HTTP client implementation, the only non-HttpClient implementation is more of a POC and does not support ANY authentication methods.

I can see these options:
1. Find another HTTP client library and use it in Avatica.
2. Copy the SPNEGO auth code from HttpClient and fix it in Avatica.
3. Fix the SPNEGO auth code in HttpClient.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)