jacques-n commented on a change in pull request #10979: URL: https://github.com/apache/arrow/pull/10979#discussion_r696135671
##########
File path: format/IRFunction.fbs
##########
@@ -0,0 +1,68 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+include "Schema.fbs";
+
+namespace org.apache.arrow.ir.flatbuf;
+
+// A unique identifier for a particular function definition.
+table FunctionId {
+  // the function description identifier.
+  id: uint32;
+
+  // The origin of the function definition. Should be mapped to a domain such as "org.apache.arrow"
+  // The github definition paths for definitions are defined in
+  // github.com/apache/arrow/format/functions.txt. Org 0 == Apache Arrow canonical definitions.
+  // Organizations < 1B are canonical organizations. For private functions, use
+  // Org id > 1B

Review comment:
I think you may be interpreting this differently from my intention. Here is what I would expect to be the case:

- Function signature is different from implementation. Signature defines the semantics of the function. Each function signature has an orgId:functionId identifier. The route to the signature and semantics would be either:
  - arrow org id => arrow function/semantics doc (e.g. yaml)
  - arrow org list => third party function signature/semantics doc
- A consumer of the plan (e.g.
an execution engine) would reject a plan if any orgId:functionId identifiers are not implemented. If they are all implemented, the consumer should be able to complete the plan.
- Known orgids are defined in the apache arrow repository so that people can go somewhere to get the specification for
- We could, at some point, have a way to map function signatures to implementations, but initially I wouldn't try to build this into the specification.
- In Arrow there would be ~2000-4000 well-defined "sql" function signatures. One or more of the language implementations will have their own way to bind from signature to function implementation.
- Producers and consumers could coordinate their available signatures. For example, a SQL parsing layer could probe an engine to get a list of available function signatures. Then the parsing layer could validate/parse the entire tree even though it didn't previously know about a particular function (for example, UDFs registered with a consumer).

So to your questions:

> I.E. say there's some arbitrary function that is standardized like functionA and there's both a C++ and Rust implementation of it. Would they have the same function IDs in different orgs or would they have the same org with different function IDs?

If functionA is defined at the Arrow project level, I'd expect a single function signature with two implementations. Each engine would be responsible for binding to the correct implementation, but the function would have well-defined semantics and would be expected to behave the same way. If functionA is some arbitrary third party function, you'd likely still have a single function signature; it would just be namespaced outside the project (via an orgid pointer).

> How would an IR producer allow arbitrarily targeting either implementation?

I would expect a plan to be targeted to a single consumer. So you would submit to whatever consumer you wanted to use.
If you wanted to split a plan so that some of it ran in one consumer and some in a different consumer, I'd see that as a consumer that knew how to split a plan and then submitted plan segments to different sub-consumers. In other words, I don't think the plan itself should be expressing what consumer should be used. Maybe I'm not understanding the exact situation you're trying to solve that I didn't outline above.

In general, I assume that you may stack a number of "filters" between your original producer and one or more ultimate consumers, but that this is based on the intelligence of each consumer, not something in the plan primitives. For example, I could see something like this:

```
root producer
      |
plan optimizer
      |
   engine 0
   /      \
engine 1  engine 2
```

In this example, engine 0 is a coordinating engine and further delegates plan segments to engine 1 and 2. Each engine then makes decisions about implementation of a function signature.
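
To make the accept/reject behavior from the list above concrete, here is a minimal Python sketch of how a consumer might check a plan's orgId:functionId identifiers and how a producer might probe it for available signatures. Everything in it is an assumption for illustration: `FunctionKey`, `Consumer`, `available_signatures`, `missing`, and `accepts` are hypothetical names, not part of the proposed schema or of any Arrow API, and a plan is reduced to a flat list of the identifiers it references.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Iterable, List, Set, Tuple

# Hypothetical stand-in for the orgId:functionId pair: (org_id, function_id).
FunctionKey = Tuple[int, int]

# From the schema comment: org 0 == Apache Arrow canonical definitions.
ARROW_ORG_ID = 0


@dataclass
class Consumer:
    """Hypothetical plan consumer (e.g. an execution engine)."""

    name: str
    # Signatures this consumer knows how to bind to an implementation.
    implemented: Set[FunctionKey] = field(default_factory=set)

    def available_signatures(self) -> FrozenSet[FunctionKey]:
        """Lets a producer (e.g. a SQL parsing layer) probe what is supported."""
        return frozenset(self.implemented)

    def missing(self, plan_functions: Iterable[FunctionKey]) -> List[FunctionKey]:
        """Identifiers used by the plan that have no implementation here.

        Each missing identifier is reported once, in first-seen order.
        """
        seen: Set[FunctionKey] = set()
        result: List[FunctionKey] = []
        for key in plan_functions:
            if key not in self.implemented and key not in seen:
                seen.add(key)
                result.append(key)
        return result

    def accepts(self, plan_functions: Iterable[FunctionKey]) -> bool:
        """A plan is accepted only if every identifier it uses is implemented."""
        return not self.missing(plan_functions)


if __name__ == "__main__":
    engine = Consumer("engine 0", {(ARROW_ORG_ID, 1), (ARROW_ORG_ID, 2)})
    # A private function under an org id above 1B that this engine lacks.
    plan = [(ARROW_ORG_ID, 1), (2_000_000_000, 7)]
    print(engine.accepts(plan), engine.missing(plan))
```

Note that nothing in the plan names the consumer. The producer picks one, submits the plan, and the consumer decides on its own whether it can bind every signature.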
