[GitHub] [arrow] jacques-n commented on a change in pull request #10979: [RFC] Alternative IR approach

GitBox Wed, 25 Aug 2021 14:45:36 -0700


jacques-n commented on a change in pull request #10979:
URL: https://github.com/apache/arrow/pull/10979#discussion_r696135671




##########
File path: format/IRFunction.fbs
##########
@@ -0,0 +1,68 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+include "Schema.fbs";
+
+namespace org.apache.arrow.ir.flatbuf;
+
+// A unique identifier for a particular function definition. 
+table FunctionId {
+  // the function description identifier.
+  id: uint32;
+  
+  // The origin of the function definition. Should be mapped to a domain such as "org.apache.arrow"
+  // The github definition paths for definitions are defined in 
+  // github.com/apache/arrow/format/functions.txt. Org 0 == Apache Arrow canonical definitions.
+  // Organizations < 1B are canonical organizations. For private functions, use
+  // Org id > 1B

Review comment:
       I think you may be interpreting this differently from what I intended. 
Here is what I would expect to be the case:
   
   - A function signature is distinct from its implementation. The signature 
defines the semantics of the function. Each function signature has an 
orgId:functionId identifier. The route to the signature and semantics would be 
either:
     - arrow org id => arrow function/semantics doc (e.g. yaml)
     - arrow org list => third party function signature/semantics doc
   - A consumer of the plan (e.g. an execution engine) would reject a plan if 
any of its orgId:functionId identifiers are not implemented. If they are all 
implemented, the consumer should be able to complete the plan.
   - Known orgids are defined in the Apache Arrow repository so that people 
have somewhere to go to get the specification for each org's functions.
   - We could, at some point, have a way to map function signatures to 
implementations, but initially I wouldn't try to build this into the 
specification.
   - In Arrow there would be ~2000-4000 well-defined "sql" function 
signatures. One or more of the language implementations will have their own way 
to bind from signature to function implementation.
   - Producers and consumers could coordinate their available signatures. For 
example, a SQL parsing layer could probe an engine to get a list of available 
function signatures. The parsing layer could then validate/parse the entire 
tree even though it didn't previously know about a particular function (for 
example, UDFs registered with a consumer).
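
   The list above could be sketched roughly as follows. This is only an 
illustration of the idea, not part of the proposed format: the `FunctionId` 
and `Consumer` classes, their method names, the example ids, and reading "1B" 
as 1,000,000,000 are all assumptions made for the example.

```python
from dataclasses import dataclass

# Assumption: the "1B" boundary from the schema comment, read as 1_000_000_000.
PRIVATE_ORG_THRESHOLD = 1_000_000_000


@dataclass(frozen=True)
class FunctionId:
    """Hypothetical orgId:functionId pair identifying a function signature."""

    org_id: int       # 0 == Apache Arrow canonical definitions
    function_id: int  # identifier of the signature within that org

    def is_private(self) -> bool:
        # Org ids above the threshold are reserved for private functions.
        return self.org_id > PRIVATE_ORG_THRESHOLD


class Consumer:
    """Hypothetical execution engine that implements a fixed set of signatures."""

    def __init__(self, implemented):
        self._implemented = frozenset(implemented)

    def available_signatures(self):
        # What a producer (e.g. a SQL parsing layer) could probe for.
        return self._implemented

    def unsupported(self, plan_function_ids):
        # Signatures referenced by the plan that this consumer does not implement.
        missing = {f for f in plan_function_ids if f not in self._implemented}
        return sorted(missing, key=lambda f: (f.org_id, f.function_id))

    def accepts(self, plan_function_ids) -> bool:
        # Reject the plan if any referenced signature is not implemented.
        return not self.unsupported(plan_function_ids)


# Made-up ids, including one UDF registered under a private org id.
engine = Consumer({FunctionId(0, 1), FunctionId(0, 2), FunctionId(2_000_000_000, 7)})
plan_ok = [FunctionId(0, 1), FunctionId(2_000_000_000, 7)]
plan_bad = [FunctionId(0, 1), FunctionId(0, 99)]
```

   Here a producer could call `available_signatures()` before building a plan, 
and the consumer would use `accepts()` to reject any plan that references a 
signature it has not implemented.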
   
   So to your questions:
   
   > I.E. say there's some arbitrary function that is standardized like 
functionA and there's both a C++ and Rust implementation of it. Would they have 
the same function IDs in different orgs or would they have the same org with 
different function IDs? 
   
   If functionA is defined at the Arrow project level, I'd expect a single 
function signature with two implementations. Each engine would be responsible 
for binding to the correct implementation, but the function would have 
well-defined semantics and would be expected to behave the same way.
   
   If functionA is some arbitrary third-party function, you'd likely still have 
a single function signature; it would just be namespaced outside the project 
(via an orgid pointer). 
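
   As a rough sketch of "one signature, several implementations": the `Engine` 
class, the `SQRT` identifier, and its id value below are made up for 
illustration and are not part of the proposal. Each engine binds the same 
signature to its own implementation.

```python
import math

# Hypothetical signature key: (org_id, function_id). Org 0 stands for the
# Arrow canonical definitions; the function id 42 is made up.
SQRT = (0, 42)


class Engine:
    """Hypothetical engine holding its own signature-to-implementation bindings."""

    def __init__(self, name):
        self.name = name
        self._bindings = {}

    def bind(self, signature, implementation):
        # Binding is an engine-local decision and is not recorded in the plan.
        self._bindings[signature] = implementation

    def call(self, signature, *args):
        try:
            impl = self._bindings[signature]
        except KeyError:
            raise LookupError(
                f"{self.name}: no implementation bound for {signature}"
            ) from None
        return impl(*args)


# Two engines, one signature, two different implementations.
engine_a = Engine("engine-a")
engine_b = Engine("engine-b")
engine_a.bind(SQRT, math.sqrt)
engine_b.bind(SQRT, lambda x: x ** 0.5)
```

   A plan that references `SQRT` would be valid for either engine, and both 
would be expected to produce the same result for the same input.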
   
   > How would an IR producer allow arbitrarily targeting either implementation?
   
   I would expect a plan to be targeted to a single consumer, so you would 
submit it to whatever consumer you wanted to use. If you wanted to split a plan 
so that part of it ran in one consumer and part in another, I'd see that as a 
consumer that knows how to split a plan and then submits plan segments to 
different sub-consumers. In other words, I don't think the plan itself should 
express which consumer should be used. Maybe I'm not understanding exactly what 
situation you're trying to solve that I didn't outline above. In general, I 
assume that you may stack a number of "filters" between your original producer 
and one or more ultimate consumers, but that this is based on the intelligence 
of each consumer, not something in the plan primitives. For example, I could 
see something like this:
   
   ```
          root producer
            |
          plan optimizer
            |
        engine 0
       /        \
   engine 1    engine 2
   ```
   
   In this example, engine 0 is a coordinating engine and further delegates 
plan segments to engine 1 and 2. Each engine then makes decisions about 
implementation of a function signature.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
