Re: Re: Translate join predicates into filters

Stamatis Zampetakis Tue, 23 May 2023 02:09:57 -0700

There is also the EnumerableBatchNestedLoopJoin [1] algorithm which
can batch multiple values into one request. There are also some past
discussions in the list around this concept (for instance [2]).


Best,
Stamatis

[1] 
https://github.com/apache/calcite/blob/9c33d7aeefe082bf5f7be617ef231e1285418a6c/core/src/main/java/org/apache/calcite/adapter/enumerable/EnumerableBatchNestedLoopJoinRule.java
[2] https://lists.apache.org/thread/wtdr5lvycfwz6k89kzry73k4tw0lqxbx

On Tue, May 23, 2023 at 1:50 AM Luca Marchi <[email protected]> wrote:
>
> Hi, thanks for the reply!
>
> Right, the nested loop approach iterates each row, this means that we can not 
> push the references as filter of the `book` table in one request, but one by 
> one.
>
> The second approach sounds promising; can you show us a similar example?
> Thanks!
>
> On 2023/05/19 01:14:49 Julian Hyde wrote:
> > I expect that your table works if you put the filter in the WHERE clause, 
> > e.g.
> >
> >   SELECT *
> >   FROM Books AS b
> >   WHERE b.id <http://b.id/> IN (1, 10, 27)
> >
> > and it does so using FilterTableScanRule (which matches a Filter on top of 
> > a Scan of a FilterableTable). But you need a new planner rule that can 
> > convert a Join whose right input is a Scan of a FilterableTable into a 
> > NestedLoopsJoin that dynamically sets the filter for each row from the 
> > left. (In this query, config would be on the left, Books on the right.)
> >
> > There could be a more efficient version that gathers all IDs from the left, 
> > then does one request to the right, and them something like a hash join.
> >
> > Julian
> >
> >
> >
> >
> > > On May 17, 2023, at 8:41 AM, Luca Marchi <[email protected]> wrote:
> > >
> > > Morning everyone,
> > > in our company we are running a POC using Apache Calcite, and we would 
> > > like to collect some feedbacks from you for the scenario mentioned below.
> > >
> > > There is a service API that allows retrieving some `Book`s, and we would 
> > > like to build a table adapter on top of this service; this API
> > > only accepts a set of IDs, and if no IDs are provided, no result is 
> > > returned.
> > >
> > > ```
> > >  interface BookService {
> > >    /** Returns the books matching the given IDs.
> > >     *
> > >     * <p>If not IDs is provided, no result is returned.
> > >     */
> > >    List<Book> findBooksByIds(Set<String> ids);
> > >  }
> > >
> > >  record Book(String id, String title) {};
> > > ```
> > >
> > > A requirement of this table is that it has to support join, and we would 
> > > like to support joining by ID in an efficient way.
> > >
> > > The goal is to define a rule that forces the query planner to always push 
> > > down join predicates into a table scan.
> > >
> > > Given the following `book` table:
> > >
> > > ```java
> > > /** A table which represents books, queryable only by their ID. */
> > > final class BookTable extends AbstractTable implements FilterableTable {
> > >  private final BookService service;
> > >
> > >  BookTable(BookService service) {
> > >    this.service = service;
> > >  }
> > >
> > >  @Override
> > >  public RelDataType getRowType(RelDataTypeFactory typeFactory) {
> > >    return new RelDataTypeFactory.Builder(typeFactory)
> > >        .add("id", SqlTypeName.VARCHAR)
> > >        .add("title", SqlTypeName.VARCHAR)
> > >        .build();
> > >  }
> > >
> > >  @Override
> > >  public Enumerable<Object[]> scan(DataContext root, List<RexNode> 
> > > filters) {
> > >    Set<String> bookIds = getBooksId(filters);
> > >    List<Object[]> result = service.findBooksByIds(bookIds)
> > >        .stream()
> > >        .map(b -> new Object[]{b.id, b.title})
> > >        .toList();
> > >
> > >    return Linq4j.asEnumerable(result);
> > >  }
> > >
> > >  private static Set<String> getBooksId(List<RexNode> filters) {
> > >    if (filters.size() != 1) {
> > >      throw new IllegalArgumentException("Expected one filter to the ID, 
> > > found: %d".formatted(filters.size()));
> > >    }
> > >
> > >    RexNode filter = filters.get(0);
> > >    RexNode leftCondition = ((RexCall) filter).getOperands().get(0);
> > >    RexNode rightCondition = ((RexCall) filter).getOperands().get(1);
> > >
> > >    if (leftCondition instanceof RexInputRef left
> > >        && rightCondition instanceof RexLiteral right
> > >        // The index of the ID column is 1.
> > >        && left.getIndex() == 1) {
> > >      if (filter.isA(SqlKind.EQUALS)) {
> > >        String bookId = right.getValue2().toString();
> > >        return ImmutableSet.of(bookId);
> > >      }
> > >      if (filter.isA(SqlKind.SEARCH)) {
> > >        @SuppressWarnings("unchecked")
> > >        Sarg<NlsString> searchArguments = right.getValueAs(Sarg.class);
> > >        return searchArguments.rangeSet.asRanges().stream()
> > >            .map(Range::lowerEndpoint)
> > >            .map(NlsString::getValue)
> > >            .collect(toSet());
> > >      }
> > >    }
> > >    throw new IllegalArgumentException("Unexpected operator, found: 
> > > %s".formatted(filter.getKind()));
> > >  }
> > > }
> > > ```
> > >
> > > The API of the `BookService` always expects a set of IDs, and in case of 
> > > query like (assuming an entry in `book` matching the ID `'a'`):
> > > ```sql
> > > WITH config (id, val) AS (
> > >    VALUES ('a', 3), ('b', 5)
> > > )
> > > SELECT b.* FROM books b
> > > INNER JOIN config ON b.id = config.id
> > > WHERE config.val > 4
> > > ```
> > >
> > > Calcite produces the following plan:
> > > ```
> > > EnumerableCalc(expr#0..2=[{inputs}], id=[$t1], title=[$t2])
> > >  EnumerableMergeJoin(condition=[=($0, $1)], joinType=[inner])
> > >    EnumerableSort(sort0=[$0], dir0=[ASC])
> > >      EnumerableCalc(expr#0..1=[{inputs}], id=[$t0])
> > >        EnumerableTableScan(table=[[books]])
> > >    EnumerableSort(sort0=[$0], dir0=[ASC])
> > >      EnumerableTableScan(table=[[books]])
> > > ```
> > >
> > > This means Calcite performs a full table scan of the `book` table, and 
> > > since the `RexNode` filters in the `scan` method are empty, no result is 
> > > returned (in this example we are using a value statement scoped views, 
> > > but ideally the solution we are looking for should be valid for other 
> > > table).
> > > Under some defined circustances, Postgres generates Nested-Loop query 
> > > plan for join: it first selects the row of the table A matching a given 
> > > condition, then iterates over the retrieved rows and performs a scan of 
> > > table B looking for rows that match the join condition; and this seems 
> > > something we would like to enforce it here.
> > >
> > > In summary, we would like to implement a table which in case of JOIN is 
> > > capable of loading the individual IDs matching the API of our service, 
> > > rather than performing a full table scan.
> > >
> > > Do you have any advises/feedback for us?
> > >
> > > Thanks in advance.
> >
> >

Re: Re: Translate join predicates into filters

Reply via email to