[jira] [Commented] (CALCITE-849) Streams/Slow iterators dont close on statement close

Julian Hyde (JIRA) Tue, 10 Nov 2015 16:27:41 -0800

    [ 
https://issues.apache.org/jira/browse/CALCITE-849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14999669#comment-14999669
 ]


Julian Hyde commented on CALCITE-849:
-------------------------------------

Sorry I've been tardy in replying.

I was thinking of something a bit different for co-operative multi-tasking, 
which is undoubtedly more invasive than your proposal but yields bigger 
returns. My idea was to change the basic API away from Enumerable and instead 
have a "work" method that does some work and returns when the input is empty, 
the output is full, or some quantum is exceeded:

{code}
interface Useful {
  void work(Input input, Output output, int quantum);
}
{code}

Implementation for {{Filter(deptno < 20)}}:

{code}
  public void work(Input input, Output output, int quantum) {
    for (;;) {
      if (!input.take()) {
        return;
      }
      Row row = input.current();
      int deptno = row.getInt(3);
      if (deptno < 20) {
        output.send(row);
        if (output.isFull()) {
          return;
        }        
      }
    }
  }
{code}

Note that this method doesn't even check {{quantum}}. Not necessary in this 
case, because the input buffer size limits the amount of work.

The reason that I believe this interface is superior to Enumerable is because 
it is operating on many rows at a time. It wouldn't be hard to get rid of the 
Row object and not use any heap objects at all.

Your proposed changes seem non-invasive but they seem to be fragile, because 
each consumer would have to be aware that sometimes 'end of data' really means 
'end of data (for now)'.

Also, Enumerable receives a row from the producer by calling the producer. That 
is, it uses pull mode. Quite often streams need push mode -- for example when 
splitting a stream to two consumers based on a boolean condition, or when the 
producer is very bursty and we don't want to tie up a thread. My proposed API 
isn't push or pull, but puts control into the hands of a scheduler, which can 
then simulate  either push or pull.

A simple scheduler could invoke all operators one at a time in a loop. A more 
complex scheduler could use a pool of worker threads. When you say {quote} 
Design-wise, I'm thinking about managing requests through a queue with a simple 
shared state of:
* running
* next N rows.{quote} I think you're talking about what I call a scheduler. 
Each worker thread repeatedly calls the scheduler can gets told (a) here's some 
work, (b) wait a while, (c) die.

{quote}From there, my open question is how should two different calls to 'run' 
on different threads be viewed by the callee on the same statement? Should the 
second one see the next set of rows or should it go back to the beginning and 
see the the results from there?{quote}

I think that's a question about streaming SQL semantics. I believe that a 
stream should act like a "topic" in JMS, so every consumer sees every row. If 
two statements execute 'select stream * from Orders' they will see the orders. 
If one statement executes 'select stream * from Orders union all select stream 
* from Orders' it will see 2 of each order. There might also be a concept like 
a "queue" in JMS, so precisely one consumer sees each row, but it shouldn't be 
the default.

> Streams/Slow iterators dont close on statement close
> ----------------------------------------------------
>
>                 Key: CALCITE-849
>                 URL: https://issues.apache.org/jira/browse/CALCITE-849
>             Project: Calcite
>          Issue Type: Bug
>            Reporter: Jesse Yates
>            Assignee: Julian Hyde
>             Fix For: 1.5.0
>
>         Attachments: calcite-849-bug.patch
>
>
> This is easily seen when querying an infinite stream with a clause that 
> cannot be matched
> {code}
> select stream PRODUCT from orders where PRODUCT LIKE 'noMatch';
> select stream * from orders where PRODUCT LIKE 'noMatch';
> {code}
> The issue arises when accessing the results in a multi-threaded context. Yes, 
> its not a good idea (and things will break, like here). However, this case 
> feels like it ought to be an exception.
> Suppose you are accessing a stream and have a query that doesn't match 
> anything on the stream for a long time. Because of the way a ResultSet is 
> built, the call to executeQuery() will hang until the first matching result 
> is received. In that case, you might want to cancel the query because its 
> taking so long. You also want the thing that's accessing the stream (the 
> StreamTable implementation) to cancel the querying/collection - via a call to 
> close on the passed iterator/enumerable.
> Since the first result was never generated, the ResultSet was never returned 
> to the caller. You can get around this by using a second thread and keeping a 
> handle to the creating statement. When you go to close that statement though, 
> you end up not closing the cursor (and the underlying iterables/enumberables) 
> because it never finished getting created.
> It gets even more problematic if you are use select * as the iterable doesn't 
> finish getting created in the AvaticaResultSet.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (CALCITE-849) Streams/Slow iterators dont close on statement close

Reply via email to