[
https://issues.apache.org/jira/browse/CALCITE-849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14999669#comment-14999669
]
Julian Hyde commented on CALCITE-849:
-------------------------------------
Sorry I've been tardy in replying.
I was thinking of something a bit different for co-operative multi-tasking,
which is undoubtedly more invasive than your proposal but yields bigger
returns. My idea was to change the basic API away from Enumerable and instead
have a "work" method that does some work and returns when the input is empty,
the output is full, or some quantum is exceeded:
{code}
interface Useful {
  void work(Input input, Output output, int quantum);
}
{code}
Implementation for {{Filter(deptno < 20)}}:
{code}
public void work(Input input, Output output, int quantum) {
  for (;;) {
    if (!input.take()) {
      return; // input is empty
    }
    Row row = input.current();
    int deptno = row.getInt(3);
    if (deptno < 20) {
      output.send(row);
      if (output.isFull()) {
        return; // output is full
      }
    }
  }
}
{code}
Note that this method doesn't even check {{quantum}}. Not necessary in this
case, because the input buffer size limits the amount of work.
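For an operator whose work is not bounded by its input buffer, the quantum could be checked explicitly. The following is only a minimal sketch: {{Row}}, {{Input}} and {{Output}} are hypothetical stand-ins (the comment above does not define them), and the quantum is assumed to be counted in rows examined, which is one possible reading.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical row: a flat int array, just enough to support getInt. */
final class Row {
  private final int[] values;
  Row(int... values) { this.values = values; }
  int getInt(int ordinal) { return values[ordinal]; }
}

/** Hypothetical input buffer; take() advances to the next row, if any. */
final class Input {
  private final ArrayDeque<Row> pending = new ArrayDeque<>();
  private Row current;
  void add(Row row) { pending.add(row); }
  boolean take() { current = pending.poll(); return current != null; }
  Row current() { return current; }
  int size() { return pending.size(); }
}

/** Hypothetical bounded output buffer. */
final class Output {
  private final int capacity;
  private final List<Row> rows = new ArrayList<>();
  Output(int capacity) { this.capacity = capacity; }
  void send(Row row) { rows.add(row); }
  boolean isFull() { return rows.size() >= capacity; }
  int size() { return rows.size(); }
}

interface Useful {
  void work(Input input, Output output, int quantum);
}

/** Filter(deptno < 20) that also stops after examining `quantum` rows. */
final class QuantumFilter implements Useful {
  @Override public void work(Input input, Output output, int quantum) {
    int examined = 0;
    // Assumption: quantum is a row count, not a time slice.
    while (examined < quantum && !output.isFull()) {
      if (!input.take()) {
        return; // input is empty
      }
      examined++;
      Row row = input.current();
      if (row.getInt(3) < 20) {
        output.send(row);
      }
    }
  }
}
```

Each call returns for one of three reasons: the input is empty, the output is full, or the quantum is used up. In every case the unread rows stay in the input buffer for the next call.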
I believe this interface is superior to Enumerable because it operates on
many rows at a time. It wouldn't be hard to get rid of the
Row object and not use any heap objects at all.
Your proposed changes seem non-invasive, but they also seem fragile, because
each consumer would have to be aware that sometimes 'end of data' really means
'end of data (for now)'.
Also, Enumerable receives a row from the producer by calling the producer. That
is, it uses pull mode. Quite often streams need push mode -- for example when
splitting a stream to two consumers based on a boolean condition, or when the
producer is very bursty and we don't want to tie up a thread. My proposed API
isn't push or pull, but puts control into the hands of a scheduler, which can
then simulate either push or pull.
A simple scheduler could invoke all operators one at a time in a loop. A more
complex scheduler could use a pool of worker threads. When you say
{quote}Design-wise, I'm thinking about managing requests through a queue with a simple
shared state of:
* running
* next N rows.{quote}
I think you're talking about what I call a scheduler.
Each worker thread repeatedly calls the scheduler and gets told one of: (a) here's
some work, (b) wait a while, (c) die.
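The simple "one at a time in a loop" scheduler might look roughly like this. It is a single-threaded sketch under assumptions: {{Task}}, {{FilterTask}}, {{SinkTask}} and {{RoundRobinScheduler}} are hypothetical names, rows are plain integers, and each task is an operator already bound to its buffers.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical schedulable unit: an operator bound to its buffers. */
interface Task {
  /** Does at most `quantum` rows of work; returns true if any was done. */
  boolean step(int quantum);
}

/** Passes values below `limit` from `in` to `out`, respecting out's capacity. */
final class FilterTask implements Task {
  private final Queue<Integer> in;
  private final Queue<Integer> out;
  private final int outCapacity;
  private final int limit;

  FilterTask(Queue<Integer> in, Queue<Integer> out, int outCapacity, int limit) {
    this.in = in;
    this.out = out;
    this.outCapacity = outCapacity;
    this.limit = limit;
  }

  @Override public boolean step(int quantum) {
    int examined = 0;
    while (examined < quantum && out.size() < outCapacity && !in.isEmpty()) {
      int value = in.remove();
      examined++;
      if (value < limit) {
        out.add(value);
      }
    }
    return examined > 0;
  }
}

/** Drains a buffer into a result list. */
final class SinkTask implements Task {
  private final Queue<Integer> in;
  final List<Integer> results = new ArrayList<>();

  SinkTask(Queue<Integer> in) {
    this.in = in;
  }

  @Override public boolean step(int quantum) {
    int moved = 0;
    while (moved < quantum && !in.isEmpty()) {
      results.add(in.remove());
      moved++;
    }
    return moved > 0;
  }
}

/** Calls every task in turn until a whole pass makes no progress. */
final class RoundRobinScheduler {
  private final List<Task> tasks;
  private final int quantum;

  RoundRobinScheduler(List<Task> tasks, int quantum) {
    this.tasks = tasks;
    this.quantum = quantum;
  }

  void run() {
    boolean progress = true;
    while (progress) {
      progress = false;
      for (Task task : tasks) {
        progress |= task.step(quantum); // non-short-circuit: always steps
      }
    }
  }
}
```

This version stops when no task can make progress, which only makes sense for finite input. For a stream, a pass with no progress would correspond to the "wait a while" answer rather than to termination.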
{quote}From there, my open question is how should two different calls to 'run'
on different threads be viewed by the callee on the same statement? Should the
second one see the next set of rows or should it go back to the beginning and
see the results from there?{quote}
I think that's a question about streaming SQL semantics. I believe that a
stream should act like a "topic" in JMS, so every consumer sees every row. If
two statements execute 'select stream * from Orders', each will see every order.
If one statement executes 'select stream * from Orders union all select stream
* from Orders' it will see two copies of each order. There might also be a concept like
a "queue" in JMS, so precisely one consumer sees each row, but it shouldn't be
the default.
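To make the difference concrete, here is a toy sketch of the two delivery modes. {{Topic}} and {{WorkQueue}} are hypothetical classes, not JMS or Calcite APIs, and the round-robin choice of consumer in {{WorkQueue}} is just one assumed policy.

```java
import java.util.ArrayList;
import java.util.List;

/** Topic semantics: every subscriber sees every row. */
final class Topic<T> {
  private final List<List<T>> inboxes = new ArrayList<>();

  List<T> subscribe() {
    List<T> inbox = new ArrayList<>();
    inboxes.add(inbox);
    return inbox;
  }

  void publish(T row) {
    for (List<T> inbox : inboxes) {
      inbox.add(row);
    }
  }
}

/** Queue semantics: precisely one consumer sees each row. */
final class WorkQueue<T> {
  private final List<List<T>> inboxes = new ArrayList<>();
  private int next;

  List<T> subscribe() {
    List<T> inbox = new ArrayList<>();
    inboxes.add(inbox);
    return inbox;
  }

  void publish(T row) {
    if (inboxes.isEmpty()) {
      return; // no consumers; this sketch simply drops the row
    }
    inboxes.get(next).add(row); // assumed policy: round-robin
    next = (next + 1) % inboxes.size();
  }
}
```

With two subscribers on a {{Topic}}, three published orders give each subscriber three rows, so concatenating both inboxes (the 'union all' case) gives six. With a {{WorkQueue}}, the same three orders are split between the two consumers.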
> Streams/Slow iterators dont close on statement close
> ----------------------------------------------------
>
> Key: CALCITE-849
> URL: https://issues.apache.org/jira/browse/CALCITE-849
> Project: Calcite
> Issue Type: Bug
> Reporter: Jesse Yates
> Assignee: Julian Hyde
> Fix For: 1.5.0
>
> Attachments: calcite-849-bug.patch
>
>
> This is easily seen when querying an infinite stream with a clause that
> cannot be matched
> {code}
> select stream PRODUCT from orders where PRODUCT LIKE 'noMatch';
> select stream * from orders where PRODUCT LIKE 'noMatch';
> {code}
> The issue arises when accessing the results in a multi-threaded context. Yes,
> it's not a good idea (and things will break, like here). However, this case
> feels like it ought to be an exception.
> Suppose you are accessing a stream and have a query that doesn't match
> anything on the stream for a long time. Because of the way a ResultSet is
> built, the call to executeQuery() will hang until the first matching result
> is received. In that case, you might want to cancel the query because it's
> taking so long. You also want the thing that's accessing the stream (the
> StreamTable implementation) to cancel the querying/collection - via a call to
> close on the passed iterator/enumerable.
> Since the first result was never generated, the ResultSet was never returned
> to the caller. You can get around this by using a second thread and keeping a
> handle to the creating statement. When you go to close that statement though,
> you end up not closing the cursor (and the underlying iterables/enumerables)
> because it never finished getting created.
> It gets even more problematic if you use select * as the iterable doesn't
> finish getting created in the AvaticaResultSet.