lidavidm commented on pull request #11145:
URL: https://github.com/apache/arrow/pull/11145#issuecomment-920207471


   Thanks for the explanations.
   
   > I think my approval still stands.
   > 
   > > This fragment, when scanned, unconditionally uses a BackgroundIterator 
and transfers to the CPU executor. Hence under stress, this could cause both 
the main thread and CPU threads to be used. I think this needs to be fixed
   > 
   > I'm not sure this is a bad thing. The rules for scanners are pretty clear. 
A ScanOptions has an I/O executor which fragments are welcome to use and, if 
they do so, they should transfer back to the CPU thread pool. So 
OneShotFragment is following the rules here. The fact that it causes an 
indexing error is a consequence of 
[ARROW-13741](https://issues.apache.org/jira/browse/ARROW-13741). If that 
latter issue were resolved then it shouldn't matter at all what the scanner is 
doing.
   
   Sorry, I meant it shouldn't transfer unconditionally to *the* CPU executor, 
but rather the "main" executor. Thanks for linking the relevant JIRA. 
   
   > 
   > I think the only thing that could be changed is adding a CPU executor to 
scan options instead of threads grabbing the internal CPU thread pool. I think 
right now we hack around this with "SafeExecute".
   
   That would be good. It would also perhaps give us more control if we start 
running multiple queries in process, if we want to divvy up resources. (Though 
maybe work-stealing or a higher level executor could take care of that instead.)
   
   > 
   > > Furthermore I think it would be useful if, during a non-parallel scan, 
we could abort or otherwise enforce that the CPU thread pool is not being used.
   > 
   > A "non-parallel" scan means that the CPU and I/O executors are a serial 
executor. So the internal code shouldn't change either way and there should be 
no possible way to use the CPU thread. If we put the CPU executor into the scan 
options then it might be good to catch places we do internal::GetCpuThreadPool 
instead of using the options executor.
   
   Ah, ok - that makes sense.
   
   > 
   > > Actually, given the talks about tracing, it would be interesting if we 
could somehow capture executor usage in the traces - particularly spawning 
thread.
   > 
   > Agreed. Thread pool tasks should be a "span" and the name of the thread 
pool (also, thread pools should be named), and possibly a worker index/id, 
should be properties of the span.
   
   I'm going to look at tracing again and would like to do this. It would also 
be nice if we could get an "async stack trace" of some sorts, perhaps by 
reconstructing via spans - right now with async, the stack trace in the 
debugger is essentially useless, unfortunately.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to