On Wed, Sep 16, 2020 at 10:49 AM Adam Hooper <a...@adamhooper.com> wrote:
>
> On Tue, Sep 15, 2020 at 1:00 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > We have additional problems in that some file-loading related tasks do
> > a mixture of CPU work and IO work, and once a thread has been
> > dispatched to execute one of these tasks, when IO takes place, a CPU
> > core may sit underutilized while the IO is waiting.
> >
>
> Maybe it's best to say what *won't* be solved....
>
> For instance: in Workbench [https://www.workbenchdata.com], we use Arrow to
> transfer data between processes and support larger-than-RAM tables. So we
> use mmap() and expect the kernel to swap mmapped data out of RAM. It
> wouldn't be fair to call calculations "CPU-bound" because we expect them to
> page-fault -- cause blocking IO.
>
> Is mmap() the norm? Maybe there's not much point in distinguishing between
> IO and CPU. Maybe Parquet-reading is best treated as an exception....

A lot of data is stored in remote filesystems (cloud storage, etc.);
it doesn't seem acceptable to "lock" a CPU core while waiting for a
read call to S3 to return.
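To make that concrete, here is a minimal sketch of the idea of keeping blocking reads off the CPU workers by using two thread pools. The pool sizes, `fetch`, `decode`, and `load` are all illustrative stand-ins, not Arrow's actual API:

```python
import concurrent.futures
import time

# Hypothetical split: a large IO pool for blocking reads (e.g. S3 GETs)
# and a small CPU pool sized to the core count, so a stalled read never
# occupies a CPU worker.
io_pool = concurrent.futures.ThreadPoolExecutor(max_workers=16)
cpu_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def fetch(key):
    # Stand-in for a blocking remote read.
    time.sleep(0.01)
    return b"column-bytes-for-" + key.encode()

def decode(buf):
    # Stand-in for CPU-bound decoding work.
    return buf.upper()

def load(key):
    buf = io_pool.submit(fetch, key).result()     # waits in the IO pool
    return cpu_pool.submit(decode, buf).result()  # runs in the CPU pool

result = load("part-0.parquet")
print(result)
```

The point of the split is only that the CPU pool's queue never contains tasks that spend most of their time blocked on the network.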

>
> For example, it strikes me as similar to concurrency issues
> > inside an analytic database. How are they preventing concurrent
> > workload starvation problems or handling CPU/IO task scheduling to
> > avoid CPU underutilization?
> >
>
> RDBMSs don't prevent concurrent workload starvation. They use process
> pools; they provide knobs; and they train an army of specialists to find
> compromises.

I think I'm saying something different. I'm talking about our present
FIFO task scheduling and whether that's desirable versus allowing
independent parallelizable workloads to share the thread pool to some
extent, even when the workloads don't all start at the same time
(i.e., so that a "late arriving" workload doesn't end up queued behind
all the tasks that earlier workloads have already pushed into the
task queue).
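The starvation effect is easy to demonstrate with a toy FIFO pool (the sleep durations and pool size here are arbitrary, chosen only to make the queuing visible):

```python
import concurrent.futures
import time

pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)  # small pool

def task(name):
    time.sleep(0.05)  # stand-in for a unit of work
    return name

# Workload A floods the FIFO queue first...
a = [pool.submit(task, f"A{i}") for i in range(8)]

# ...so workload B's single task, arriving late, sits behind all of
# A's queued tasks even though B is an independent workload.
start = time.monotonic()
b = pool.submit(task, "B0")
b.result()
elapsed = time.monotonic() - start
print(f"B waited {elapsed:.2f}s behind A's FIFO queue")
```

With 8 queued tasks, 2 workers, and 0.05s per task, B cannot even start for roughly 0.2s; a fairness-aware scheduler could interleave B's task with A's.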

> For instance, Postgres supports parallel SELECT. The knobs are
> max_parallel_workers (default 8) and max_parallel_workers_per_gather
> (default 2): the former sets the process-pool size and the latter is a nod
> to fairness. The kernel handles scheduling. When all worker processes are
> busy, Postgres evades its own parallel-query logic and runs the entire
> query on the client-connection process itself. (Each connection is a
> separate process.) That knob is max_connections (default 100). So in
> effect, Postgres by default limits itself to 108 concurrent processes --
> and they can be all IO-bound or all CPU-bound.
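(For reference, the knobs described above live in postgresql.conf; a minimal sketch with the defaults cited:

```ini
# postgresql.conf -- defaults as cited above
max_parallel_workers = 8             # size of the parallel worker pool
max_parallel_workers_per_gather = 2  # fairness: workers per Gather node
max_connections = 100                # one backend process per connection
```

So the 108-process ceiling is max_connections plus max_parallel_workers.)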
>
> It is dead-easy to cripple an RDBMS. Heck, it's dead-easy to cripple an
> entire *computer* using an RDBMS. Also, Postgres doesn't use all processes
> by default, even when they're all available.
>
> Enter DB administrators. A DB administrator monitors queries, finds
> overly-expensive ones, and *solves the problem*. Sometimes, the solution is
> to tweak knobs. But there are so many other solutions: optimize the query;
> add an index; call the query less often (or at night); move it to a
> read-only replica; take it out of the RDBMS entirely; or even *change the
> project requirements* so the query won't be needed.
>
> The *team* solves the problem. The DB administrator can't do any of these
> things in a vacuum.
>
> Point being: it's about people. A good RDBMS provides tools. Those tools
> help a team discover which of the RDBMS's multitude of weaknesses are most
> urgent. The RDBMS provides few knobs and much documentation. The team
> selects compromises.
>
> I think a good bullet for your list of requirements is: "simple enough to
> explain to a non-programmer." A disaster *will* happen; and someone will
> need to explain it to the boss....
>
> Enjoy life,
> Adam
>
> --
> Adam Hooper
> +1-514-882-9694
> http://adamhooper.com
