On 28/10/16 15:58, A. Soroka wrote:
That's right-- if I understand it correctly, there are two steps:
POSIX sort to produce the index orderings, and then packing the
actual index files.

For the POSIX sort step, it's certainly true that more parallelism
than needed would be a bad thing. With Andy's help I just made a
commit that allows a little more control using the common --parallel
flag for sort. But the current ergonomics seem suboptimal. E.g. with
the current settings indexing a 300Mt dataset on a 24-core box with
fast storage, I saw only one core in full use, and very little IO
usage. Passing some parallelism in via the sort flag brought several
more cores into play and cut the time spent by two-thirds.
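To illustrate the knob in question, here is a hedged sketch of exposing sort's own parallelism. The file names and contents are illustrative only, not tdbloader2's actual intermediate files:

```shell
# Illustrative input standing in for an unsorted index-ordering file.
printf 'c\na\nb\n' > /tmp/index-input.txt

# --parallel caps the number of threads sort may use (GNU coreutils sort);
# -o writes the sorted result to a file.
sort --parallel=4 -o /tmp/index-sorted.txt /tmp/index-input.txt

cat /tmp/index-sorted.txt
```

On a large input, raising `--parallel` is what brings the additional cores into play; on a small input like this one it makes no observable difference.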

What rates are you getting?

I don't
know how normal that is, but for the sort step, my argument is not
that we could find universally better ergonomics, but that we could
bake some flexibility in for those who want to try adjustments on
their particular hardware, including the ability to try running
multiple sorts at one time.
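The "multiple sorts at one time" idea can be sketched in plain shell. The key fields and file names below are illustrative, not tdbloader2's actual data layout:

```shell
# Illustrative triples file; each line is a (subject predicate object) row.
printf 's2 p1 o3\ns1 p2 o1\n' > /tmp/triples.txt

# Sort two index orderings concurrently as background jobs.
sort -k1,1 /tmp/triples.txt -o /tmp/spo.txt &   # subject-major ordering
sort -k3,3 /tmp/triples.txt -o /tmp/osp.txt &   # object-major ordering

wait    # block until both sorts have finished
```

Whether this is a win depends on how much memory and I/O bandwidth each sort is already consuming, which is exactly the contention concern raised below.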

That sounds like an interesting experiment to carry out and, if successful, a change to make to the released code.


For the other step, I don't feel like I understand the index-packing
code well enough yet to form an opinion, which is one reason for the
question. It seems that it could run in parallel without difficulty,
but maybe I don't understand the relationships between the indexes
well enough.

Index packing is I/O-bound and sequential. There is little computation going on.

Doing two packings in parallel would break up the sequential write pattern, so there would need to be a noticeable gain in some other way to compensate for that impact.

Bus contention when it's an SSD may come into play. The quality/speed of the connection to the SSD is related to how much $$$ the server cost!

Another question then would be: maybe we could split the current
'index' phase into 'order' and 'pack' phases, again for those who
would like to try tuning each step for their situation?

Interesting possibility - needs trying out and bedding down before it goes into the standard release scripts IMO. What works well in one environment may not in another. Lots of options suit some people and not others.

    Andy

---
A. Soroka
The University of Virginia Library

On Oct 28, 2016, at 10:24 AM, Rob Vesse <[email protected]> wrote:

If memory serves, those are the phases that use POSIX sort, right?

Sort will try to do an in-memory sort as far as possible and fall back to a 
disk-based merge sort if not. Also, we usually configure sort to run in parallel.

If you try to process the different indexes in parallel you would create a lot of 
memory and disk contention, which would likely slow down overall performance.

For sufficiently large data sets there is also a risk of exhausting disk space 
during the sort phase, and building multiple indexes in parallel would only 
exacerbate this.
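One partial mitigation for the disk-space risk (a suggestion on my part, not something tdbloader2 necessarily does today): GNU sort's -T and -S flags let you steer where and how much the external merge spills. The paths and sizes here are illustrative:

```shell
# Put sort's temporary spill files on a volume with room to spare.
mkdir -p /tmp/sort-tmp
printf '2\n1\n3\n' > /tmp/nums.txt

# -T sets the temporary directory for spill files;
# -S bounds the in-memory buffer before sort starts spilling to disk.
sort -T /tmp/sort-tmp -S 64M -o /tmp/nums.sorted /tmp/nums.txt
```

Running several such sorts concurrently multiplies the spill space needed, which is the exacerbation being described.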

Rob

On 28/10/2016 14:33, "A. Soroka" <[email protected]> wrote:

   I'm still learning about tdbloader2 and have another question about the 
index phase: is there any reason why the processes for the various index 
orderings (SPO, GSPO, etc.) couldn't go on in parallel? Or am I missing some 
switch or setting that already allows that?

   ---
   A. Soroka
   The University of Virginia Library
