That's right-- if I understand it correctly, there are two steps-- POSIX sort 
to develop the index orderings, and then packing the actual index files.

For the POSIX sort step, it's certainly true that more parallelism than needed
would be a bad thing. With Andy's help I just made a commit that allows a
little more control using the common --parallel flag for sort. But the current
ergonomics seem suboptimal. E.g., when indexing a 300Mt dataset with the
current settings on a 24-core box with fast storage, I saw only one core in
full use and very little IO. Aliasing in some parallelism via the sort flag
brought several more cores into play and cut the time spent by two-thirds. I
don't know how typical that is, but for the sort step, my argument is not that
we could find universally better ergonomics, but that we could bake in some
flexibility for those who want to try adjustments on their particular
hardware, including the ability to try running multiple sorts at one time.
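
A minimal sketch of the kind of tuning described above, in Python rather than
in tdbloader2's own scripts. It assumes GNU sort, whose --parallel,
--buffer-size and --temporary-directory options are used here. The function
names, file names, and default values are hypothetical and are not taken from
tdbloader2.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_sort_command(src, dest, threads, buffer_size="1G", tmp_dir=None):
    """Build a GNU sort command line for one input file (assumes GNU sort)."""
    cmd = ["sort", f"--parallel={threads}", f"--buffer-size={buffer_size}"]
    if tmp_dir is not None:
        # Directory for the temporary files of sort's disk-based merge.
        cmd.append(f"--temporary-directory={tmp_dir}")
    cmd += ["-o", dest, src]
    return cmd


def run_sorts(jobs, threads_per_sort=4, concurrent_sorts=2, **opts):
    """Run one sort per (src, dest) pair, at most `concurrent_sorts` at a time.

    Total CPU demand is roughly threads_per_sort * concurrent_sorts, and each
    running sort may claim its own buffer and temporary disk space.
    """
    def run_one(job):
        src, dest = job
        cmd = build_sort_command(src, dest, threads_per_sort, **opts)
        subprocess.run(cmd, check=True)  # raises if sort exits non-zero
        return dest

    # Threads only wait on the external sort processes, so a thread pool
    # is enough to bound how many sorts run at once.
    with ThreadPoolExecutor(max_workers=concurrent_sorts) as pool:
        return list(pool.map(run_one, jobs))
```

For example, run_sorts([("spo.tmp", "spo.sorted"), ("pos.tmp", "pos.sorted")],
threads_per_sort=8, concurrent_sorts=2) would run two sorts side by side with
eight threads each. Whether that beats one sort at a time depends on the
memory and disk contention Rob describes below.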

For the other step, I don't feel like I understand the index-packing code well 
enough yet to form an opinion, which is one reason for the question. It seems 
that it could run in parallel without difficulty, but maybe I don't understand 
the relationships between the indexes well enough.

Another question, then: could we split the current 'index' phase into 'order'
and 'pack' phases, again for those who would like to try tuning each step for
their situation?

---
A. Soroka
The University of Virginia Library

> On Oct 28, 2016, at 10:24 AM, Rob Vesse <[email protected]> wrote:
> 
> If memory serves, those are the phases that use POSIX sort, right?
> 
> Sort will try to do an in-memory sort as far as possible and fall back to a
> disk-based merge sort if not. Also, we usually configure sort to run in
> parallel.
> 
> If you try to build different indexes in parallel, you would create a lot
> of memory and disk contention, which would likely slow down overall
> performance.
> 
> For sufficiently large data sets there is also a risk of exhausting disk
> space during the sort phase, and building multiple indexes in parallel would
> only exacerbate this.
> 
> Rob
> 
> On 28/10/2016 14:33, "A. Soroka" <[email protected]> wrote:
> 
>    I'm still learning about tdbloader2 and have another question about the 
> index phase: is there any reason why the processes for the various index 
> orderings (SPO, GSPO, etc.) couldn't go on in parallel? Or am I missing some 
> switch or setting that already allows that?
> 
>    ---
>    A. Soroka
>    The University of Virginia Library
> 