On Wed, Jun 20, 2001 at 03:01:00PM -0500, Abdulaziz Ghuloum wrote:
> Hello Ron and everybody on this list.
>
> In order to complete the task the fastest way, you need to first analyse
> your system and how it's going to react to your algorithm.
> Basically your task can be split into 2 parts, one IO intensive part (reading
> the files) and one CPU intensive part (processing the data). If you're going
> to only parse certain information from the logs, then the time it takes to
> process the data is much shorter than the time it takes to read it. However,
> if you need to do more complicated stuff (finding cross-references in files,
> translating IPs to hostnames, ...) then that time will dominate IO times.
>
> Now, say that IO takes longer than processing. You suggested splitting the
> list of all 1500 files so that a separate process would process a chunk of the
> data. Would this speed up your program?
I think so. Since disk IO is almost always slower than CPU processing, it
would help if the CPU could be fed something to do while it waits for the
data, and having multiple threads/processes increases the chances of that
happening. This assumes the CPU is not already overloaded; in that case,
having all those extra processes/threads might turn out to be detrimental.
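A minimal sketch of the idea, using threads (processes would be similar). The parsing step here is a hypothetical placeholder -- substitute whatever extraction the real logs need:

```python
from concurrent.futures import ThreadPoolExecutor

def parse_line(line):
    # Hypothetical "processing": grab the first field of each line.
    return line.split(" ", 1)[0]

def process_file(path):
    # Each worker reads and parses one file; while one worker is
    # blocked on disk IO, another can be parsing in the meantime.
    with open(path) as f:
        return [parse_line(line) for line in f]

def process_logs(paths, workers=4):
    # Keep the pool small -- too many workers just adds overhead
    # once the CPU (or the disk) is already saturated.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_file, paths))
```

Whether 2, 4, or more workers is best depends on the machine, so it is worth trying a few values.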
> My computer has an IDE disk and I bet
> that reading all 1500 files sequentially would be far faster than reading files
> in parallel because the disk has to do extra turns to switch from position to
> position while reading the data in sequence would eliminate the extra turns.
> This is an oversimplification of what happens but you get the idea. And this
> may not be the case on your computer. So, you need to test the fastest way you
> can read the files. (If my files are distributed across 2 hard disks, I would
> fire 2 processes to manage each disk).
Uh? I am more or less a novice, but I don't think it works that way. Most
multi-disk machines are managed by RAID (hardware or software), if you are
using RAID, or by some sort of volume manager otherwise. The end result is
that you don't get to manage anything: the abstraction that encapsulates the
multiple disks makes them look like a single device to you, manages the
disks, and optimizes the data layout. (Anyway, do not flame me, OO gurus, if
I have not used those OO terms in the right way :-].)
So when there are multiple disks, the filesystem/driver should lay out the
data so that it is spread across the disks, increasing the parallelism of
the reads. You would then get the advantage of increased speed whether you
have multiple processes or not.
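Either way, the fastest layout is easy to test empirically rather than argue about. A rough sketch: time a strictly sequential pass over the files against a threaded one (the file list here is a placeholder -- point `paths` at the real logs):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def read_all_sequential(paths):
    # Read every file one after another, returning total bytes read.
    total = 0
    for p in paths:
        with open(p, "rb") as f:
            total += len(f.read())
    return total

def read_all_parallel(paths, workers=4):
    # Read the files concurrently from a small thread pool.
    def read_one(p):
        with open(p, "rb") as f:
            return len(f.read())
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(read_one, paths))

def compare(paths):
    # Time both strategies on the same file set.
    t0 = time.perf_counter()
    seq = read_all_sequential(paths)
    t1 = time.perf_counter()
    par = read_all_parallel(paths)
    t2 = time.perf_counter()
    assert seq == par  # same bytes either way
    print("sequential: %.3fs  parallel: %.3fs" % (t1 - t0, t2 - t1))
```

On a single IDE disk the sequential pass may well win, as suggested above; on striped disks the parallel one might. Only a measurement on the actual machine settles it.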
/kk