Parallelizing the processing isn't a bad idea, though in this case I
share your concern about hammering the NFS server. Are you making a
copy of the data to process? If your processing threads work from a
"local" copy of the data, you needn't worry as much about the network.
And if you just want more processing threads, you don't need to spin up
more instances to parallelize things. One big instance should be easier
to manage than multiple instances, and if needed your project could get
access to higher CPU and memory limits.
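
To make the "local copy" suggestion concrete, here's a rough sketch of
the pattern I have in mind: stage each dump chunk from NFS onto
instance-local disk once, then let worker processes on the same
instance do the parallel part. The paths, the glob, and parse_chunk()
below are placeholders rather than anything from your actual job:

#!/usr/bin/env python3
"""Sketch: stage dump chunks from NFS to local disk, then process them
with local worker processes. Paths and parse_chunk() are placeholders."""

import bz2
import multiprocessing
import shutil
from pathlib import Path

NFS_DUMPS = Path("/public/dumps/public/enwiki/20210301")  # assumed layout
LOCAL_SCRATCH = Path("/tmp/dumps")                         # instance-local disk


def stage_locally(chunk: Path) -> Path:
    """Copy a dump chunk off NFS once, so workers read local disk instead."""
    LOCAL_SCRATCH.mkdir(parents=True, exist_ok=True)
    local = LOCAL_SCRATCH / chunk.name
    if not local.exists():
        shutil.copy(chunk, local)
    return local


def parse_chunk(local_chunk: Path) -> int:
    """Placeholder for the real per-file work; here it just counts <page> tags."""
    pages = 0
    with bz2.open(local_chunk, "rt", encoding="utf-8") as fh:
        for line in fh:
            if "<page>" in line:
                pages += 1
    return pages


if __name__ == "__main__":
    chunks = sorted(
        NFS_DUMPS.glob("enwiki-20210301-pages-articles[1-9]*.xml*.bz2"))
    local_chunks = [stage_locally(c) for c in chunks]  # one reader on NFS
    with multiprocessing.Pool() as pool:               # parallelism stays local
        totals = pool.map(parse_chunk, local_chunks)
    print(f"{sum(totals)} pages across {len(local_chunks)} files")

One caveat: staging everything up front needs roughly as much local
disk as the dump itself (~18 GB here), so staging and processing one
chunk at a time may be the more practical variant.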

On Sun, Apr 18, 2021 at 9:07 PM Roy Smith <[email protected]> wrote:

> I'm exploring various ways of working with the XML data dumps on
> /public/dumps/public/enwiki.  I've got a process which runs through all of
> the enwiki-20210301-pages-articles[123456789]*.xml* files in about 6
> hours.  If I've done the math right, that's just about 18 GB of data, or 3
> GB/h, or roughly 0.8 MB/s that I'm slurping off NFS.
>
> If I were to spin up 8 VPS nodes and run 8 jobs in parallel, in theory I
> could process about 6.7 MB/s (53 Mb/s).  Is that realistic?  Or am I just going
> to beat the hell out of the poor NFS server, or peg some backbone network
> link, or hit some other rate limiting bottleneck long before I run out of
> CPU?  Hitting a bottleneck doesn't bother me so much as not wanting to
> trash a shared resource by doing something stupid to it.
>
> Putting it another way, would trying this be a bad idea?
>
>
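
As a back-of-the-envelope check on the quoted figures (taking the
18 GB over 6 hours at face value, and using decimal units):

# Quick sanity check on the quoted numbers (1 MB = 10**6 bytes).
total_gb = 18   # compressed pages-articles chunks
hours = 6       # single-job wall-clock time
jobs = 8        # hypothetical parallel jobs

per_job_mb_s = total_gb * 1000 / (hours * 3600)  # ~0.83 MB/s per job
aggregate_mb_s = per_job_mb_s * jobs             # ~6.7 MB/s for 8 jobs
aggregate_mbit_s = aggregate_mb_s * 8            # ~53 Mb/s on the wire

print(f"per job: {per_job_mb_s:.2f} MB/s")
print(f"8 jobs:  {aggregate_mb_s:.1f} MB/s (~{aggregate_mbit_s:.0f} Mb/s)")

So even eight-way the aggregate read rate is fairly modest, but the
dumps share is still a shared resource, which is another reason the
stage-a-local-copy pattern is the friendlier one.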


-- 
Nicholas Skaggs
Engineering Manager, Cloud Services
Wikimedia Foundation <https://wikimediafoundation.org/>
_______________________________________________
Wikimedia Cloud Services mailing list
[email protected] (formerly [email protected])
https://lists.wikimedia.org/mailman/listinfo/cloud
