Hi everyone. I work as a data scientist in the research computing group at Purdue University. Mostly I help research faculty make use of Purdue's supercomputing clusters: scientific software development, consulting on data analysis and data management, holding workshops, etc. I have started suggesting Apache Arrow in a lot of my conversations.
I have had great success using the Plasma store as a way of holding "reference data" so that many workers can access it without any duplication (persistent worker, many tasks) or out-of-core computing (simple workers, lazy loading). This benefits many scenarios, e.g., rendering frames in a POV-based visualization (many frames, single dataset).

Towards the end of June I'll be running a benchmark on one of our clusters (Brown - https://www.rcac.purdue.edu/compute/brown) to test performance when scaling IPyParallel past the 500-node mark (20,000 workers). One of the tasks is to load a dataset from disk and then do something trivial, like compute summary statistics. This stresses both the file system (Lustre) and the network (IB fabric). The comparison was to have every node run its own Plasma store, pre-load the dataset into it, and have the workers access it via client connections to the store.

Preliminary tests (up to 48 nodes) showed that with a 2 GB dataset taking ~20 seconds to process (almost entirely loading/parsing), computing the statistics on a data structure accessed via Plasma took < 1 s - validating that the compute time is merely that of the summary statistics on an already loaded dataset. This was sustained all the way up to 24 cores per node (meaning 24 client connections with attempted simultaneous access).

If there is interest, I would be thrilled to write up the results of this test as a blog post for the website - detailing the workflow, where this approach can be applied in different research domains, and performance comparisons.

Cheers,
Geoff

--
Geoffrey Lentner <glent...@purdue.edu>
Data Scientist, ITaP Research Computing, Purdue University.
@PurdueRCAC @GeoffreyLentner
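P.S. For anyone unfamiliar with the pattern: Plasma exposes a put/get interface over objects in shared memory (via `pyarrow.plasma`, with workers connecting to a store process through a socket). Since that requires a running store, here is an analogous, self-contained sketch of the "load once, read from many clients without copying" idea using only Python's standard-library shared memory. All names and sizes here are illustrative, not from the actual benchmark code.

```python
# Sketch of the "parse once, share with many workers" pattern.
# Plasma plays the role of the shared segment here; this uses
# multiprocessing.shared_memory so the example runs stand-alone.
import array
from multiprocessing import shared_memory

# "Loader" role: parse the dataset once and place it in shared memory.
# (Stand-in data: 100,000 doubles instead of a 2 GB file.)
data = array.array("d", range(100_000))
nbytes = data.itemsize * len(data)
shm = shared_memory.SharedMemory(create=True, size=nbytes)
shm.buf[:nbytes] = data.tobytes()

# "Worker" role: attach to the segment by name. No bytes are copied;
# the worker gets a zero-copy view of the same pages.
worker = shared_memory.SharedMemory(name=shm.name)
values = memoryview(worker.buf)[:nbytes].cast("d")

# The trivial task from the benchmark: a summary statistic.
mean = sum(values) / len(values)

# Release views before closing, then tear down the segment.
values.release()
worker.close()
shm.close()
shm.unlink()
```

With Plasma the shape is the same, except the loader does a `client.put(...)` and each worker does a `client.get(object_id)` against the node-local store, so all 24 cores on a node read the one resident copy.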