hi Geoff,

It's great to hear about your benchmarking results. If you'd like to
submit a pull request to the project with a blog post to showcase your
results please be our guest (the blog content is all under the site/
directory, let us know if you have any issues). The only limitation on
the Apache Arrow blog is that we must maintain independence and not
endorse (or otherwise make subjective comments about) any third party
companies, products, or services.

best,
Wes

On Wed, Jun 12, 2019 at 12:14 PM Lentner, Geoffrey R
<glent...@purdue.edu> wrote:
>
> Hi everyone.
>
> I work as a data scientist in the research computing group at Purdue 
> University. Mostly I help facilitate the use of Purdue’s supercomputing 
> clusters by research faculty by helping with scientific software development, 
> consulting on data analysis and data management, holding workshops, etc. I 
> have started suggesting Apache Arrow in a lot of my conversations.
>
> I have had great success using the Plasma store as a way of holding 
> “reference data” and having many workers access it without the need to do any 
> duplication (persistent worker - many tasks) or out-of-core computing (simple 
> workers - lazy loading). This benefits many scenarios (e.g., rendering frames 
> in a POV based visualization; many frames - single data).
>
> Towards the end of June I’ll be running a benchmark on one of our clusters 
> (Brown - https://www.rcac.purdue.edu/compute/brown) to test performance in 
> scaling IPyParallel past the 500+ node mark (20,000 workers). One of the 
> tasks is to load a dataset from disk (and then do something trivial like 
> compute summary statistics). This stresses both the file system (Lustre) and 
> the network (IB fabric). The comparison was to have every node running its 
> own Plasma store and pre-load the dataset and have the workers access it via 
> client connections to the store.
>
> Preliminary tests (up to 48 nodes) showed that with a 2 GB dataset taking ~20 
> seconds (almost entirely loading/parsing) doing the statistics on a data 
> structure accessed via Plasma took < 1s (validating that the compute time is 
> merely that of the summary statistics on an already loaded dataset). And that 
> this was sustained clear up to the 24 cores per node (meaning 24 client 
> connections and attempted simultaneous access).
>
> If there is interest, I would be thrilled to compose the results of this test 
> into a blog post for the website - detailing the workflow, where this 
> approach can be applied in different research domains, and performance 
> comparisons.
>
>
> Cheers.
> Geoff
>
> --
> Geoffrey Lentner <glent...@purdue.edu<mailto:glent...@purdue.edu>>
> Data Scientist. ITaP Research Computing. Purdue University.
> @PurdueRCAC @GeoffreyLentner

Reply via email to