It seems to me that neither IBM nor Intel has done a good job with the 
marketing and pre-sales of their Hadoop connectors.

As my site hosts both GPFS and Lustre, I've been paying attention to this.  
Soon enough I'll need some Hadoop, and I've been rather interested in who tells 
a convincing story.  With IBM it's been like pulling teeth, so far, to get FPO 
info (other than pricing).  Intel has only been slightly better with EE.

It was better with Panache, aka AFM, and there are now quite a few external 
folks doing all kinds of interesting things with it, from standard caching to 
trying local-only burst buffers.  I'm hopeful that we'll start to see the same 
with FPO and EE soon.

I'll be very interested to hear more in this vein.

Ed
OSC

----- Reply message -----
From: "Laurence Alexander Hurst" <[email protected]>
To: "[email protected]" <[email protected]>
Subject: [gpfsug-discuss] GPFS GPFS-FPO and Hadoop (specifically MapReduce in 
the first instance)
Date: Wed, Jul 16, 2014 10:21 AM

Dear GPFSUG,

I've been looking into the possibility of using GPFS with Hadoop, especially as 
we already have experience with a traditional SAN-based GPFS cluster for our 
HPC provision (which is on the same network fabric, so integration should be 
possible and would be desirable).

The proof-of-concept Hadoop cluster I've set up exposes both HDFS and our 
current GPFS file system (to allow users to import/export their data between 
HDFS and the shared filestore).  HDFS is a pain to get data in and out of, and 
it also precludes us from using many deployment tools that mass-update nodes by 
reimaging and/or reinstalling (I know this would also be a problem with 
GPFS-FPO).
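
To make that import/export pain concrete, the staging step users currently have 
to do looks roughly like the sketch below, using Hadoop's FileSystem API (the 
same could equally be done with 'hdfs dfs -put' or distcp).  The mount point, 
namenode address and paths are invented for illustration:

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    // Rough sketch of the manual import step: copy a dataset from the POSIX
    // GPFS mount into HDFS before a Hadoop job can use it.
    public class GpfsToHdfsCopy {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Source: local (POSIX) view of the SAN-based GPFS filesystem.
            FileSystem gpfs = FileSystem.get(URI.create("file:///gpfs/projects"), conf);
            // Destination: the proof-of-concept HDFS instance (address made up).
            FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

            // Recursive copy; 'false' means do not delete the source afterwards.
            FileUtil.copy(gpfs,
                          new Path("file:///gpfs/projects/bigdataset"),
                          hdfs,
                          new Path("/user/someuser/bigdataset"),
                          false,
                          conf);
        }
    }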

It appears that the GPFS-FPO product is intended to provide HDFS's performance 
benefits for highly distributed, data-intensive workloads with the same ease of 
use as a traditional GPFS filesystem.  One of the things I'm wondering is: can 
we link this with our existing GPFS cluster sanely?  That would avoid having to 
add filesystem gateway servers for our users to import/export their data from 
outside the system, and would allow, as seamlessly as possible, a clear 
workflow from generating large datasets on the HPC facility to analysing them 
(e.g. with a MapReduce job) on the Hadoop facility.
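
As a sketch of what I'd like that workflow to look like: a plain MapReduce job 
whose input and output sit directly on the GPFS mount via file:// URIs, so 
nothing needs staging through HDFS.  This assumes GPFS is mounted at the same 
path on every Hadoop node and ignores data locality (which is presumably where 
FPO would come in); the class names and paths below are made up:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class GpfsWordCount {

        // Standard word-count mapper; nothing GPFS-specific here.
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Standard sum reducer (also used as the combiner).
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "gpfs-wordcount");
            job.setJarByClass(GpfsWordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // The interesting part: input and output live on the POSIX GPFS
            // mount (file:// URIs), so nothing is staged into or out of HDFS.
            // Paths are invented for illustration.
            FileInputFormat.addInputPath(job,
                    new Path("file:///gpfs/projects/run42/output"));
            FileOutputFormat.setOutputPath(job,
                    new Path("file:///gpfs/projects/run42/analysis"));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }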

Looking at FPO, it appears to require being set up as a separate 
'shared-nothing' cluster, with additional FPO and (at least 3) server licensing 
costs attached.  Presumably we could then use AFM to ingest (/copy/sync) data 
from a Hadoop-specific fileset on our existing GPFS cluster to the FPO cluster, 
removing the requirement for additional gateways/heads for user (data) access?  
At least, based on what I've read so far, this would be the way we would have 
to do it, but it seems convoluted and not ideal.

Or am I completely barking up the wrong tree with FPO?

Has anyone else run Hadoop alongside, or on top of, an existing SAN-based GPFS 
cluster (and wanted to use data stored on that cluster)?  Any tips, if you 
have?  How does it (traditional GPFS or GPFS-FPO) compare to HDFS, especially 
as regards performance (I know IBM has produced lots of pretty graphs showing 
how much more performant GPFS-FPO is than HDFS for particular use cases)?

Many thanks,

Laurence
--
Laurence Hurst, IT Services, University of Birmingham, Edgbaston, B15 2TT
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss