Re: Hadoop-on-demand and torque

2012-05-21 Thread Charles Earl
Ralph,
Do you have any YARN or Mesos performance comparison against HOD? I suppose 
since it was a customer requirement you might not have explored it. MPI support 
seems to be an active issue for Mesos now.
Charles

On May 21, 2012, at 10:36 AM, Ralph Castain r...@open-mpi.org wrote:

 Not quite yet, though we are working on it (some descriptive stuff is around, 
 but needs to be consolidated). Several of us started working together a 
 couple of months ago to support the MapReduce programming model on HPC 
 clusters using Open MPI as the platform. In working with our customers and 
 OMPI's wide community of users, we found that people were interested in this 
 capability, wanted to integrate MPI support into their MapReduce jobs, and 
 didn't want to migrate their clusters to YARN for various reasons.
 
 We have released initial versions of two new tools in the OMPI developer's 
 trunk, scheduled for inclusion in the upcoming 1.7.0 release:
 
 1. mr+ - executes the MapReduce programming paradigm. Currently, we only 
 support streaming data, though we will extend that support shortly. All HPC 
 environments (rsh, SLURM, Torque, Alps, LSF, Windows, etc.) are supported. 
 Both mappers and reducers can utilize MPI (independently or in combination) 
 if they so choose. Mappers and reducers can be written in any of the typical 
 HPC languages (C, C++, and Fortran) as well as Java (note: OMPI now comes 
 with Java MPI bindings).
 
 2. hdfsalloc - takes a list of files and obtains a resource allocation for 
 the nodes upon which those files reside. SLURM and Moab/Maui are currently 
 supported, with Gridengine coming soon.
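 
 As an aside, for readers wondering how a tool like hdfsalloc can know which nodes to ask 
 for: the usual trick is to query the namenode for the block locations of each file and hand 
 that host list to the batch scheduler. The sketch below is only an illustration of that 
 lookup using the libhdfs C API (hdfs.h); it is not the hdfsalloc source, and the SLURM or 
 Moab/Maui side of the handshake is omitted.
 
```cpp
// Illustration only: print, for each file given on the command line, the
// datanodes that hold replicas of its blocks -- the host set a wrapper like
// hdfsalloc could pass to SLURM or Moab/Maui when requesting an allocation.
#include <hdfs.h>    // libhdfs C API, shipped with Hadoop
#include <cstdio>

int main(int argc, char *argv[]) {
  hdfsFS fs = hdfsConnect("default", 0);      // use the configured default namenode
  if (fs == NULL) {
    fprintf(stderr, "could not connect to HDFS\n");
    return 1;
  }
  for (int i = 1; i < argc; ++i) {
    hdfsFileInfo *info = hdfsGetPathInfo(fs, argv[i]);
    if (info == NULL) continue;               // skip paths that do not exist
    // One entry per block; each entry is a NULL-terminated list of replica hosts.
    char ***hosts = hdfsGetHosts(fs, argv[i], 0, info->mSize);
    if (hosts != NULL) {
      for (int b = 0; hosts[b] != NULL; ++b)
        for (int r = 0; hosts[b][r] != NULL; ++r)
          printf("%s\t%s\n", argv[i], hosts[b][r]);
      hdfsFreeHosts(hosts);
    }
    hdfsFreeFileInfo(info, 1);
  }
  hdfsDisconnect(fs);
  return 0;
}
```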
 
 There will be a public announcement of this in the near future, and we expect 
 to integrate the Hadoop 1.0 and Hadoop 2.0 MR classes over the next couple of 
 months. By the end of this summer, we should have a full-featured public 
 release.
 
 
 On May 20, 2012, at 2:10 PM, Brian Bockelman wrote:
 
 Hi Ralph,
 
 I admit - I've only been half-following the OpenMPI progress.  Do you have a 
 technical write-up of what has been done?
 
 Thanks,
 
 Brian
 
 On May 20, 2012, at 9:31 AM, Ralph Castain wrote:
 
 FWIW: Open MPI now has an initial cut at MR+ that runs map-reduce under 
 any HPC environment. We don't have the Java integration yet to support the 
 Hadoop MR class, but you can write a mapper/reducer and execute that 
 programming paradigm. We plan to integrate the Hadoop MR class soon.
 
 If you already have that integration, we'd love to help port it over. We 
 already have the MPI support completed, so any mapper/reducer could use it.
 
 
 On May 20, 2012, at 7:12 AM, Pierre Antoine DuBoDeNa wrote:
 
 We run similar infrastructure in a university project. We plan to install
 Hadoop, and we are looking for Hadoop-based alternatives in case pure
 Hadoop does not work as expected.
 
 Keep us updated on the code release.
 
 Best,
 PA
 
 2012/5/20 Stijn De Weirdt stijn.dewei...@ugent.be
 
 hi all,
 
 I'm part of an HPC group at a university. We have some users who are
 interested in Hadoop and want to see if it can be useful in their research, and we
 also have researchers who are already using Hadoop on their own
 infrastructure, but that is not enough reason for us to start with
 dedicated Hadoop infrastructure (we are now only running Torque-based
 clusters with and without shared storage; setting up and properly
 maintaining a Hadoop infrastructure requires quite some understanding of new
 software).
 
 To support these needs we wanted to do just that: use our current
 HPC infrastructure to create private Hadoop clusters so people can do some
 work. If we attract enough interest, we will probably set up dedicated
 infrastructure, but by then we (the admins) will also have a better
 understanding of what is required.
 
 So we looked at HOD for testing/running Hadoop on existing
 infrastructure (we never really looked at myHadoop, though),
 but (IMHO) the current HOD code base is not in a good state. We did
 some work to get it working and added some features, only to come to the
 conclusion that it was not sufficient (and not maintainable).
 
 So we wrote something from scratch with the same functionality as HOD, and
 much more (e.g. HBase is now possible, with or without MR1; some default
 tuning; it is easy to add support for YARN instead of MR1).
 It has some support for Torque, but my laptop is also sufficient (the
 Torque support is a wrapper that submits the job).
 We gave a workshop on Hadoop using it (25 people, each with their own
 5-node Hadoop cluster) and it went rather well.
 
 It's not in a public repo yet, but we could do that. If you're interested, let me
 know, and I'll see what can be done. (Releasing the code is on our to-do list,
 but if there is some demand, we can do it sooner.)
 
 
 stijn
 
 
 
 On 05/18/2012 05:07 PM, Pierre Antoine DuBoDeNa wrote:
 
 I am also interested to learn about myHadoop as I use a shared storage
 system and everything runs on VMs and not 

Re: Text Analysis

2012-04-25 Thread Charles Earl
If you've got existing R code, you might want to look at this Quora posting, 
http://www.quora.com/How-can-R-and-Hadoop-be-used-together, also by Cloudera, or the 
RHIPE R Hadoop package, https://github.com/saptarshiguha/RHIPE/wiki.
Mahout and Lucene/Solr offer some level of text analysis, although I would not 
call these complete text analysis packages.
What I've found are specific algorithms as opposed to a complete package: for 
example, LDA for topic discovery -- Mahout and Yahoo Research 
(https://github.com/shravanmn/Yahoo_LDA) have Hadoop-based implementations -- 
in the case of Yahoo_LDA the data is stored in HDFS, while the computation is 
essentially MPI based. Whether the algorithm reads data from the HDFS store but 
uses an approach other than map-reduce is another question.
C

On Apr 25, 2012, at 12:47 PM, Jagat wrote:

 There are APIs which you can use; of course, they are third party.
 
 ---
 Sent from Mobile, short and crisp.
 On 25-Apr-2012 8:57 PM, Robert Evans ev...@yahoo-inc.com wrote:
 
 Hadoop itself is the core Map/Reduce and HDFS functionality.  The higher
 level algorithms like sentiment analysis are often done by others.
 Cloudera has a video from HadoopWorld 2010 about it
 
 
 http://www.cloudera.com/resource/hw10_video_sentiment_analysis_powered_by_hadoop/
 
 And there are likely to be other tools like R that can help you out with
 it.  I am not really sure if Mahout offers sentiment analysis or not, but
 you might want to look there too: http://mahout.apache.org/
 
 --Bobby Evans
 
 
 On 4/25/12 7:50 AM, karanveer.si...@barclays.com 
 karanveer.si...@barclays.com wrote:
 
 Hi,
 
 I wanted to know if there are any existing APIs within Hadoop for us to
 do some text analysis, like sentiment analysis, etc., or whether we should rely on
 tools like R for this.
 
 
 Regards,
 Karanveer
 
 
 
 
 
 
 



Re: Hadoop streaming or pipes ..

2012-04-05 Thread Charles Earl
Also bear in mind that there is a kind of detour involved, in the sense that a 
pipes map must send its key/value data back to the Java process before it reaches the 
reduce (more or less). 
I think that the Hadoop C Extension (HCE, there is a patch) is supposed to be 
faster. 
Would be interested to know if the community has any experience with HCE 
performance.
C

On Apr 5, 2012, at 3:49 PM, Robert Evans ev...@yahoo-inc.com wrote:

 Both streaming and pipes do very similar things.  They will fork/exec a 
 separate process that runs whatever you want it to run.  The JVM that 
 is running Hadoop then communicates with this process to send the data over 
 and get the processing results back.  The difference between streaming and 
 pipes is that streaming uses stdin/stdout for this communication, so 
 preexisting tools like grep, sed and awk can be used here.  Pipes uses a 
 custom protocol with a C++ library to communicate.  The C++ library is SWIG 
 compatible so that it can be wrapped to provide APIs in other 
 languages like Python or Perl.
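 
 For concreteness, a pipes task written against that C++ library looks roughly like the 
 word-count example that ships with Hadoop. The sketch below follows the Pipes.hh and 
 StringUtils.hh APIs as I understand them and is meant only as an illustration, not as the 
 canonical source.
 
```cpp
#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"
#include "hadoop/StringUtils.hh"

#include <string>
#include <vector>

// Word-count mapper: called once per input record; the value is the line of text.
class WordCountMapper : public HadoopPipes::Mapper {
public:
  WordCountMapper(HadoopPipes::TaskContext& context) {}
  void map(HadoopPipes::MapContext& context) {
    std::vector<std::string> words =
        HadoopUtils::splitString(context.getInputValue(), " ");
    for (size_t i = 0; i < words.size(); ++i)
      context.emit(words[i], "1");
  }
};

// Word-count reducer: sums the "1"s emitted for each word.
class WordCountReducer : public HadoopPipes::Reducer {
public:
  WordCountReducer(HadoopPipes::TaskContext& context) {}
  void reduce(HadoopPipes::ReduceContext& context) {
    int sum = 0;
    while (context.nextValue())
      sum += HadoopUtils::toInt(context.getInputValue());
    context.emit(context.getInputKey(), HadoopUtils::toString(sum));
  }
};

int main() {
  // The pipes runtime drives the task and speaks the wire protocol to the JVM.
  return HadoopPipes::runTask(
      HadoopPipes::TemplateFactory<WordCountMapper, WordCountReducer>());
}
```
 
 The binary is compiled against the pipes/utils libraries and launched with something like 
 "hadoop pipes -input ... -output ... -program <path to binary in HDFS>".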
 
 I am not sure what the performance difference is between the two, but in my 
 own work I have seen a significant performance penalty from using either of 
 them, because there is a somewhat large overhead of sending all of the data 
 out to a separate process just to read it back in again.
 
 --Bobby Evans
 
 
 On 4/5/12 1:54 PM, Mark question markq2...@gmail.com wrote:
 
 Hi guys,
  quick question:
   Are there any performance gains from Hadoop streaming or pipes over
 Java? From what I've read, it's only to ease testing by using your favorite
 language. So I guess it is eventually translated to bytecode and then executed.
 Is that true?
 
 Thank you,
 Mark
 


Re: Streaming Hadoop using C

2012-03-01 Thread Charles Earl
How was your experience with Starfish?
C
On Mar 1, 2012, at 12:35 AM, Mark question wrote:

 Thank you for your time and suggestions. I've already tried Starfish, but
 not jmap. I'll check it out.
 Thanks again,
 Mark
 
 On Wed, Feb 29, 2012 at 1:17 PM, Charles Earl charles.ce...@gmail.comwrote:
 
 I assume you have also just tried running locally and using the JDK
 performance tools (e.g. jmap) to gain insight, by configuring Hadoop to run
 the absolute minimum number of tasks?
 Perhaps the discussion
 
 http://grokbase.com/t/hadoop/common-user/11ahm67z47/how-do-i-connect-java-visual-vm-to-a-remote-task
 might be relevant?
 On Feb 29, 2012, at 3:53 PM, Mark question wrote:
 
 I've used Hadoop profiling (.prof) to show the stack trace, but it was hard
 to follow. I used jConsole locally, since I couldn't find a way to set a port
 number for the child processes when running them remotely. Linux commands
 (top, /proc) showed me that the virtual memory is almost twice my
 physical memory, which means swapping is happening, which is what I'm trying to
 avoid.
 
 So basically, is there a way to assign a port to child processes to
 monitor
 them remotely (asked before by Xun) or would you recommend another
 monitoring tool?
 
 Thank you,
 Mark
 
 
 On Wed, Feb 29, 2012 at 11:35 AM, Charles Earl charles.ce...@gmail.com
 wrote:
 
 Mark,
 So if I understand, it is more the memory management that you are
 interested in, rather than a need to run an existing C or C++
 application
 in MapReduce platform?
 Have you done profiling of the application?
 C
 On Feb 29, 2012, at 2:19 PM, Mark question wrote:
 
 Thanks Charles .. I'm running Hadoop for research to perform duplicate
 detection methods. To go deeper, I need to understand what's slowing my
 program, which usually starts with analyzing memory to predict best
 input
 size for map task. So you're saying piping can help me control memory
 even
 though it's running on VM eventually?
 
 Thanks,
 Mark
 
 On Wed, Feb 29, 2012 at 11:03 AM, Charles Earl 
 charles.ce...@gmail.com
 wrote:
 
 Mark,
 Both streaming and pipes allow this, perhaps more so pipes at the
 level
 of
 the mapreduce task. Can you provide more details on the application?
 On Feb 29, 2012, at 1:56 PM, Mark question wrote:
 
 Hi guys, thought I should ask this before I use it ... will using C
 over
 Hadoop give me the usual C memory management? For example, malloc() ,
 sizeof() ? My guess is no since this all will eventually be turned
 into
 bytecode, but I need more control on memory which obviously is hard
 for
 me
 to do with Java.
 
 Let me know of any advantages you know about streaming in C over
 hadoop.
 Thank you,
 Mark
 
 
 
 
 
 



Re: Streaming Hadoop using C

2012-02-29 Thread Charles Earl
Mark,
Both streaming and pipes allow this, perhaps more so pipes at the level of the 
mapreduce task. Can you provide more details on the application?
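 
 To make that concrete: a streaming mapper is just a native executable that reads records on 
 stdin and writes tab-separated key/value pairs to stdout, so malloc(), free() and sizeof() 
 behave exactly as in any ordinary C program. A minimal sketch follows; it is purely 
 illustrative and not tied to any particular job.
 
```cpp
// Minimal streaming word-count mapper. Because this runs as a separate native
// process, memory management is plain C: malloc/free, no JVM involved.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main() {
  const int cap = 1 << 16;
  char *line = (char *)malloc(cap);            // ordinary malloc(), as in any C program
  if (line == NULL) return 1;
  while (fgets(line, cap, stdin) != NULL) {    // Hadoop streaming feeds records on stdin
    for (char *tok = strtok(line, " \t\r\n"); tok != NULL; tok = strtok(NULL, " \t\r\n"))
      printf("%s\t1\n", tok);                  // emit key<TAB>value on stdout
  }
  free(line);
  return 0;
}
```
 
 The compiled binary is shipped with the job and named through the streaming jar's 
 -mapper/-file options; the same idea applies to the reducer.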
On Feb 29, 2012, at 1:56 PM, Mark question wrote:

 Hi guys, thought I should ask this before I use it ... will using C over
 Hadoop give me the usual C memory management? For example, malloc(),
 sizeof()? My guess is no, since this will all eventually be turned into
 bytecode, but I need more control over memory, which obviously is hard for me
 to do with Java.
 
 Let me know of any advantages you know about streaming in C over hadoop.
 Thank you,
 Mark



Re: Streaming Hadoop using C

2012-02-29 Thread Charles Earl
Mark,
So if I understand, it is more the memory management that you are interested 
in, rather than a need to run an existing C or C++ application in MapReduce 
platform?
Have you done profiling of the application?
C
On Feb 29, 2012, at 2:19 PM, Mark question wrote:

 Thanks Charles .. I'm running Hadoop for research, to perform duplicate
 detection methods. To go deeper, I need to understand what's slowing my
 program, which usually starts with analyzing memory to predict the best input
 size for a map task. So you're saying piping can help me control memory even
 though it's eventually running on a VM?
 
 Thanks,
 Mark
 
 On Wed, Feb 29, 2012 at 11:03 AM, Charles Earl charles.ce...@gmail.comwrote:
 
 Mark,
 Both streaming and pipes allow this, perhaps more so pipes at the level of
 the mapreduce task. Can you provide more details on the application?
 On Feb 29, 2012, at 1:56 PM, Mark question wrote:
 
 Hi guys, thought I should ask this before I use it ... will using C over
 Hadoop give me the usual C memory management? For example, malloc() ,
 sizeof() ? My guess is no since this all will eventually be turned into
 bytecode, but I need more control on memory which obviously is hard for
 me
 to do with Java.
 
 Let me know of any advantages you know about streaming in C over hadoop.
 Thank you,
 Mark
 
 



Re: Streaming Hadoop using C

2012-02-29 Thread Charles Earl
The documentation on Starfish, http://www.cs.duke.edu/starfish/index.html,
looks promising, though I have not used it. I wonder if others on the list have found 
it more useful than setting mapred.task.profile.
C
On Feb 29, 2012, at 3:53 PM, Mark question wrote:

 I've used Hadoop profiling (.prof) to show the stack trace, but it was hard
 to follow. I used jConsole locally, since I couldn't find a way to set a port
 number for the child processes when running them remotely. Linux commands
 (top, /proc) showed me that the virtual memory is almost twice my
 physical memory, which means swapping is happening, which is what I'm trying to
 avoid.
 
 So basically, is there a way to assign a port to child processes to monitor
 them remotely (asked before by Xun) or would you recommend another
 monitoring tool?
 
 Thank you,
 Mark
 
 
 On Wed, Feb 29, 2012 at 11:35 AM, Charles Earl charles.ce...@gmail.comwrote:
 
 Mark,
 So if I understand, it is more the memory management that you are
 interested in, rather than a need to run an existing C or C++ application
 in MapReduce platform?
 Have you done profiling of the application?
 C
 On Feb 29, 2012, at 2:19 PM, Mark question wrote:
 
 Thanks Charles .. I'm running Hadoop for research to perform duplicate
 detection methods. To go deeper, I need to understand what's slowing my
 program, which usually starts with analyzing memory to predict best input
 size for map task. So you're saying piping can help me control memory
 even
 though it's running on VM eventually?
 
 Thanks,
 Mark
 
 On Wed, Feb 29, 2012 at 11:03 AM, Charles Earl charles.ce...@gmail.com
 wrote:
 
 Mark,
 Both streaming and pipes allow this, perhaps more so pipes at the level
 of
 the mapreduce task. Can you provide more details on the application?
 On Feb 29, 2012, at 1:56 PM, Mark question wrote:
 
 Hi guys, thought I should ask this before I use it ... will using C
 over
 Hadoop give me the usual C memory management? For example, malloc() ,
 sizeof() ? My guess is no since this all will eventually be turned into
 bytecode, but I need more control on memory which obviously is hard for
 me
 to do with Java.
 
 Let me know of any advantages you know about streaming in C over
 hadoop.
 Thank you,
 Mark
 
 
 
 



Re: Streaming Hadoop using C

2012-02-29 Thread Charles Earl
I assume you have also just tried running locally and using the JDK performance 
tools (e.g. jmap) to gain insight, by configuring Hadoop to run the absolute minimum 
number of tasks?
Perhaps the discussion
http://grokbase.com/t/hadoop/common-user/11ahm67z47/how-do-i-connect-java-visual-vm-to-a-remote-task
might be relevant?
On Feb 29, 2012, at 3:53 PM, Mark question wrote:

 I've used Hadoop profiling (.prof) to show the stack trace, but it was hard
 to follow. I used jConsole locally, since I couldn't find a way to set a port
 number for the child processes when running them remotely. Linux commands
 (top, /proc) showed me that the virtual memory is almost twice my
 physical memory, which means swapping is happening, which is what I'm trying to
 avoid.
 
 So basically, is there a way to assign a port to child processes to monitor
 them remotely (asked before by Xun) or would you recommend another
 monitoring tool?
 
 Thank you,
 Mark
 
 
 On Wed, Feb 29, 2012 at 11:35 AM, Charles Earl charles.ce...@gmail.comwrote:
 
 Mark,
 So if I understand, it is more the memory management that you are
 interested in, rather than a need to run an existing C or C++ application
 in MapReduce platform?
 Have you done profiling of the application?
 C
 On Feb 29, 2012, at 2:19 PM, Mark question wrote:
 
 Thanks Charles .. I'm running Hadoop for research to perform duplicate
 detection methods. To go deeper, I need to understand what's slowing my
 program, which usually starts with analyzing memory to predict best input
 size for map task. So you're saying piping can help me control memory
 even
 though it's running on VM eventually?
 
 Thanks,
 Mark
 
 On Wed, Feb 29, 2012 at 11:03 AM, Charles Earl charles.ce...@gmail.com
 wrote:
 
 Mark,
 Both streaming and pipes allow this, perhaps more so pipes at the level
 of
 the mapreduce task. Can you provide more details on the application?
 On Feb 29, 2012, at 1:56 PM, Mark question wrote:
 
 Hi guys, thought I should ask this before I use it ... will using C
 over
 Hadoop give me the usual C memory management? For example, malloc() ,
 sizeof() ? My guess is no since this all will eventually be turned into
 bytecode, but I need more control on memory which obviously is hard for
 me
 to do with Java.
 
 Let me know of any advantages you know about streaming in C over
 hadoop.
 Thank you,
 Mark
 
 
 
 



Extending pipes to support binary data

2012-02-14 Thread Charles Earl
Hi,
I'm trying to extend the pipes interface as defined in Pipes.hh to
support reading binary input data.
I believe that would mean extending the getInputValue() method of the
context to return char *, which would then be memcpy'd to the appropriate
type inside the C++ pipes program.
I'm guessing the best way to do this would be to use a custom
InputFormat on the Java side that would have a BytesWritable value.
Is this the correct approach?
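
For what it's worth, here is one possible sketch of the BytesWritable route without 
changing Pipes.hh at all: getInputValue() hands back a std::string, which can already carry 
arbitrary bytes, so the C++ side can memcpy straight out of it. The record layout below (a 
packed array of doubles) and the class names are purely hypothetical, and the sketch assumes 
the Java-side InputFormat delivers BytesWritable values whose raw bytes survive the pipes 
protocol unchanged.

```cpp
#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"
#include "hadoop/StringUtils.hh"

#include <cstring>
#include <string>
#include <vector>

// Hypothetical mapper that reinterprets the pipes value as binary data.
class BinaryMapper : public HadoopPipes::Mapper {
public:
  BinaryMapper(HadoopPipes::TaskContext& context) {}
  void map(HadoopPipes::MapContext& context) {
    const std::string& raw = context.getInputValue();   // may hold arbitrary bytes
    size_t n = raw.size() / sizeof(double);
    std::vector<double> values(n);
    if (n > 0)
      std::memcpy(&values[0], raw.data(), n * sizeof(double));
    // ... real processing would go here; emit a trivial count as a placeholder ...
    context.emit(context.getInputKey(), HadoopUtils::toString((int)n));
  }
};

// Pass-through reducer so the sketch forms a complete pipes program.
class PassThroughReducer : public HadoopPipes::Reducer {
public:
  PassThroughReducer(HadoopPipes::TaskContext& context) {}
  void reduce(HadoopPipes::ReduceContext& context) {
    while (context.nextValue())
      context.emit(context.getInputKey(), context.getInputValue());
  }
};

int main() {
  return HadoopPipes::runTask(
      HadoopPipes::TemplateFactory<BinaryMapper, PassThroughReducer>());
}
```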

-- 
- Charles