Re: Hadoop-on-demand and torque
Ralph,

Do you have any YARN or Mesos performance comparison against HOD? I suppose since it was a customer requirement you might not have explored it. MPI support seems to be an active issue for Mesos now.

Charles

On May 21, 2012, at 10:36 AM, Ralph Castain r...@open-mpi.org wrote:

Not quite yet, though we are working on it (some descriptive stuff is around, but needs to be consolidated).

Several of us started working together a couple of months ago to support the MapReduce programming model on HPC clusters using Open MPI as the platform. In working with our customers and OMPI's wide community of users, we found that people were interested in this capability, wanted to integrate MPI support into their MapReduce jobs, and didn't want to migrate their clusters to YARN for various reasons.

We have released initial versions of two new tools in the OMPI developer's trunk, scheduled for inclusion in the upcoming 1.7.0 release:

1. mr+ - executes the MapReduce programming paradigm. Currently, we only support streaming data, though we will extend that support shortly. All HPC environments (rsh, SLURM, Torque, Alps, LSF, Windows, etc.) are supported. Both mappers and reducers can utilize MPI (independently or in combination) if they so choose. Mappers and reducers can be written in any of the typical HPC languages (C, C++, and Fortran) as well as Java (note: OMPI now comes with Java MPI bindings).

2. hdfsalloc - takes a list of files and obtains a resource allocation for the nodes upon which those files reside. SLURM and Moab/Maui are currently supported, with Gridengine coming soon.

There will be a public announcement of this in the near future, and we expect to integrate the Hadoop 1.0 and Hadoop 2.0 MR classes over the next couple of months. By the end of this summer, we should have a full-featured public release.

On May 20, 2012, at 2:10 PM, Brian Bockelman wrote:

Hi Ralph,

I admit - I've only been half-following the OpenMPI progress. Do you have a technical write-up of what has been done?

Thanks,
Brian

On May 20, 2012, at 9:31 AM, Ralph Castain wrote:

FWIW: Open MPI now has an initial cut at MR+ that runs map-reduce under any HPC environment. We don't have the Java integration yet to support the Hadoop MR class, but you can write a mapper/reducer and execute that programming paradigm. We plan to integrate the Hadoop MR class soon. If you already have that integration, we'd love to help port it over. We already have the MPI support completed, so any mapper/reducer could use it.

On May 20, 2012, at 7:12 AM, Pierre Antoine DuBoDeNa wrote:

We run a similar infrastructure in a university project. We plan to install Hadoop, and we are looking for Hadoop-based alternatives in case pure Hadoop does not work as expected. Keep us updated on the code release.

Best,
PA

2012/5/20 Stijn De Weirdt stijn.dewei...@ugent.be

hi all,

i'm part of an HPC group of a university, and we have some users that are interested in Hadoop to see if it can be useful in their research. we also have researchers that are using hadoop already on their own infrastructure, but that is not enough reason for us to start with dedicated Hadoop infrastructure (we are now only running torque based clusters with and without shared storage; setting up and properly maintaining Hadoop infrastructure requires quite some understanding of new software).

to be able to support these needs we wanted to do just this: use current HPC infrastructure to make private hadoop clusters so people can do some work. if we attract enough interest, we will probably set up dedicated infrastructure, but by that time we (the admins) will also have a better understanding of what is required.

so we used to look at HOD for testing/running hadoop on existing infrastructure (never really looked at myhadoop though), but (imho) the current HOD code base is not in such a good state. we did some work to get it working and added some features, only to come to the conclusion that it was not sufficient (and not maintainable). so we wrote something from scratch with the same functionality as HOD, and much more (eg HBase is now possible, with or without MR1; some default tuning; easy to add support for yarn instead of MR1). it has some support for torque, but my laptop is also sufficient. (the torque support is a wrapper to submit the job)

we gave a workshop on hadoop using it (25 people, each with their own 5 node hadoop cluster) and it went rather well. it's not in a public repo yet, but we could do that. if interested, let me know, and i'll see what can be done. (releasing the code is on our todo list, but if there is some demand, we can do it sooner)

stijn

On 05/18/2012 05:07 PM, Pierre Antoine DuBoDeNa wrote:

I am also interested to learn about myHadoop as I use a shared storage system and everything runs on VMs and not
Re: Text Analysis
If you've got existing R code, you might want to look at this Quora posting, http://www.quora.com/How-can-R-and-Hadoop-be-used-together (also by Cloudera), or the RHIPE R Hadoop package https://github.com/saptarshiguha/RHIPE/wiki

Mahout and Lucene/Solr offer some level of text analysis, although I would not call these complete text analysis packages. What I've found are specific algorithms as opposed to a complete package: for example, LDA for topic discovery -- Mahout and Yahoo Research (https://github.com/shravanmn/Yahoo_LDA) have Hadoop-based implementations. In the case of Yahoo_LDA the data is stored in HDFS, while the computation is essentially MPI-based. Whether the algorithm reads data from the HDFS store and uses an approach other than map reduce is another question.

C

On Apr 25, 2012, at 12:47 PM, Jagat wrote:

There are APIs which you can use; of course, they are third party.

--- Sent from Mobile, short and crisp.

On 25-Apr-2012 8:57 PM, Robert Evans ev...@yahoo-inc.com wrote:

Hadoop itself is the core Map/Reduce and HDFS functionality. The higher level algorithms like sentiment analysis are often done by others. Cloudera has a video from HadoopWorld 2010 about it http://www.cloudera.com/resource/hw10_video_sentiment_analysis_powered_by_hadoop/ And there are likely to be other tools like R that can help you out with it. I am not really sure if Mahout offers sentiment analysis or not, but you might want to look there too http://mahout.apache.org/

--Bobby Evans

On 4/25/12 7:50 AM, karanveer.si...@barclays.com wrote:

Hi,

I wanted to know if there are any existing APIs within Hadoop for us to do some text analysis like sentiment analysis, etc. OR are we to rely on tools like R, etc. for this.

Regards,
Karanveer
Re: Hadoop streaming or pipes ..
Also bear in mind that there is a kind of detour involved, in the sense that a pipes map must send key,value data back to the Java process and then on to the reduce (more or less). I think that the Hadoop C Extension (HCE, there is a patch) is supposed to be faster. I would be interested to know if the community has any experience with HCE performance.

C

On Apr 5, 2012, at 3:49 PM, Robert Evans ev...@yahoo-inc.com wrote:

Both streaming and pipes do very similar things. They will fork/exec a separate process that is running whatever you want it to run. The JVM that is running Hadoop then communicates with this process to send the data over and get the processing results back. The difference between streaming and pipes is that streaming uses stdin/stdout for this communication, so preexisting processing like grep, sed and awk can be used here. Pipes uses a custom protocol with a C++ library to communicate. The C++ library is tagged with SWIG compatible data so that it can be wrapped to have APIs in other languages like Python or Perl. I am not sure what the performance difference is between the two, but in my own work I have seen a significant performance penalty from using either of them, because there is a somewhat large overhead of sending all of the data out to a separate process just to read it back in again.

--Bobby Evans

On 4/5/12 1:54 PM, Mark question markq2...@gmail.com wrote:

Hi guys,

Quick question: Are there any performance gains from Hadoop streaming or pipes over Java? From what I've read, it's only to ease testing by using your favorite language. So I guess it is eventually translated to bytecode then executed. Is that true?

Thank you,
Mark
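To make the streaming contract Bobby describes concrete, here is a minimal word-count mapper sketch (not from the thread; the name run_mapper is our own). On the map side, streaming only requires that the executable read lines from stdin and write key/value pairs to stdout, with a tab separating the key from the value:

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Streaming map side: read text lines from `in`, emit one "word\t1" pair
// per whitespace-separated token on `out`. Hadoop streaming treats
// everything up to the first tab as the key.
void run_mapper(std::istream& in, std::ostream& out) {
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream words(line);
        std::string word;
        while (words >> word) {
            out << word << "\t" << 1 << "\n";
        }
    }
}
```

In the actual mapper executable, main() would simply call run_mapper(std::cin, std::cout); the streaming framework pipes each input split through the process's stdin/stdout, which is why preexisting tools like grep or awk also work unchanged.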
Re: Streaming Hadoop using C
How was your experience of starfish?

C

On Mar 1, 2012, at 12:35 AM, Mark question wrote:

Thank you for your time and suggestions. I've already tried starfish, but not jmap. I'll check it out.

Thanks again,
Mark

On Wed, Feb 29, 2012 at 1:17 PM, Charles Earl charles.ce...@gmail.com wrote:

I assume you have also just tried running locally and using the jdk performance tools (e.g. jmap) to gain insight by configuring hadoop to run the absolute minimum number of tasks? Perhaps the discussion http://grokbase.com/t/hadoop/common-user/11ahm67z47/how-do-i-connect-java-visual-vm-to-a-remote-task might be relevant?
Re: Streaming Hadoop using C
Mark,

Both streaming and pipes allow this, perhaps more so pipes, at the level of the mapreduce task. Can you provide more details on the application?

On Feb 29, 2012, at 1:56 PM, Mark question wrote:

Hi guys,

Thought I should ask this before I use it ... will using C over Hadoop give me the usual C memory management? For example, malloc(), sizeof()? My guess is no, since this all will eventually be turned into bytecode, but I need more control over memory, which obviously is hard for me to do with Java. Let me know of any advantages you know about streaming in C over Hadoop.

Thank you,
Mark
Re: Streaming Hadoop using C
Mark,

So if I understand, it is more the memory management that you are interested in, rather than a need to run an existing C or C++ application in the MapReduce platform? Have you done profiling of the application?

C

On Feb 29, 2012, at 2:19 PM, Mark question wrote:

Thanks Charles. I'm running Hadoop for research to perform duplicate detection methods. To go deeper, I need to understand what's slowing my program, which usually starts with analyzing memory to predict the best input size for a map task. So you're saying piping can help me control memory even though it's running on a VM eventually?

Thanks,
Mark

On Wed, Feb 29, 2012 at 11:03 AM, Charles Earl charles.ce...@gmail.com wrote:

Mark,

Both streaming and pipes allow this, perhaps more so pipes, at the level of the mapreduce task. Can you provide more details on the application?
Re: Streaming Hadoop using C
The documentation on Starfish http://www.cs.duke.edu/starfish/index.html looks promising, though I have not used it. I wonder if others on the list have found it more useful than setting mapred.task.profile.

C

On Feb 29, 2012, at 3:53 PM, Mark question wrote:

I've used hadoop profiling (.prof) to show the stack trace, but it was hard to follow. I used jConsole locally, since I couldn't find a way to set a port number for child processes when running them remotely. Linux commands (top, /proc) showed me that the virtual memory is almost twice my physical memory, which means swapping is happening, which is what I'm trying to avoid. So basically, is there a way to assign a port to child processes to monitor them remotely (asked before by Xun), or would you recommend another monitoring tool?

Thank you,
Mark
Re: Streaming Hadoop using C
I assume you have also just tried running locally and using the jdk performance tools (e.g. jmap) to gain insight by configuring hadoop to run the absolute minimum number of tasks? Perhaps the discussion http://grokbase.com/t/hadoop/common-user/11ahm67z47/how-do-i-connect-java-visual-vm-to-a-remote-task might be relevant?

On Feb 29, 2012, at 3:53 PM, Mark question wrote:

I've used hadoop profiling (.prof) to show the stack trace, but it was hard to follow. I used jConsole locally, since I couldn't find a way to set a port number for child processes when running them remotely. Linux commands (top, /proc) showed me that the virtual memory is almost twice my physical memory, which means swapping is happening, which is what I'm trying to avoid. So basically, is there a way to assign a port to child processes to monitor them remotely (asked before by Xun), or would you recommend another monitoring tool?

Thank you,
Mark
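On Mark's question about assigning a port to the child processes: one approach worth trying (a sketch on our part, not something confirmed in this thread) is to pass JMX options to the task JVMs through mapred.child.java.opts in mapred-site.xml, since Hadoop appends those options to every child JVM it forks. Something along these lines:

```xml
<!-- Sketch only: a fixed JMX port works when at most one child JVM runs
     per node at a time; concurrent tasks on one node would collide on
     port 8008, so this suits a minimum-tasks debugging configuration. -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=8008 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false</value>
</property>
```

jConsole or VisualVM could then attach remotely to tasktracker-host:8008 while the task runs.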
Extending pipes to support binary data
Hi,

I'm trying to extend the pipes interface as defined in Pipes.hh to support reading binary input data. I believe that would mean extending the getInputValue() method of the context to return char *, which would then be memcpy'd to the appropriate type inside the C++ pipes program. I'm guessing the best way to do this would be to use a custom InputFormat on the Java side that would have a BytesWritable value. Is this the correct approach?

-- 
- Charles
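One observation on the C++ side of the approach above: since std::string is binary-safe, the existing const std::string& returned by the context's getInputValue() can already carry raw bytes, so the memcpy can be done without changing the interface's return type. A hedged sketch (decode_doubles is our own name; it assumes the Java-side InputFormat delivers the bytes unmodified and already in host byte order):

```cpp
#include <cstring>
#include <string>
#include <vector>

// In an actual pipes Mapper, `raw` would be context.getInputValue():
// std::string is binary-safe, so with a Java-side InputFormat emitting
// BytesWritable values the raw bytes arrive intact and can be memcpy'd
// into the appropriate C type. Caveat: Java's DataOutput writes
// big-endian, so a byte swap may be needed on little-endian hosts; this
// sketch assumes the bytes are already in host order.
std::vector<double> decode_doubles(const std::string& raw) {
    std::vector<double> out(raw.size() / sizeof(double));
    if (!out.empty()) {
        std::memcpy(out.data(), raw.data(), out.size() * sizeof(double));
    }
    return out;
}
```

The same pattern works for any fixed-width record layout; only the element type and the byte-order handling change.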