Re: SequenceFiles, checkpoints, block size (Was: How to flush SequenceFile.Writer?)
Hi Brian, Writes to HDFS are not guaranteed to be flushed until the file is closed. In practice, as each (64MB) block is finished it is flushed and will be visible to other readers, which is what you were seeing. The addition of appends in HDFS changes this and adds a sync() method to FSDataOutputStream. You can read about the semantics of the new operations here: https://issues.apache.org/jira/secure/attachment/12370562/Appends.doc. Unfortunately, there are some problems with sync() that are still being worked through (https://issues.apache.org/jira/browse/HADOOP-4379). Also, even with sync() working, the append() on SequenceFile does not do an implicit sync() - it is not atomic. Furthermore, there is no way to get hold of the FSDataOutputStream to call sync() yourself - see https://issues.apache.org/jira/browse/HBASE-1155. (And don't get confused by the sync() method on SequenceFile.Writer - it is for another purpose entirely.) As Jason points out, the simplest way to achieve what you're trying to do is to close the file and start a new one. If you start to get too many small files, then you can have another process merge the smaller files in the background. Tom On Tue, Feb 3, 2009 at 3:57 AM, jason hadoop jason.had...@gmail.com wrote: If you have to do a time based solution, for now, simply close the file and stage it, then open a new file. Your reads will have to deal with the fact that the file is in multiple parts. Warning: Datanodes get pokey if they have large numbers of blocks, and the quickest way to do this is to create a lot of small files. On Mon, Feb 2, 2009 at 9:54 AM, Brian Long br...@dotspots.com wrote: Let me rephrase this problem... as stated below, when I start writing to a SequenceFile from an HDFS client, nothing is visible in HDFS until I've written 64M of data. This presents three problems: fsck reports the file system as corrupt until the first block is finally written out, the presence of the file (without any data) seems to blow up my mapred jobs that try to make use of it under my input path, and finally, I want to basically flush every 15 minutes or so, so I can mapred the latest data. I don't see any programmatic way to force the file to flush in 17.2. Additionally, dfs.checkpoint.period does not seem to be obeyed. Does that not do what I think it does? What controls the 64M limit, anyway? Is it dfs.checkpoint.size or dfs.block.size? Is the buffering happening on the client, or on data nodes? Or in the namenode? It seems really bad that a SequenceFile, upon creation, is in an unusable state from the perspective of a mapred job, and also leaves fsck in a corrupt state. Surely I must be doing something wrong... but what? How can I ensure that a SequenceFile is immediately usable (but empty) on creation, and how can I make things flush on some regular time interval? Thanks, Brian On Thu, Jan 29, 2009 at 4:17 PM, Brian Long br...@dotspots.com wrote: I have a SequenceFile.Writer that I obtained via SequenceFile.createWriter and write to using append(key, value). Because the writer volume is low, it's not uncommon for it to take over a day for my appends to finally be flushed to HDFS (e.g. the new file will sit at 0 bytes for over a day). Because I am running map/reduce tasks on this data multiple times a day, I want to flush the sequence file so the mapred jobs can pick it up when they run. What's the right way to do this? I'm assuming it's a fairly common use case. Also -- are writes to the sequence files atomic? (e.g. 
if I am actively appending to a sequence file, is it always safe to read from that same file in a mapred job?) To be clear, I want the flushing to be time based (controlled explicitly by the app), not size based. Will this create waste in HDFS somehow? Thanks, Brian
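In code, the close-and-roll approach described above looks roughly like the following. This is a minimal sketch against the 0.18/0.19-era API; the class name, the file-naming scheme and the 15-minute interval are illustrative choices, and the small files it produces would still need the background merge step mentioned above.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class RollingSequenceWriter {
      private static final long ROLL_INTERVAL_MS = 15 * 60 * 1000L; // time-based flush horizon

      private final FileSystem fs;
      private final Configuration conf;
      private final Path dir;
      private SequenceFile.Writer writer;
      private long openedAt;
      private int part = 0;

      public RollingSequenceWriter(Configuration conf, Path dir) throws IOException {
        this.conf = conf;
        this.fs = FileSystem.get(conf);
        this.dir = dir;
        roll();
      }

      // Close the current file (making it visible to readers and mapred jobs)
      // and open a fresh one.
      private synchronized void roll() throws IOException {
        if (writer != null) {
          writer.close();
        }
        Path path = new Path(dir, "events-" + System.currentTimeMillis() + "-" + (part++));
        writer = SequenceFile.createWriter(fs, conf, path, LongWritable.class, Text.class);
        openedAt = System.currentTimeMillis();
      }

      public synchronized void append(LongWritable key, Text value) throws IOException {
        if (System.currentTimeMillis() - openedAt > ROLL_INTERVAL_MS) {
          roll();
        }
        writer.append(key, value);
      }

      public synchronized void close() throws IOException {
        writer.close();
      }
    }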
Re: hadoop to ftp files into hdfs
NLineInputFormat is ideal for this purpose. Each split will be N lines of input (where N is configurable), so each mapper can retrieve N files for insertion into HDFS. You can set the number of reducers to zero. Tom On Tue, Feb 3, 2009 at 4:23 AM, jason hadoop jason.had...@gmail.com wrote: If you have a large number of ftp urls spread across many sites, simply set that file to be your hadoop job input, and force the input split to be a size that gives you good distribution across your cluster. On Mon, Feb 2, 2009 at 3:23 PM, Steve Morin steve.mo...@gmail.com wrote: Does anyone have a good suggestion on how to submit a hadoop job that will split the ftp retrieval of a number of files for insertion into hdfs? I have been searching google for suggestions on this matter. Steve
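A rough sketch of the NLineInputFormat wiring for the FTP-retrieval job, using the old mapred API; the class and job names here are made up for illustration, and the actual FTP fetch is left as a stub inside the mapper.

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.lib.NLineInputFormat;

    public class FtpLoadJob {

      public static class FtpFetchMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable offset, Text ftpUrl,
            OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
          // Fetch ftpUrl here, write the content into HDFS, then record the result.
          output.collect(ftpUrl, new Text("fetched"));
        }
      }

      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(FtpLoadJob.class);
        conf.setJobName("ftp-to-hdfs");

        conf.setInputFormat(NLineInputFormat.class);
        // Each map task receives N lines (N FTP URLs) of the input file.
        conf.setInt("mapred.line.input.format.linespermap", 10);

        conf.setMapperClass(FtpFetchMapper.class);
        conf.setNumReduceTasks(0); // map-only, as suggested above
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
      }
    }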
Hadoop-KFS-FileSystem API
Hi, I am looking to use KFS as storage with the Hadoop FileSystem API. http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/fs/kfs/package-summary.html This page describes KFS usage with Hadoop, and the last step it lists is to run the map/reduce tracker. Is it necessary to turn it on? Can I use KFS for storage only, through the FileSystem API? Thanks Wasim
Re: problem with completion notification from block movement
On Mon, 2009-02-02 at 20:06 -0800, jason hadoop wrote: This can be made significantly worse by your underlying host file system and the disks that support it. Oh, yes, we know... It was a late-realized mistake just yesterday that we weren't using noatime on that cluster's slaves. The attached graph is instructive. We have our nightly-rotated logs for DataNode all the way back to when this test cluster was created in November. This morning on one node, I sampled the first 10 BlockReport scan lines from each day's log, up through the current hour today, and handed it to gnuplot to graph. The seriously erratic behavior that begins around the 900K-1M point is very disturbing. Immediate solutions for us include noatime, nodiratime, BIOS upgrade on the discs, and eliminating enough small files (blocks) in DFS to get the total count below 400K.
Is hadoop right for my problem
I have an application I would like to apply hadoop to, but I'm not sure if the tasking is too small. I have a file that contains between 70,000 - 400,000 records. All the records can be processed in parallel and I can currently process them at 400 records a second single threaded (give or take). I thought I read somewhere (one of the tutorials) that the mapper tasks should run at least for a minute to offset the overhead in creating them. Is this really the case? I am pretty sure that a one-to-one record-to-mapper mapping is overkill, but I am wondering if batching them up for the mapper is still the way to go or if I should look at some other framework to help split up the processing. Any insight would be appreciated. Thanks Chris -- View this message in context: http://www.nabble.com/Is-hadoop-right-for-my-problem-tp21811122p21811122.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: Hadoop-KFS-FileSystem API
Hi Wasim, Here is what you could do. 1. Deploy KFS 2. Build a hadoop-site.xml config file and set fs.default.name and other config variables to point to KFS as described by (http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/fs/kfs/package-summary.html) 3. If you place this hadoop-site.xml in a directory, say ~/foo/myconf, then you could use hadoop commands such as ./bin/hadoop --config ~/foo/myconf fs -ls /foo/bar.txt 4. If you want to use the Hadoop FileSystem API, just put this directory as the first entry in your classpath, so that the new Configuration object loads this hadoop-site.xml and your FileSystem API calls talk to KFS. 5. Alternatively you could also create an object of KosmosFileSystem, which extends FileSystem. Look at org.apache.hadoop.fs.kfs.KosmosFileSystem for example. Lohit - Original Message From: Wasim Bari wasimb...@msn.com To: core-user@hadoop.apache.org Sent: Tuesday, February 3, 2009 3:03:51 AM Subject: Hadoop-KFS-FileSystem API Hi, I am looking to use KFS as storage with the Hadoop FileSystem API. http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/fs/kfs/package-summary.html This page describes KFS usage with Hadoop, and the last step it lists is to run the map/reduce tracker. Is it necessary to turn it on? Can I use KFS for storage only, through the FileSystem API? Thanks Wasim
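For steps 4 and 5, pointing the FileSystem API at KFS from code looks roughly like the sketch below; the kfs:// URI, host and port are placeholders for whatever your metaserver uses, and the property names are the ones listed in the KFS package documentation linked above.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class KfsListExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Equivalent to putting these in a hadoop-site.xml on the classpath.
        conf.set("fs.default.name", "kfs://metaserver.example.com:20000");
        conf.set("fs.kfs.metaServerHost", "metaserver.example.com");
        conf.set("fs.kfs.metaServerPort", "20000");

        FileSystem fs = FileSystem.get(conf); // resolves to KosmosFileSystem
        for (FileStatus stat : fs.listStatus(new Path("/foo"))) {
          System.out.println(stat.getPath());
        }
        fs.close();
      }
    }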
HDD benchmark/checking tool
Dear hadoop users, Recently I have had a number of drive failures that slowed down processes a lot until they were discovered. Is there any easy way or tool to check HDD performance and see if there are any IO errors? Currently I wrote a simple script that looks at /var/log/messages and greps everything abnormal for /dev/sdaX. But if you have a better solution I'd appreciate it if you share it. --- Dmitry Pushkarev +1-650-644-8988
Re: Is hadoop right for my problem
Hey Chris, I think it would be appropriate. Look at it this way, it takes 1 mapper 1 minute to process 24k records, so it should take about 17 mappers to process all your tasks for the largest problem in one minute. Even if you still think your problem is too small, consider: 1) The possibility of growth in your application. Your processing becomes future-proof - you have a pretty solid way to scale out as your task grows. Just add new machines -- you don't have to invest in a small-scale framework then rewrite in a year. 2) The benefits of having a framework do the heavy lifting. There's a surprising amount of roll-your-own work that you end up doing when you decide to break out of a single thread. By framing your problem as a map-reduce problem, you get to skip a lot of these steps and just focus on solving your problem (also: beware that it's very sexy to build your own MapReduce framework. Anything which is very sexy takes up more time and money than you think possible at the outset). Brian On Feb 3, 2009, at 8:34 AM, cdwillie76 wrote: I have an application I would like to apply hadoop to, but I'm not sure if the tasking is too small. I have a file that contains between 70,000 - 400,000 records. All the records can be processed in parallel and I can currently process them at 400 records a second single threaded (give or take). I thought I read somewhere (one of the tutorials) that the mapper tasks should run at least for a minute to offset the overhead in creating them. Is this really the case? I am pretty sure that a one-to-one record-to-mapper mapping is overkill, but I am wondering if batching them up for the mapper is still the way to go or if I should look at some other framework to help split up the processing. Any insight would be appreciated. Thanks Chris -- View this message in context: http://www.nabble.com/Is-hadoop-right-for-my-problem-tp21811122p21811122.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
Cascading support of HBase
Hey all, Just wanted to let everyone know the HBase Hackathon was a success (thanks Streamy!), and we now have Cascading Tap adapters for HBase. You can find the link here (along with additional third-party extensions in the near future). http://www.cascading.org/modules.html Please feel free to give it a test, clone the repo, and submit patches back to me via GitHub. The unit test shows how simple it is to import/export data to/from an HBase cluster. http://bit.ly/fIpAE For those not familiar with Cascading, it is an alternative processing API that is generally more natural to develop and think in than MapReduce. http://www.cascading.org/ enjoy, chris p.s. If you have any code you want to contribute back, just stick it on GitHub and send me a link. -- Chris K Wensel ch...@wensel.net http://www.cascading.org/ http://www.scaleunlimited.com/
extreme newbie needs help setting up hadoop
Good afternoon all, I work in tech and am an extreme newbie at Hadoop. I could sure use some help. I have a professor wanting Hadoop installed on multiple Linux computers in a lab. The computers are running CentOS 5. I know I have something configured wrong and am not sure where to go. I am following the instructions at http://www.cs.brandeis.edu/~cs147a/lab/hadoop-cluster/ I get to the part Testing Your Hadoop Cluster, but when I use the command hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+' it hangs. Could anyone be kind enough to point me to a step-by-step installation and configuration website? Thank you Brian
Re: extreme newbie needs help setting up hadoop
Hello Brian , Here is the Hadoop project Wiki link which covers detailed Hadoop setup and running your first program on Single node as well as on multiple nodes . Also below are some more useful links to start understanding and using Hadoop. http://hadoop.apache.org/core/docs/current/quickstart.html http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster) http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster) If you still have difficulties running Hadoop based programs please reply with error output so that experts can comment. - Ravi On 2/3/09 10:08 AM, bjday bj...@cse.usf.edu wrote: Good afternoon all, I work tech and an extreme nubbie at hadoop. I could sure use some help. I have a professor wanting hadoop installed on multiple Linux computers in a lab. The computers are running CentOS 5. I know i have something configured wrong and am not sure where to go. I am following the instructions at http://www.cs.brandeis.edu/~cs147a/lab/hadoop-cluster/ I get to the part Testing Your Hadoop Cluster but when i use the command hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+' It hangs. Could anyone be kind enough to point me to a step by step instillation and configuration website? Thank you Brian Ravi Phulari Yahoo! IM : ravescorp |Office Phone: (408)-336-0806 | --
Re: Control over max map/reduce tasks per job
Hey Jonathan Are you looking to limit the total number of concurrent mapper/ reducers a single job can consume cluster wide, or limit the number per node? That is, you have X mappers/reducers, but only can allow N mappers/ reducers to run at a time globally, for a given job. Or, you are cool with all X running concurrently globally, but want to guarantee that no node can run more than N tasks from that job? Or both? just reconciling the conversation we had last week with this thread. ckw On Feb 3, 2009, at 11:16 AM, Jonathan Gray wrote: All, I have a few relatively small clusters (5-20 nodes) and am having trouble keeping them loaded with my MR jobs. The primary issue is that I have different jobs that have drastically different patterns. I have jobs that read/write to/from HBase or Hadoop with minimal logic (network throughput bound or io bound), others that perform crawling (network latency bound), and one huge parsing streaming job (very CPU bound, each task eats a core). I'd like to launch very large numbers of tasks for network latency bound jobs, however the large CPU bound job means I have to keep the max maps allowed per node low enough as to not starve the Datanode and Regionserver. I'm an HBase dev but not familiar enough with Hadoop MR code to even know what would be involved with implementing this. However, in talking with other users, it seems like this would be a well-received option. I wanted to ping the list before filing an issue because it seems like someone may have thought about this in the past. Thanks. Jonathan Gray -- Chris K Wensel ch...@wensel.net http://www.cascading.org/ http://www.scaleunlimited.com/
RE: Control over max map/reduce tasks per job
Chris, For my specific use cases, it would be best to be able to set N mappers/reducers per job per node (so I can explicitly say, run at most 2 at a time of this CPU bound task on any given node). However, the other way would work as well (on 10 node system, would set job to max 20 tasks at a time globally), but opens up the possibility that a node could be assigned more than 2 of that task. I would work with whatever is easiest to implement as either would be a vast improvement for me (can run high numbers of network latency bound tasks without fear of cpu bound tasks killing the cluster). JG -Original Message- From: Chris K Wensel [mailto:ch...@wensel.net] Sent: Tuesday, February 03, 2009 11:34 AM To: core-user@hadoop.apache.org Subject: Re: Control over max map/reduce tasks per job Hey Jonathan Are you looking to limit the total number of concurrent mapper/ reducers a single job can consume cluster wide, or limit the number per node? That is, you have X mappers/reducers, but only can allow N mappers/ reducers to run at a time globally, for a given job. Or, you are cool with all X running concurrently globally, but want to guarantee that no node can run more than N tasks from that job? Or both? just reconciling the conversation we had last week with this thread. ckw On Feb 3, 2009, at 11:16 AM, Jonathan Gray wrote: All, I have a few relatively small clusters (5-20 nodes) and am having trouble keeping them loaded with my MR jobs. The primary issue is that I have different jobs that have drastically different patterns. I have jobs that read/write to/from HBase or Hadoop with minimal logic (network throughput bound or io bound), others that perform crawling (network latency bound), and one huge parsing streaming job (very CPU bound, each task eats a core). I'd like to launch very large numbers of tasks for network latency bound jobs, however the large CPU bound job means I have to keep the max maps allowed per node low enough as to not starve the Datanode and Regionserver. I'm an HBase dev but not familiar enough with Hadoop MR code to even know what would be involved with implementing this. However, in talking with other users, it seems like this would be a well-received option. I wanted to ping the list before filing an issue because it seems like someone may have thought about this in the past. Thanks. Jonathan Gray -- Chris K Wensel ch...@wensel.net http://www.cascading.org/ http://www.scaleunlimited.com/
Re: Control over max map/reduce tasks per job
An alternative is to have 2 Tasktracker clusters, where the nodes are on the same machines. One cluster is for IO intensive jobs and has a low number of map/reduces per tracker, the other cluster is for cpu intensive jobs and has a high number of map/reduces per tracker. The alternative, simpler method is to use a multi-threaded mapper on the cpu intensive jobs, where you tune the thread count on a per job basis. In the longer term being able to alter the Tasktracker control parameters at run time on a per job basis would be wonderful. On Tue, Feb 3, 2009 at 12:01 PM, Nathan Marz nat...@rapleaf.com wrote: This is a great idea. For me, this is related to: https://issues.apache.org/jira/browse/HADOOP-5160. Being able to set the number of tasks per machine on a job by job basis would allow me to solve my problem in a different way. Looking at the Hadoop source, it's also probably simpler than changing how Hadoop schedules tasks. On Feb 3, 2009, at 11:44 AM, Jonathan Gray wrote: Chris, For my specific use cases, it would be best to be able to set N mappers/reducers per job per node (so I can explicitly say, run at most 2 at a time of this CPU bound task on any given node). However, the other way would work as well (on 10 node system, would set job to max 20 tasks at a time globally), but opens up the possibility that a node could be assigned more than 2 of that task. I would work with whatever is easiest to implement as either would be a vast improvement for me (can run high numbers of network latency bound tasks without fear of cpu bound tasks killing the cluster). JG -Original Message- From: Chris K Wensel [mailto:ch...@wensel.net] Sent: Tuesday, February 03, 2009 11:34 AM To: core-user@hadoop.apache.org Subject: Re: Control over max map/reduce tasks per job Hey Jonathan Are you looking to limit the total number of concurrent mapper/ reducers a single job can consume cluster wide, or limit the number per node? That is, you have X mappers/reducers, but only can allow N mappers/ reducers to run at a time globally, for a given job. Or, you are cool with all X running concurrently globally, but want to guarantee that no node can run more than N tasks from that job? Or both? just reconciling the conversation we had last week with this thread. ckw On Feb 3, 2009, at 11:16 AM, Jonathan Gray wrote: All, I have a few relatively small clusters (5-20 nodes) and am having trouble keeping them loaded with my MR jobs. The primary issue is that I have different jobs that have drastically different patterns. I have jobs that read/write to/from HBase or Hadoop with minimal logic (network throughput bound or io bound), others that perform crawling (network latency bound), and one huge parsing streaming job (very CPU bound, each task eats a core). I'd like to launch very large numbers of tasks for network latency bound jobs, however the large CPU bound job means I have to keep the max maps allowed per node low enough as to not starve the Datanode and Regionserver. I'm an HBase dev but not familiar enough with Hadoop MR code to even know what would be involved with implementing this. However, in talking with other users, it seems like this would be a well-received option. I wanted to ping the list before filing an issue because it seems like someone may have thought about this in the past. Thanks. Jonathan Gray -- Chris K Wensel ch...@wensel.net http://www.cascading.org/ http://www.scaleunlimited.com/
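For reference, the multi-threaded mapper approach mentioned above can be wired up per job with the old mapred API's MultithreadedMapRunner; the property name below is the one I believe that runner reads, so treat it as an assumption and check the lib package of your Hadoop version.

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.MultithreadedMapRunner;

    public class MultithreadedJobSetup {
      public static void configure(JobConf conf) {
        conf.setMapRunnerClass(MultithreadedMapRunner.class);
        // Tuned per job: number of concurrent map() calls inside one task JVM.
        conf.setInt("mapred.map.multithreadedrunner.threads", 4);
      }
    }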
Re: Control over max map/reduce tasks per job
Another use case for per-job task limits is being able to use every core in the cluster on a map-only job. On Feb 3, 2009, at 11:44 AM, Jonathan Gray wrote: Chris, For my specific use cases, it would be best to be able to set N mappers/reducers per job per node (so I can explicitly say, run at most 2 at a time of this CPU bound task on any given node). However, the other way would work as well (on 10 node system, would set job to max 20 tasks at a time globally), but opens up the possibility that a node could be assigned more than 2 of that task. I would work with whatever is easiest to implement as either would be a vast improvement for me (can run high numbers of network latency bound tasks without fear of cpu bound tasks killing the cluster). JG -Original Message- From: Chris K Wensel [mailto:ch...@wensel.net] Sent: Tuesday, February 03, 2009 11:34 AM To: core-user@hadoop.apache.org Subject: Re: Control over max map/reduce tasks per job Hey Jonathan Are you looking to limit the total number of concurrent mapper/ reducers a single job can consume cluster wide, or limit the number per node? That is, you have X mappers/reducers, but only can allow N mappers/ reducers to run at a time globally, for a given job. Or, you are cool with all X running concurrently globally, but want to guarantee that no node can run more than N tasks from that job? Or both? just reconciling the conversation we had last week with this thread. ckw On Feb 3, 2009, at 11:16 AM, Jonathan Gray wrote: All, I have a few relatively small clusters (5-20 nodes) and am having trouble keeping them loaded with my MR jobs. The primary issue is that I have different jobs that have drastically different patterns. I have jobs that read/write to/from HBase or Hadoop with minimal logic (network throughput bound or io bound), others that perform crawling (network latency bound), and one huge parsing streaming job (very CPU bound, each task eats a core). I'd like to launch very large numbers of tasks for network latency bound jobs, however the large CPU bound job means I have to keep the max maps allowed per node low enough as to not starve the Datanode and Regionserver. I'm an HBase dev but not familiar enough with Hadoop MR code to even know what would be involved with implementing this. However, in talking with other users, it seems like this would be a well-received option. I wanted to ping the list before filing an issue because it seems like someone may have thought about this in the past. Thanks. Jonathan Gray -- Chris K Wensel ch...@wensel.net http://www.cascading.org/ http://www.scaleunlimited.com/
Re: FileInputFormat directory traversal
Hmm. Based on your reasons, an extension to FileInputFormat for the lib package seems more in order. I'll try to hack something up and file a Jira issue. Ian On Feb 3, 2009, at 4:28 PM, Doug Cutting wrote: Hi, Ian. One reason is that a MapFile is represented by a directory containing two files named index and data. SequenceFileInputFormat handles MapFiles too by, if an input file is a directory containing a data file, using that file. Another reason is that's what reduces generate. Neither reason implies that this is the best or only way of doing things. It would probably be better if FileInputFormat optionally supported recursive file enumeration. (It would be incompatible and thus cannot be the default mode.) Please file an issue in Jira for this and attach your patch. Thanks, Doug Ian Soboroff wrote: Is there a reason FileInputFormat only traverses the first level of directories in its InputPaths? (i.e., given an InputPath of 'foo', it will get foo/* but not foo/bar/*). I wrote a full depth-first traversal in my custom InputFormat which I can offer as a patch. But to do it I had to duplicate the PathFilter classes in FileInputFormat which are marked private, so a mainline patch would also touch FileInputFormat. Ian
Re: Control over max map/reduce tasks per job
This is a great idea. For me, this is related to: https://issues.apache.org/jira/browse/HADOOP-5160 . Being able to set the number of tasks per machine on a job by job basis would allow me to solve my problem in a different way. Looking at the Hadoop source, it's also probably simpler than changing how Hadoop schedules tasks. On Feb 3, 2009, at 11:44 AM, Jonathan Gray wrote: Chris, For my specific use cases, it would be best to be able to set N mappers/reducers per job per node (so I can explicitly say, run at most 2 at a time of this CPU bound task on any given node). However, the other way would work as well (on 10 node system, would set job to max 20 tasks at a time globally), but opens up the possibility that a node could be assigned more than 2 of that task. I would work with whatever is easiest to implement as either would be a vast improvement for me (can run high numbers of network latency bound tasks without fear of cpu bound tasks killing the cluster). JG -Original Message- From: Chris K Wensel [mailto:ch...@wensel.net] Sent: Tuesday, February 03, 2009 11:34 AM To: core-user@hadoop.apache.org Subject: Re: Control over max map/reduce tasks per job Hey Jonathan Are you looking to limit the total number of concurrent mapper/ reducers a single job can consume cluster wide, or limit the number per node? That is, you have X mappers/reducers, but only can allow N mappers/ reducers to run at a time globally, for a given job. Or, you are cool with all X running concurrently globally, but want to guarantee that no node can run more than N tasks from that job? Or both? just reconciling the conversation we had last week with this thread. ckw On Feb 3, 2009, at 11:16 AM, Jonathan Gray wrote: All, I have a few relatively small clusters (5-20 nodes) and am having trouble keeping them loaded with my MR jobs. The primary issue is that I have different jobs that have drastically different patterns. I have jobs that read/write to/from HBase or Hadoop with minimal logic (network throughput bound or io bound), others that perform crawling (network latency bound), and one huge parsing streaming job (very CPU bound, each task eats a core). I'd like to launch very large numbers of tasks for network latency bound jobs, however the large CPU bound job means I have to keep the max maps allowed per node low enough as to not starve the Datanode and Regionserver. I'm an HBase dev but not familiar enough with Hadoop MR code to even know what would be involved with implementing this. However, in talking with other users, it seems like this would be a well-received option. I wanted to ping the list before filing an issue because it seems like someone may have thought about this in the past. Thanks. Jonathan Gray -- Chris K Wensel ch...@wensel.net http://www.cascading.org/ http://www.scaleunlimited.com/
hadoop dfs -test question (with a bit o' Ruby)
I'm at my wit's end. I want to do a simple test for the existence of a file on Hadoop. Here is the Ruby code I'm trying: val = `hadoop dfs -test -e s3n://holeinthebucket/user/hadoop/file.txt` puts "Val: #{val}" if val == 1 # do one thing else # do another end I never get a return value for the -test command. Is there something I'm missing? How should the return value be retrieved? Thanks, SD
Re: hadoop dfs -test question (with a bit o' Ruby)
Coincidentally I'm aware of the AWS::S3 package in Ruby but I'd prefer to avoid that... On Tue, Feb 3, 2009 at 5:02 PM, S D sd.codewarr...@gmail.com wrote: I'm at my wit's end. I want to do a simple test for the existence of a file on Hadoop. Here is the Ruby code I'm trying: val = `hadoop dfs -test -e s3n://holeinthebucket/user/hadoop/file.txt` puts Val: #{val} if val == 1 // do one thing else // do another end I never get a return value for the -test command. Is there something I'm missing? How should the return value be retrieved? Thanks, SD
Unable to pull data from DB in Hadoop job
I am trying to pull data out from an oracle database using a hadoop job and am getting the following error which I am unable to debug. 09/02/03 15:32:51 INFO mapred.JobClient: Task Id : attempt_200902012320_0022_m_01_2, Status : FAILED java.io.IOException: ORA-00933: SQL command not properly ended at org.apache.hadoop.mapred.lib.db.DBInputFormat.getRecordReader(DBInputFormat.java:289) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321) at org.apache.hadoop.mapred.Child.main(Child.java:155) java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217) at LoadTable1.run(LoadTable1.java:130) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at LoadTable1.main(LoadTable1.java:107) Can anyone give me pointers on this? Thanks Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz
Re: FileInputFormat directory traversal
Hi, Ian. One reason is that a MapFile is represented by a directory containing two files named index and data. SequenceFileInputFormat handles MapFiles too by, if an input file is a directory containing a data file, using that file. Another reason is that's what reduces generate. Neither reason implies that this is the best or only way of doing things. It would probably be better if FileInputFormat optionally supported recursive file enumeration. (It would be incompatible and thus cannot be the default mode.) Please file an issue in Jira for this and attach your patch. Thanks, Doug Ian Soboroff wrote: Is there a reason FileInputFormat only traverses the first level of directories in its InputPaths? (i.e., given an InputPath of 'foo', it will get foo/* but not foo/bar/*). I wrote a full depth-first traversal in my custom InputFormat which I can offer as a patch. But to do it I had to duplicate the PathFilter classes in FileInputFormat which are marked private, so a mainline patch would also touch FileInputFormat. Ian
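Not Ian's actual patch, but a rough illustration of what a depth-first enumeration over a FileSystem looks like with the 0.19-era API (where FileStatus.isDir() is available); a real InputFormat would also want to re-apply the hidden-file filtering that FileInputFormat does.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RecursiveLister {
      // Collect every regular file under root, descending into subdirectories.
      public static List<FileStatus> listRecursively(FileSystem fs, Path root)
          throws IOException {
        List<FileStatus> result = new ArrayList<FileStatus>();
        for (FileStatus stat : fs.listStatus(root)) {
          if (stat.isDir()) {
            result.addAll(listRecursively(fs, stat.getPath()));
          } else {
            result.add(stat);
          }
        }
        return result;
      }
    }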
Re: HDD benchmark/checking tool
Dmitry, Look into cluster/system monitoring tools: nagios and ganglia are two to start with. - Aaron On Tue, Feb 3, 2009 at 9:53 AM, Dmitry Pushkarev u...@stanford.edu wrote: Dear hadoop users, Recently I have had a number of drive failures that slowed down processes a lot until they were discovered. It is there any easy way or tool, to check HDD performance and see if there any IO errors? Currently I wrote a simple script that looks at /var/log/messages and greps everything abnormal for /dev/sdaX. But if you have better solution I'd appreciate if you share it. --- Dmitry Pushkarev +1-650-644-8988
Re: HDD benchmark/checking tool
Also, you want to look at SMART hard drive monitoring (most drives support SMART at this point) and combine it with Nagios. It often lets us know when a hard drive is about to fail *and* when the drive is under-performing. Brian On Feb 3, 2009, at 6:18 PM, Aaron Kimball wrote: Dmitry, Look into cluster/system monitoring tools: nagios and ganglia are two to start with. - Aaron On Tue, Feb 3, 2009 at 9:53 AM, Dmitry Pushkarev u...@stanford.edu wrote: Dear hadoop users, Recently I have had a number of drive failures that slowed down processes a lot until they were discovered. Is there any easy way or tool to check HDD performance and see if there are any IO errors? Currently I wrote a simple script that looks at /var/log/messages and greps everything abnormal for /dev/sdaX. But if you have a better solution I'd appreciate it if you share it. --- Dmitry Pushkarev +1-650-644-8988
Re: Hadoop's reduce tasks freeze at 0%.
Alternatively, maybe all the TaskTracker nodes can contact the NameNode and JobTracker, but cannot communicate with one another due to firewall issues? - Aaron On Mon, Feb 2, 2009 at 7:54 PM, jason hadoop jason.had...@gmail.com wrote: A reduce stall at 0% implies that the map tasks are not outputting any records via the output collector. You need to go look at the task tracker and the task logs on all of your slave machines, to see if anything that seems odd appears in the logs. On the tasktracker web interface detail screen for your job, Are all of the map tasks finished Are any of the map tasks started Are there any Tasktracker nodes to service your job On Sun, Feb 1, 2009 at 11:41 PM, Kwang-Min Choi kmbest.c...@samsung.com wrote: I'm newbie in Hadoop. and i'm trying to follow Hadoop Quick Guide at hadoop homepage. but, there are some problems... Downloading, unzipping hadoop is done. and ssh successfully operate without password phrase. once... I execute grep example attached to Hadoop... map task is ok. it reaches 100%. but reduce task freezes at 0% without any error message. I've waited it for more than 1 hour, but it still freezes... same job in standalone mode is well done... i tried it with version 0.18.3 and 0.17.2.1. all of them had same problem. could help me to solve this problem? Additionally... I'm working on cloud-infra of GoGrid(Redhat). So, disk's space health is OK. and, i've installed JDK 1.6.11 for linux successfully. - KKwams
Re: extra documentation on how to write your own partitioner class
er? It seems to be using value.get(). That having been said, you should really partition based on key, not on value. (I am not sure why, exactly, the value is provided to the getPartition() method.) Moreover, I think the problem is that you are using division ( / ) not modulus ( % ). Your code simplifies to: (value.get() / T) / (T / numPartitions) = value.get() * numPartitions / T^2. The contract of getPartition() is that it returns a value in [0, numPartitions). The division operators are not guaranteed to return anything in this range, but (foo % numPartitions) will always do the right thing. So it's probably just assigning everything to reduce partition 0. (Alternatively, it could be that value * numPartitions < T^2 for any values of T you're testing with, which means that integer division will return 0.) - Aaron On Fri, Jan 30, 2009 at 3:43 PM, Sandy snickerdoodl...@gmail.com wrote: Hi James, Thank you very much! :-) -SM On Fri, Jan 30, 2009 at 4:17 PM, james warren ja...@rockyou.com wrote: Hello Sandy - Your partitioner isn't using any information from the key/value pair - it's only using the value T which is read once from the job configuration. getPartition() will always return the same value, so all of your data is being sent to one reducer. :P cheers, -James On Fri, Jan 30, 2009 at 1:32 PM, Sandy snickerdoodl...@gmail.com wrote: Hello, Could someone point me toward some more documentation on how to write one's own partition class? I am having quite a bit of trouble getting mine to work. So far, it looks something like this: public class myPartitioner extends MapReduceBase implements Partitioner<IntWritable, IntWritable> { private int T; public void configure(JobConf job) { super.configure(job); String myT = job.get("tval");//this is user defined T = Integer.parseInt(myT); } public int getPartition(IntWritable key, IntWritable value, int numReduceTasks) { int newT = (T/numReduceTasks); int id = (value.get() / T); return (int)(id/newT); } } In the run() function of my M/R program I just set it using: conf.setPartitionerClass(myPartitioner.class); Is there anything else I need to set in the run() function? The code compiles fine. When I run it, I know it is using the partitioner, since I get different output than if I just let it use HashPartitioner. However, it is not splitting between the reducers at all! If I set the number of reducers to 2, all the output shows up in part-0, while part-1 has nothing. I am having trouble debugging this since I don't know how I can observe the values of numReduceTasks (which I assume is being set by the system). Is this a proper assumption? If I try to insert any println() statements in the function, it isn't outputted to either my terminal or my log files. Could someone give me some general advice on how best to debug pieces of code like this?
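Putting Aaron's points together, a corrected partitioner along those lines might look like the following sketch: it partitions on the key rather than the value and reduces modulo numReduceTasks so the result always lands in [0, numReduceTasks). The "tval" property is the same user-defined value as in the original code; the banding scheme itself is only illustrative.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Partitioner;

    public class MyPartitioner extends MapReduceBase
        implements Partitioner<IntWritable, IntWritable> {

      private int T;

      public void configure(JobConf job) {
        super.configure(job);
        T = Integer.parseInt(job.get("tval")); // user-defined band size
      }

      public int getPartition(IntWritable key, IntWritable value, int numReduceTasks) {
        int band = key.get() / T; // which T-sized band the key falls into
        // Modulus keeps the result in range, even for large (or negative) bands.
        return (band % numReduceTasks + numReduceTasks) % numReduceTasks;
      }
    }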
Hadoop FS Shell - command overwrite capability
I'm using the Hadoop FS commands to move files from my local machine into the Hadoop dfs. I'd like a way to force a write to the dfs even if a file of the same name exists. Ideally I'd like to use a -force switch or some such; e.g., hadoop dfs -copyFromLocal -force adirectory s3n://wholeinthebucket/ Is there a way to do this or does anyone know if this is in the future Hadoop plans? Thanks John SD
Control over max map/reduce tasks per job
All, I have a few relatively small clusters (5-20 nodes) and am having trouble keeping them loaded with my MR jobs. The primary issue is that I have different jobs that have drastically different patterns. I have jobs that read/write to/from HBase or Hadoop with minimal logic (network throughput bound or io bound), others that perform crawling (network latency bound), and one huge parsing streaming job (very CPU bound, each task eats a core). I'd like to launch very large numbers of tasks for network latency bound jobs, however the large CPU bound job means I have to keep the max maps allowed per node low enough as to not starve the Datanode and Regionserver. I'm an HBase dev but not familiar enough with Hadoop MR code to even know what would be involved with implementing this. However, in talking with other users, it seems like this would be a well-received option. I wanted to ping the list before filing an issue because it seems like someone may have thought about this in the past. Thanks. Jonathan Gray
FileInputFormat directory traversal
Is there a reason FileInputFormat only traverses the first level of directories in its InputPaths? (i.e., given an InputPath of 'foo', it will get foo/* but not foo/bar/*). I wrote a full depth-first traversal in my custom InputFormat which I can offer as a patch. But to do it I had to duplicate the PathFilter classes in FileInputFormat which are marked private, so a mainline patch would also touch FileInputFormat. Ian
Chukwa documentation
Hi Everybody, I don't know if there is a mailing list for Chukwa, so I apologize in advance if this is not the right place to ask my questions. I have the following questions and comments: Configuring the collector and the agent was simple. However, there are other features that are not documented at all, like: - torque (Do I have to install torque first? Yes? No? And why?), - database (Do I have to have a DB?), - what is queueinfo.properties, and what kind of information does it provide? - and there is more stuff that I need to dig into the code to understand. Could somebody update the documentation for Chukwa? I hope I can get some help Xavier
How to use DBInputFormat?
In the setInput(...) function in DBInputFormat, there are two sets of arguments that one can use. 1. public static void setInput(JobConf job, Class<? extends DBWritable> inputClass, String tableName, String conditions, String orderBy, String... fieldNames) a) In this, do we necessarily have to give all the fieldNames (which are the column names, right?) that the table has, or do we need to specify only the ones that we want to extract? b) Do we have to have an orderBy or not necessarily? Does this relate to the primary key in the table in any way? 2. public static void setInput(JobConf job, Class<? extends DBWritable> inputClass, String inputQuery, String inputCountQuery) a) Is there any restriction on the kind of queries that this function can take in the inputQuery string? I am facing issues in getting this to work with an Oracle database and have no idea of how to debug it (an email sent earlier). Can anyone give me some inputs on this please? Thanks -Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz
HADOOP-2536 supports Oracle too?
Does the patch HADOOP-2536 support connecting to Oracle databases as well? Or is it just limited to MySQL? Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz
Re: HADOOP-2536 supports Oracle too?
HADOOP-2536 supports the JDBC connector interface. Any database that exposes a JDBC library will work - Aaron On Tue, Feb 3, 2009 at 6:17 PM, Amandeep Khurana ama...@gmail.com wrote: Does the patch HADOOP-2536 support connecting to Oracle databases as well? Or is it just limited to MySQL? Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz
Re: How to use DBInputFormat?
On Tue, Feb 3, 2009 at 5:49 PM, Amandeep Khurana ama...@gmail.com wrote: In the setInput(...) function in DBInputFormat, there are two sets of arguments that one can use. 1. public static void *setInput*(JobConf a) In this, do we necessarily have to give all the fieldNames (which are the column names right?) that the table has, or do we need to specify only the ones that we want to extract? You may specify only those columns that you are interested in. b) Do we have to have a orderBy or not necessarily? Does this relate to the primary key in the table in any ways? Conditions and order by are not necessary. a) Is there any restriction on the kind of queries that this function can take in the inputQuery string? I don't think so, but I don't use this method -- I just use the fieldNames and tableName method. I am facing issues in getting this to work with an Oracle database and have no idea of how to debug it (an email sent earlier). Can anyone give me some inputs on this please? Create a new table that has one column, put about five entries into that table, then try to get a map job working that outputs the values to a text file. If that doesn't work, post your code and errors.
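For the table-name/field-names variant, a bare-bones setup along the lines Kevin suggests might look like the sketch below (0.19 org.apache.hadoop.mapred.lib.db API; the driver class, JDBC URL, table and column names are placeholders). One thing worth checking with Oracle specifically is the paging query the stock record reader generates, which uses a LIMIT ... OFFSET clause that Oracle's SQL dialect does not accept and which can surface as exactly this kind of ORA-00933.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.db.DBConfiguration;
    import org.apache.hadoop.mapred.lib.db.DBInputFormat;
    import org.apache.hadoop.mapred.lib.db.DBWritable;

    public class DbJobSetup {

      // Minimal one-column record, per the "table with one column" suggestion.
      public static class MyRecord implements Writable, DBWritable {
        String contractNumber;

        public void readFields(ResultSet rs) throws SQLException {
          contractNumber = rs.getString(1);
        }
        public void write(PreparedStatement ps) throws SQLException {
          // unused for an input-only job
        }
        public void readFields(DataInput in) throws IOException {
          contractNumber = Text.readString(in);
        }
        public void write(DataOutput out) throws IOException {
          Text.writeString(out, contractNumber);
        }
      }

      public static void configure(JobConf conf) {
        conf.setInputFormat(DBInputFormat.class);
        DBConfiguration.configureDB(conf,
            "oracle.jdbc.driver.OracleDriver",
            "jdbc:oracle:thin:@dbhost:1521:SID", // placeholder URL
            "user", "password");
        // Only the listed columns are selected; conditions and orderBy may be null.
        DBInputFormat.setInput(conf, MyRecord.class,
            "MY_TABLE", null /* conditions */, null /* orderBy */,
            "CONTRACT_NUMBER");
      }
    }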
Re: How to use DBInputFormat?
Thanks Kevin I couldnt get it work. Here's the error I get: bin/hadoop jar ~/dbload.jar LoadTable1 09/02/03 19:21:17 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId= 09/02/03 19:21:20 INFO mapred.JobClient: Running job: job_local_0001 09/02/03 19:21:21 INFO mapred.JobClient: map 0% reduce 0% 09/02/03 19:21:22 INFO mapred.MapTask: numReduceTasks: 0 09/02/03 19:21:24 WARN mapred.LocalJobRunner: job_local_0001 java.io.IOException: ORA-00933: SQL command not properly ended at org.apache.hadoop.mapred.lib.db.DBInputFormat.getRecordReader(DBInputFormat.java:289) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138) java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217) at LoadTable1.run(LoadTable1.java:130) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at LoadTable1.main(LoadTable1.java:107) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.apache.hadoop.util.RunJar.main(RunJar.java:165) at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68) Exception closing file /user/amkhuran/contract_table/_temporary/_attempt_local_0001_m_00_0/part-0 java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:198) at org.apache.hadoop.hdfs.DFSClient.access$600(DFSClient.java:65) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3084) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3053) at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.close(DFSClient.java:942) at org.apache.hadoop.hdfs.DFSClient.close(DFSClient.java:210) at org.apache.hadoop.hdfs.DistributedFileSystem.close(DistributedFileSystem.java:243) at org.apache.hadoop.fs.FileSystem$Cache.closeAll(FileSystem.java:1413) at org.apache.hadoop.fs.FileSystem.closeAll(FileSystem.java:236) at org.apache.hadoop.fs.FileSystem$ClientFinalizer.run(FileSystem.java:221) Here's my code: public class LoadTable1 extends Configured implements Tool { // data destination on hdfs private static final String CONTRACT_OUTPUT_PATH = contract_table; // The JDBC connection URL and driver implementation class private static final String CONNECT_URL = jdbc:oracle:thin:@dbhost :1521:PSEDEV; private static final String DB_USER = user; private static final String DB_PWD = pass; private static final String DATABASE_DRIVER_CLASS = oracle.jdbc.driver.OracleDriver; private static final String CONTRACT_INPUT_TABLE = OSE_EPR_CONTRACT; private static final String [] CONTRACT_INPUT_TABLE_FIELDS = { PORTFOLIO_NUMBER, CONTRACT_NUMBER}; private static final String ORDER_CONTRACT_BY_COL = CONTRACT_NUMBER; static class ose_epr_contract implements Writable, DBWritable { String CONTRACT_NUMBER; public void readFields(DataInput in) throws IOException { this.CONTRACT_NUMBER = Text.readString(in); } public void write(DataOutput out) throws IOException { Text.writeString(out, this.CONTRACT_NUMBER); } public void readFields(ResultSet in_set) throws SQLException { this.CONTRACT_NUMBER = 
in_set.getString(1); } @Override public void write(PreparedStatement prep_st) throws SQLException { // TODO Auto-generated method stub } } public static class LoadMapper extends MapReduceBase implements MapperLongWritable, ose_epr_contract, Text, NullWritable { private static final char FIELD_SEPARATOR = 1; public void map(LongWritable arg0, ose_epr_contract arg1, OutputCollectorText, NullWritable arg2, Reporter arg3) throws IOException { StringBuilder sb = new StringBuilder(); ; sb.append(arg1.CONTRACT_NUMBER); arg2.collect(new Text (sb.toString()), NullWritable.get()); } } public static void main(String[] args) throws Exception { Class.forName(oracle.jdbc.driver.OracleDriver); int exit = ToolRunner.run(new LoadTable1(), args); } public int run(String[] arg0) throws Exception { JobConf conf = new
Re: How to use DBInputFormat?
The same query is working if I write a simple JDBC client and query the database. So, I'm probably doing something wrong in the connection settings. But the error looks to be on the query side more than the connection side. Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Tue, Feb 3, 2009 at 7:25 PM, Amandeep Khurana ama...@gmail.com wrote: Thanks Kevin I couldnt get it work. Here's the error I get: bin/hadoop jar ~/dbload.jar LoadTable1 09/02/03 19:21:17 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId= 09/02/03 19:21:20 INFO mapred.JobClient: Running job: job_local_0001 09/02/03 19:21:21 INFO mapred.JobClient: map 0% reduce 0% 09/02/03 19:21:22 INFO mapred.MapTask: numReduceTasks: 0 09/02/03 19:21:24 WARN mapred.LocalJobRunner: job_local_0001 java.io.IOException: ORA-00933: SQL command not properly ended at org.apache.hadoop.mapred.lib.db.DBInputFormat.getRecordReader(DBInputFormat.java:289) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138) java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217) at LoadTable1.run(LoadTable1.java:130) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at LoadTable1.main(LoadTable1.java:107) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.apache.hadoop.util.RunJar.main(RunJar.java:165) at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68) Exception closing file /user/amkhuran/contract_table/_temporary/_attempt_local_0001_m_00_0/part-0 java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:198) at org.apache.hadoop.hdfs.DFSClient.access$600(DFSClient.java:65) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3084) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3053) at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.close(DFSClient.java:942) at org.apache.hadoop.hdfs.DFSClient.close(DFSClient.java:210) at org.apache.hadoop.hdfs.DistributedFileSystem.close(DistributedFileSystem.java:243) at org.apache.hadoop.fs.FileSystem$Cache.closeAll(FileSystem.java:1413) at org.apache.hadoop.fs.FileSystem.closeAll(FileSystem.java:236) at org.apache.hadoop.fs.FileSystem$ClientFinalizer.run(FileSystem.java:221) Here's my code: public class LoadTable1 extends Configured implements Tool { // data destination on hdfs private static final String CONTRACT_OUTPUT_PATH = contract_table; // The JDBC connection URL and driver implementation class private static final String CONNECT_URL = jdbc:oracle:thin:@dbhost :1521:PSEDEV; private static final String DB_USER = user; private static final String DB_PWD = pass; private static final String DATABASE_DRIVER_CLASS = oracle.jdbc.driver.OracleDriver; private static final String CONTRACT_INPUT_TABLE = OSE_EPR_CONTRACT; private static final String [] CONTRACT_INPUT_TABLE_FIELDS = { PORTFOLIO_NUMBER, CONTRACT_NUMBER}; private static final String ORDER_CONTRACT_BY_COL = 
CONTRACT_NUMBER; static class ose_epr_contract implements Writable, DBWritable { String CONTRACT_NUMBER; public void readFields(DataInput in) throws IOException { this.CONTRACT_NUMBER = Text.readString(in); } public void write(DataOutput out) throws IOException { Text.writeString(out, this.CONTRACT_NUMBER); } public void readFields(ResultSet in_set) throws SQLException { this.CONTRACT_NUMBER = in_set.getString(1); } @Override public void write(PreparedStatement prep_st) throws SQLException { // TODO Auto-generated method stub } } public static class LoadMapper extends MapReduceBase implements MapperLongWritable, ose_epr_contract, Text, NullWritable { private static final char FIELD_SEPARATOR = 1; public void map(LongWritable arg0, ose_epr_contract arg1, OutputCollectorText, NullWritable arg2, Reporter arg3)
Value-Only Reduce Output
Hello, I'm interested in a map-reduce flow where I output only values (no keys) in my reduce step. For example, imagine the canonical word-counting program where I'd like my output to be an unlabeled histogram of counts instead of (word, count) pairs. I'm using HadoopStreaming (specifically, I'm using the dumbo module to run my Python scripts). When I simulate the map reduce using pipes and sort in bash, it works fine. However, in Hadoop, if I output a value with no tabs, Hadoop appends a trailing \t, apparently interpreting my output as a (value, ) KV pair. I'd like to avoid outputting this trailing tab if possible. Is there a command line option that could be used to effect this? More generally, is there something wrong with outputting arbitrary strings, instead of key-value pairs, in your reduce step?
Re: hadoop dfs -test question (with a bit o' Ruby)
I'm no Ruby programmer, but don't you need a call to system() instead of the backtick operator here? Appears that the backtick operator returns STDOUT instead of the return value: http://hans.fugal.net/blog/2007/11/03/backticks-2-0 Norbert On Tue, Feb 3, 2009 at 6:03 PM, S D sd.codewarr...@gmail.com wrote: Coincidentally I'm aware of the AWS::S3 package in Ruby but I'd prefer to avoid that... On Tue, Feb 3, 2009 at 5:02 PM, S D sd.codewarr...@gmail.com wrote: I'm at my wit's end. I want to do a simple test for the existence of a file on Hadoop. Here is the Ruby code I'm trying: val = `hadoop dfs -test -e s3n://holeinthebucket/user/hadoop/file.txt` puts Val: #{val} if val == 1 // do one thing else // do another end I never get a return value for the -test command. Is there something I'm missing? How should the return value be retrieved? Thanks, SD
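If shelling out stays awkward, the same existence check can also be done directly against the FileSystem API from Java rather than through the hadoop dfs shell; the s3n URI below is the one from the original mail, reused purely as an example, and it assumes the AWS credentials are available in the configuration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ExistsCheck {
      public static void main(String[] args) throws Exception {
        Path p = new Path("s3n://holeinthebucket/user/hadoop/file.txt");
        Configuration conf = new Configuration(); // expects fs.s3n credentials to be set
        FileSystem fs = p.getFileSystem(conf);    // picks the s3n implementation
        System.out.println(fs.exists(p) ? "exists" : "missing");
      }
    }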
Re: Value-Only Reduce Output
If you are using the standard TextOutputFormat, and the output collector is passed a null for the value, there will not be a trailing tab character added to the output line. output.collect( key, null ); Will give you the behavior you are looking for if your configuration is as I expect. On Tue, Feb 3, 2009 at 7:49 PM, Jack Stahl j...@yelp.com wrote: Hello, I'm interested in a map-reduce flow where I output only values (no keys) in my reduce step. For example, imagine the canonical word-counting program where I'd like my output to be an unlabeled histogram of counts instead of (word, count) pairs. I'm using HadoopStreaming (specifically, I'm using the dumbo module to run my python scripts). When I simulate the map reduce using pipes and sort in bash, it works fine. However, in Hadoop, if I output a value with no tabs, Hadoop appends a trailing \t, apparently interpreting my output as a (value, ) KV pair. I'd like to avoid outputing this trailing tab if possible. Is there a command line option that could be use to effect this? More generally, is there something wrong with outputing arbitrary strings, instead of key-value pairs, in your reduce step?
Re: Value-Only Reduce Output
Oops, you are using streaming, and I am not familiar with it. As a terrible hack, you could set mapred.textoutputformat.separator to the empty string in your configuration. On Tue, Feb 3, 2009 at 9:26 PM, jason hadoop jason.had...@gmail.com wrote: If you are using the standard TextOutputFormat, and the output collector is passed a null for the value, there will not be a trailing tab character added to the output line. output.collect( key, null ); Will give you the behavior you are looking for if your configuration is as I expect. On Tue, Feb 3, 2009 at 7:49 PM, Jack Stahl j...@yelp.com wrote: Hello, I'm interested in a map-reduce flow where I output only values (no keys) in my reduce step. For example, imagine the canonical word-counting program where I'd like my output to be an unlabeled histogram of counts instead of (word, count) pairs. I'm using HadoopStreaming (specifically, I'm using the dumbo module to run my Python scripts). When I simulate the map reduce using pipes and sort in bash, it works fine. However, in Hadoop, if I output a value with no tabs, Hadoop appends a trailing \t, apparently interpreting my output as a (value, ) KV pair. I'd like to avoid outputting this trailing tab if possible. Is there a command line option that could be used to effect this? More generally, is there something wrong with outputting arbitrary strings, instead of key-value pairs, in your reduce step?
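In a job driver, that hack is a one-line configuration change, shown below; it assumes the TextOutputFormat in your release actually reads this property, and note that streaming layers its own key/value handling on top, so it is worth verifying against your version.

    import org.apache.hadoop.mapred.JobConf;

    public class ValueOnlyOutputSetup {
      public static void configure(JobConf conf) {
        // With an empty separator (and an empty or null key), each output
        // line contains only the value, with no trailing tab.
        conf.set("mapred.textoutputformat.separator", "");
      }
    }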
Re: Control over max map/reduce tasks per job
This sounds good enough for a JIRA ticket to me. -Bryan On Feb 3, 2009, at 11:44 AM, Jonathan Gray wrote: Chris, For my specific use cases, it would be best to be able to set N mappers/reducers per job per node (so I can explicitly say, run at most 2 at a time of this CPU bound task on any given node). However, the other way would work as well (on 10 node system, would set job to max 20 tasks at a time globally), but opens up the possibility that a node could be assigned more than 2 of that task. I would work with whatever is easiest to implement as either would be a vast improvement for me (can run high numbers of network latency bound tasks without fear of cpu bound tasks killing the cluster). JG -Original Message- From: Chris K Wensel [mailto:ch...@wensel.net] Sent: Tuesday, February 03, 2009 11:34 AM To: core-user@hadoop.apache.org Subject: Re: Control over max map/reduce tasks per job Hey Jonathan Are you looking to limit the total number of concurrent mapper/ reducers a single job can consume cluster wide, or limit the number per node? That is, you have X mappers/reducers, but only can allow N mappers/ reducers to run at a time globally, for a given job. Or, you are cool with all X running concurrently globally, but want to guarantee that no node can run more than N tasks from that job? Or both? just reconciling the conversation we had last week with this thread. ckw On Feb 3, 2009, at 11:16 AM, Jonathan Gray wrote: All, I have a few relatively small clusters (5-20 nodes) and am having trouble keeping them loaded with my MR jobs. The primary issue is that I have different jobs that have drastically different patterns. I have jobs that read/write to/from HBase or Hadoop with minimal logic (network throughput bound or io bound), others that perform crawling (network latency bound), and one huge parsing streaming job (very CPU bound, each task eats a core). I'd like to launch very large numbers of tasks for network latency bound jobs, however the large CPU bound job means I have to keep the max maps allowed per node low enough as to not starve the Datanode and Regionserver. I'm an HBase dev but not familiar enough with Hadoop MR code to even know what would be involved with implementing this. However, in talking with other users, it seems like this would be a well-received option. I wanted to ping the list before filing an issue because it seems like someone may have thought about this in the past. Thanks. Jonathan Gray -- Chris K Wensel ch...@wensel.net http://www.cascading.org/ http://www.scaleunlimited.com/
Re: Hadoop 0.19, Cascading 1.0 and MultipleOutputs problem
Mikhail, You are right, please open a Jira on this. Alejandro On Wed, Jan 28, 2009 at 9:23 PM, Mikhail Yakshin greycat.na@gmail.com wrote: Hi, We have a system based on Hadoop 0.18 / Cascading 0.8.1 and now I'm trying to port it to Hadoop 0.19 / Cascading 1.0. The first serious problem I've run into is that we're extensively using MultipleOutputs in our jobs dealing with sequence files that store Cascading's Tuples. Since Cascading 0.9, Tuples stopped being WritableComparable and implemented the generic Hadoop serialization interface and framework. However, in Hadoop 0.19, MultipleOutputs requires use of the older WritableComparable interface. Thus, trying to do something like: MultipleOutputs.addNamedOutput(conf, "output-name", MySpecialMultiSplitOutputFormat.class, Tuple.class, Tuple.class); mos = new MultipleOutputs(conf); ... mos.getCollector("output-name", reporter).collect(tuple1, tuple2); yields an error: java.lang.RuntimeException: java.lang.RuntimeException: class cascading.tuple.Tuple not org.apache.hadoop.io.WritableComparable at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:752) at org.apache.hadoop.mapred.lib.MultipleOutputs.getNamedOutputKeyClass(MultipleOutputs.java:252) at org.apache.hadoop.mapred.lib.MultipleOutputs$InternalFileOutputFormat.getRecordWriter(MultipleOutputs.java:556) at org.apache.hadoop.mapred.lib.MultipleOutputs.getRecordWriter(MultipleOutputs.java:425) at org.apache.hadoop.mapred.lib.MultipleOutputs.getCollector(MultipleOutputs.java:511) at org.apache.hadoop.mapred.lib.MultipleOutputs.getCollector(MultipleOutputs.java:476) at my.namespace.MyReducer.reduce(MyReducer.java:xxx) Is there any known workaround for that? Is there any progress on making MultipleOutputs use generic Hadoop serialization? -- WBR, Mikhail Yakshin
easy way to create IntelliJ project for Hadoop trunk?
With Ivy fetching JARs as part of the build and several src dirs, creating an IDE (IntelliJ in my case) project becomes a pain. Any easy way of doing it? Thxs. Alejandro