A record version mismatch occured. Expecting v6, found v32
Hi, I tried to use SequenceFileInputFormat, for this I appended SEQ as first bytes of my binary files (with hex editor). but I get this exception: A record version mismatch occured. Expecting v6, found v32 at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1460) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1428) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1417) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1412) at org.apache.hadoop.mapred.SequenceFileRecordReader.init(SequenceFileRecordReader.java:43) at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:58) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321) at org.apache.hadoop.mapred.Child.main(Child.java:155) What could it be? Is it not enough just to add SEQ to binary files? I use Hadoop v.0.19.0 . Thanks in advance.. Rasit different *version* of *Hadoop* between your server and your client. -- M. Raşit ÖZDAŞ
Re: best way to copy all files from a file system to hdfs
Is there any reason why it has to be a single SequenceFile? You could write a local program to write several block compressed SequenceFiles in parallel (to HDFS), each containing a portion of the files on your PC. Tom On Mon, Feb 2, 2009 at 3:24 PM, Mark Kerzner markkerz...@gmail.com wrote: Truly, I do not see any advantage to doing this, as opposed to writing (Java) code which will copy files to HDFS, because then tarring becomes my bottleneck. Unless I write code measure the file sizes and prepare pointers for multiple tarring tasks. It becomes pretty complex though, and I thought of something simple. I might as well accept that copying one hard drive to HDFS is not going to be parallelized. Mark On Sun, Feb 1, 2009 at 11:44 PM, Philip (flip) Kromer f...@infochimps.orgwrote: Could you tar.bz2 them up (setting up the tar so that it made a few dozen files), toss them onto the HDFS, and use http://stuartsierra.com/2008/04/24/a-million-little-files to go into SequenceFile? This lets you preserve the originals and do the sequence file conversion across the cluster. It's only really helpful, of course, if you also want to prepare a .tar.bz2 so you can clear out the sprawl flip On Sun, Feb 1, 2009 at 11:22 PM, Mark Kerzner markkerz...@gmail.com wrote: Hi, I am writing an application to copy all files from a regular PC to a SequenceFile. I can surely do this by simply recursing all directories on my PC, but I wonder if there is any way to parallelize this, a MapReduce task even. Tom White's books seems to imply that it will have to be a custom application. Thank you, Mark -- http://www.infochimps.org Connected Open Free Data
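To make Tom's suggestion concrete, below is a minimal sketch of one such local writer: it packs a batch of local files into a single block-compressed SequenceFile on HDFS, keyed by file path, and you would run several instances in parallel, each over a different portion of the files. The class name, paths and the readAll helper are illustrative assumptions, not code from this thread.

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class BundleWriter {

  // Packs one portion of the local files into a block-compressed SequenceFile on HDFS.
  // Run several of these in parallel, one per part file.
  public static void writePart(String hdfsUri, String partName, File[] localFiles)
      throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(hdfsUri), conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(partName), Text.class, BytesWritable.class,
        SequenceFile.CompressionType.BLOCK);
    try {
      for (File f : localFiles) {
        byte[] bytes = readAll(f);
        writer.append(new Text(f.getPath()), new BytesWritable(bytes));
      }
    } finally {
      writer.close();
    }
  }

  // Slurps a local file into memory; fine for ordinary files, not for huge ones.
  private static byte[] readAll(File f) throws IOException {
    DataInputStream in = new DataInputStream(new FileInputStream(f));
    try {
      byte[] buf = new byte[(int) f.length()];
      in.readFully(buf);
      return buf;
    } finally {
      in.close();
    }
  }
}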
Re: best way to copy all files from a file system to hdfs
No, no reason for a single file - just a little simpler to think about. By the way, can multiple MapReduce workers read the same SequenceFile simultaneously? On Mon, Feb 2, 2009 at 9:42 AM, Tom White t...@cloudera.com wrote: Is there any reason why it has to be a single SequenceFile? You could write a local program to write several block compressed SequenceFiles in parallel (to HDFS), each containing a portion of the files on your PC. Tom On Mon, Feb 2, 2009 at 3:24 PM, Mark Kerzner markkerz...@gmail.com wrote: Truly, I do not see any advantage to doing this, as opposed to writing (Java) code which will copy files to HDFS, because then tarring becomes my bottleneck. Unless I write code measure the file sizes and prepare pointers for multiple tarring tasks. It becomes pretty complex though, and I thought of something simple. I might as well accept that copying one hard drive to HDFS is not going to be parallelized. Mark On Sun, Feb 1, 2009 at 11:44 PM, Philip (flip) Kromer f...@infochimps.orgwrote: Could you tar.bz2 them up (setting up the tar so that it made a few dozen files), toss them onto the HDFS, and use http://stuartsierra.com/2008/04/24/a-million-little-files to go into SequenceFile? This lets you preserve the originals and do the sequence file conversion across the cluster. It's only really helpful, of course, if you also want to prepare a .tar.bz2 so you can clear out the sprawl flip On Sun, Feb 1, 2009 at 11:22 PM, Mark Kerzner markkerz...@gmail.com wrote: Hi, I am writing an application to copy all files from a regular PC to a SequenceFile. I can surely do this by simply recursing all directories on my PC, but I wonder if there is any way to parallelize this, a MapReduce task even. Tom White's books seems to imply that it will have to be a custom application. Thank you, Mark -- http://www.infochimps.org Connected Open Free Data
Re: A record version mismatch occured. Expecting v6, found v32
The SequenceFile format is described here: http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/io/SequenceFile.html. The format of the keys and values depends on the serialization classes used. For example, BytesWritable writes out the length of its byte array followed by the actual bytes in the array (see the write() method in BytesWritable). Hope this helps. Tom On Mon, Feb 2, 2009 at 3:21 PM, Rasit OZDAS rasitoz...@gmail.com wrote: I tried to use SequenceFile.Writer to convert my binaries into Sequence Files, I read the binary data with FileInputStream, getting all bytes with reader.read(byte[]) , wrote it to a file with SequenceFile.Writer, with parameters NullWritable as key, BytesWritable as value. But the content changes, (I can see that by converting to Base64) Binary File: 73 65 65 65 81 65 65 65 65 65 81 81 65 119 84 81 65 111 67 81 65 52 57 81 65 103 54 81 65 65 97 81 65 65 65 81 ... Sequence File: 73 65 65 65 65 69 65 65 65 65 65 65 65 69 66 65 65 77 66 77 81 103 67 103 67 69 77 65 52 80 86 67 65 73 68 114 ... Thanks for any points.. Rasit 2009/2/2 Rasit OZDAS rasitoz...@gmail.com Hi, I tried to use SequenceFileInputFormat, for this I appended SEQ as first bytes of my binary files (with hex editor). but I get this exception: A record version mismatch occured. Expecting v6, found v32 at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1460) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1428) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1417) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1412) at org.apache.hadoop.mapred.SequenceFileRecordReader.init(SequenceFileRecordReader.java:43) at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:58) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321) at org.apache.hadoop.mapred.Child.main(Child.java:155) What could it be? Is it not enough just to add SEQ to binary files? I use Hadoop v.0.19.0 . Thanks in advance.. Rasit different *version* of *Hadoop* between your server and your client. -- M. Raşit ÖZDAŞ -- M. Raşit ÖZDAŞ
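To illustrate Tom's point about why the raw bytes differ: a SequenceFile adds its own header, sync markers and record lengths, and BytesWritable prepends the length of its byte array, so the file on disk will never match the original binary byte for byte even though the payload is preserved. A minimal round-trip sketch, assuming a NullWritable key and BytesWritable value as in Rasit's setup (the path and sample bytes are made up):

import java.io.IOException;
import java.net.URI;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;

public class RoundTripCheck {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode/"), conf);
    Path path = new Path("/tmp/roundtrip.seq");

    byte[] original = new byte[] { 1, 2, 3, 4, 5 };  // stand-in for the binary file's bytes

    // Write one record: NullWritable key, BytesWritable value.
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, path, NullWritable.class, BytesWritable.class);
    try {
      writer.append(NullWritable.get(), new BytesWritable(original));
    } finally {
      writer.close();
    }

    // Read it back: the container bytes differ from the input, the payload does not.
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      BytesWritable value = new BytesWritable();
      reader.next(NullWritable.get(), value);
      // getBytes() returns the backing array; only the first getLength() bytes are valid.
      byte[] payload = Arrays.copyOf(value.getBytes(), value.getLength());
      System.out.println("payload intact: " + Arrays.equals(original, payload));
    } finally {
      reader.close();
    }
  }
}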
Re: setting JAVA_HOME...
It's exactly as Steve says. Sorry, I should have been clearer in my last e-mail. I have had really bad experiences with any other jdk (the default on ubuntu, gcj, etc) than Sun's. So it may be easier to do that. To stop all hadoop processes use: bin/stop-all.sh To start them, use: bin/start-all.sh Whenever you make a change in your conf/hadoop-env.sh or conf/hadoop-site.xml files, you will need to restart hadoop using the above two scripts. All the best, -SM On Mon, Feb 2, 2009 at 4:40 AM, Steve Loughran ste...@apache.org wrote: haizhou zhao wrote: hi Sandy, Every time I change the conf, i have to do the following to things: 1. kill all hadoop processes 2. manually delelte all the file under hadoop.tmp.dir to make sure hadoop runs correctly, otherwise it wont work. Is this cause'd by my using a JDK instead of sun java? No, you need to do that to get configuration changes picked up. There are scripts in hadoop/bin to help you and what do you mean by sun-java, please? Sandy means * sun-java6-jdk: Sun's released JDK * default-jdk ubuntu chooses. On 8.10, it is open-jdk * open-jdk-6-jdk: the full open source version of the JDK. Worse font rendering code, but comes with more source Others * Oracle JRockit: good 64-bit memory management, based on the sun JDK unsupported * IBM JVM unsupported. Based on the sun JDK * Apache Harmony: clean room rewrite of everything. unsupported * Kaffe. unsupported * Gcj. unsupported type java -version to get your java version Sun java version 1.6.0_10 Java(TM) SE Runtime Environment (build 1.6.0_10-b33) Java HotSpot(TM) Server VM (buld 11.0-b14, mixed mode JRockit: java version 1.6.0_02 Java(TM) SE Runtime Environment (build 1.6.0_02-b05) BEA JRockit(R) (build R27.4.0-90-89592-1.6.0_02-20070928-1715-linux-x86_64, compiled mode) 2009/1/31 Sandy snickerdoodl...@gmail.com Hi Zander, Do not use jdk. Horrific things happen. You must use sun java in order to use hadoop.
SequenceFiles, checkpoints, block size (Was: How to flush SequenceFile.Writer?)
Let me rephrase this problem... as stated below, when I start writing to a SequenceFile from an HDFS client, nothing is visible in HDFS until I've written 64M of data. This presents three problems: fsck reports the file system as corrupt until the first block is finally written out, the presence of the file (without any data) seems to blow up my mapred jobs that try to make use of it under my input path, and finally, I want to basically flush every 15 minutes or so so I can mapred the latest data. I don't see any programmatic way to force the file to flush in 17.2. Additionally, dfs.checkpoint.period does not seem to be obeyed. Does that not do what I think it does? What controls the 64M limit, anyway? Is it dfs.checkpoint.size or dfs.block.size? Is the buffering happening on the client, or on data nodes? Or in the namenode? It seems really bad that a SequenceFile, upon creation, is in an unusable state from the perspective of a mapred job, and also leaves fsck in a corrupt state. Surely I must be doing something wrong... but what? How can I ensure that a SequenceFile is immediately usable (but empty) on creation, and how can I make things flush on some regular time interval? Thanks, Brian On Thu, Jan 29, 2009 at 4:17 PM, Brian Long br...@dotspots.com wrote: I have a SequenceFile.Writer that I obtained via SequenceFile.createWriter and write to using append(key, value). Because the writer volume is low, it's not uncommon for it to take over a day for my appends to finally be flushed to HDFS (e.g. the new file will sit at 0 bytes for over a day). Because I am running map/reduce tasks on this data multiple times a day, I want to flush the sequence file so the mapred jobs can pick it up when they run. What's the right way to do this? I'm assuming it's a fairly common use case. Also -- are writes to the sequence files atomic? (e.g. if I am actively appending to a sequence file, is it always safe to read from that same file in a mapred job?) To be clear, I want the flushing to be time based (controlled explicitly by the app), not size based. Will this create waste in HDFS somehow? Thanks, Brian
Re: setting JAVA_HOME...
haizhou zhao wrote: hi Sandy, Every time I change the conf, I have to do the following two things: 1. kill all hadoop processes 2. manually delete all the files under hadoop.tmp.dir to make sure hadoop runs correctly, otherwise it won't work. Is this caused by my using a JDK instead of Sun Java? No, you need to do that to get configuration changes picked up. There are scripts in hadoop/bin to help you and what do you mean by sun-java, please? Sandy means * sun-java6-jdk: Sun's released JDK * default-jdk: what Ubuntu chooses. On 8.10, it is open-jdk * open-jdk-6-jdk: the full open source version of the JDK. Worse font rendering code, but comes with more source Others * Oracle JRockit: good 64-bit memory management, based on the Sun JDK. unsupported * IBM JVM: unsupported. Based on the Sun JDK * Apache Harmony: clean room rewrite of everything. unsupported * Kaffe. unsupported * Gcj. unsupported Type java -version to get your java version Sun: java version 1.6.0_10 Java(TM) SE Runtime Environment (build 1.6.0_10-b33) Java HotSpot(TM) Server VM (build 11.0-b14, mixed mode) JRockit: java version 1.6.0_02 Java(TM) SE Runtime Environment (build 1.6.0_02-b05) BEA JRockit(R) (build R27.4.0-90-89592-1.6.0_02-20070928-1715-linux-x86_64, compiled mode) 2009/1/31 Sandy snickerdoodl...@gmail.com Hi Zander, Do not use jdk. Horrific things happen. You must use sun java in order to use hadoop.
Re: problem with completion notification from block movement
On Sun, 2009-02-01 at 17:58 -0800, jason hadoop wrote: The Datanodes use multiple threads with locking, and one of the assumptions is that the block report (once per hour by default) takes little time. The datanode will pause while the block report is running, and if it happens to take a while weird things start to happen. Thank you for responding, this is very informative for us. Having looked through the source code with a co-worker regarding the periodic scan and then checking the logs once again, we find reports of this sort: BlockReport of 1158499 blocks got processed in 308860 msecs BlockReport of 1159840 blocks got processed in 237925 msecs BlockReport of 1161274 blocks got processed in 177853 msecs BlockReport of 1162408 blocks got processed in 285094 msecs BlockReport of 1164194 blocks got processed in 184478 msecs BlockReport of 1165673 blocks got processed in 226401 msecs The third of these exactly straddles the particular example timeline I discussed in my original email about this question. I suspect I'll find more of the same as I look through other related errors. --karl
Re: MapFile.Reader and seek
You can use the get() method to seek and retrieve the value. It will return null if the key is not in the map. Something like: Text value = (Text) indexReader.get(from, new Text()); while (value != null ...) Tom On Thu, Jan 29, 2009 at 10:45 PM, schnitzi mark.schnitz...@fastsearch.com wrote: Greetings all... I have a situation where I want to read a range of keys and values out of a MapFile. So I have something like this: MapFile.Reader indexReader = new MapFile.Reader(fs, path.toString(), configuration) boolean seekSuccess = indexReader.seek(from); boolean readSuccess = indexReader.next(keyValue, value); while (readSuccess ...) The problem seems to be that while seekSuccess is returning true, when I call next() to get the value there, it's returning the value *after* the key that I called seek() on. So if, say, my keys are Text(id0) through Text(id9), and I seek for Text(id3), calling next() will return Text(id4) and its associated value, not Text(id3). I would expect next() to return the key/value at the seek location, not the one after it. Am I doing something wrong? Otherwise, what good is seek(), really? -- View this message in context: http://www.nabble.com/MapFile.Reader-and-seek-tp21737717p21737717.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
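A sketch of the range scan Tom describes, wrapped in a method for completeness. get() consumes the entry it positions on, so the first next() afterwards returns the following pair, which is consistent with the behaviour seen with seek() above. The class name, the inclusive stop key and the choice to return early when the start key is absent are assumptions for illustration.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileRangeScan {

  // Prints every key/value pair from 'from' (inclusive) up to 'to' (inclusive).
  public static void scan(FileSystem fs, String dir, Configuration conf, Text from, Text to)
      throws IOException {
    MapFile.Reader reader = new MapFile.Reader(fs, dir, conf);
    try {
      Text key = new Text(from);
      // get() positions the reader at 'from' and returns its value, or null if absent.
      Text value = (Text) reader.get(key, new Text());
      while (value != null && key.compareTo(to) <= 0) {
        System.out.println(key + "\t" + value);
        key = new Text();
        value = new Text();
        if (!reader.next(key, value)) {  // advance; false means end of the map
          break;
        }
      }
    } finally {
      reader.close();
    }
  }
}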
Re: best way to copy all files from a file system to hdfs
Truly, I do not see any advantage to doing this, as opposed to writing (Java) code which will copy files to HDFS, because then tarring becomes my bottleneck. Unless I write code measure the file sizes and prepare pointers for multiple tarring tasks. It becomes pretty complex though, and I thought of something simple. I might as well accept that copying one hard drive to HDFS is not going to be parallelized. Mark On Sun, Feb 1, 2009 at 11:44 PM, Philip (flip) Kromer f...@infochimps.orgwrote: Could you tar.bz2 them up (setting up the tar so that it made a few dozen files), toss them onto the HDFS, and use http://stuartsierra.com/2008/04/24/a-million-little-files to go into SequenceFile? This lets you preserve the originals and do the sequence file conversion across the cluster. It's only really helpful, of course, if you also want to prepare a .tar.bz2 so you can clear out the sprawl flip On Sun, Feb 1, 2009 at 11:22 PM, Mark Kerzner markkerz...@gmail.com wrote: Hi, I am writing an application to copy all files from a regular PC to a SequenceFile. I can surely do this by simply recursing all directories on my PC, but I wonder if there is any way to parallelize this, a MapReduce task even. Tom White's books seems to imply that it will have to be a custom application. Thank you, Mark -- http://www.infochimps.org Connected Open Free Data
Re: job management in Hadoop
Thanks. I see that ACL is implemented in 0.19.0. I think that's only for job management from the command line, right? Is there any ACL for the web interface? Bill On Fri, Jan 30, 2009 at 6:23 PM, Bhupesh Bansal bban...@linkedin.com wrote: Bill, Currently you can kill the job from the UI. You have to enable the config <name>webinterface.private.actions</name> in hadoop-default.xml, setting it to true. Best Bhupesh On 1/30/09 3:23 PM, Bill Au bill.w...@gmail.com wrote: Thanks. Does anyone know if there is a plan to add this functionality to the web UI, the way job priority can be changed from both the command line and the web UI? Bill On Fri, Jan 30, 2009 at 5:54 PM, Arun C Murthy a...@yahoo-inc.com wrote: On Jan 30, 2009, at 2:41 PM, Bill Au wrote: Is there any way to cancel a job after it has been submitted? bin/hadoop job -kill jobid Arun
Re: HDFS Appends in 0.19.0
Hi Arifa, The O_APPEND flag is the subject of https://issues.apache.org/jira/browse/HADOOP-4494 Craig Arifa Nisar wrote: Hello All, I am using hadoop 0.19.0, whose release notes includes HADOOP-1700 Introduced append operation for HDFS files. I am trying to test this new feature using my test program. I have experienced that O_APPEND flag added in hdfsopen() is ignored by libhdfs. Also, only WRONLY and RDONLY are defined in hdfs.h. Please let me know how to use append functionality in this release. Thanks, Arifa.
Re: decommissioned node showing up ad dead node in web based interface to namenode (dfshealth.jsp)
It looks like the behavior is the same with 0.18.2 and 0.19.0. Even though I removed the decommissioned node from the exclude file and run the refreshNode command, the decommissioned node still show up as a dead node. What I did noticed is that if I leave the decommissioned node in the exclude and restart HDFS, the node will show up as a dead node after restart. But then if I remove it from the exclude file and run the refreshNode command, it will disappear from the status page (dfshealth.jsp). So it looks like I will have to stop and start the entire cluster in order to get what I want. Bill On Thu, Jan 29, 2009 at 5:40 PM, Bill Au bill.w...@gmail.com wrote: Not sure why but this does not work for me. I am running 0.18.2. I ran hadoop dfsadmin -refreshNodes after removing the decommissioned node from the exclude file. It still shows up as a dead node. I also removed it from the slaves file and ran the refresh nodes command again. It still shows up as a dead node after that. I am going to upgrade to 0.19.0 to see if it makes any difference. Bill On Tue, Jan 27, 2009 at 7:01 PM, paul paulg...@gmail.com wrote: Once the nodes are listed as dead, if you still have the host names in your conf/exclude file, remove the entries and then run hadoop dfsadmin -refreshNodes. This works for us on our cluster. -paul On Tue, Jan 27, 2009 at 5:08 PM, Bill Au bill.w...@gmail.com wrote: I was able to decommission a datanode successfully without having to stop my cluster. But I noticed that after a node has been decommissioned, it shows up as a dead node in the web base interface to the namenode (ie dfshealth.jsp). My cluster is relatively small and losing a datanode will have performance impact. So I have a need to monitor the health of my cluster and take steps to revive any dead datanode in a timely fashion. So is there any way to altogether get rid of any decommissioned datanode from the web interace of the namenode? Or is there a better way to monitor the health of the cluster? Bill
Re: best way to copy all files from a file system to hdfs
Yes. SequenceFile is splittable, which means it can be broken into chunks, called splits, each of which can be processed by a separate map task. Tom On Mon, Feb 2, 2009 at 3:46 PM, Mark Kerzner markkerz...@gmail.com wrote: No, no reason for a single file - just a little simpler to think about. By the way, can multiple MapReduce workers read the same SequenceFile simultaneously? On Mon, Feb 2, 2009 at 9:42 AM, Tom White t...@cloudera.com wrote: Is there any reason why it has to be a single SequenceFile? You could write a local program to write several block compressed SequenceFiles in parallel (to HDFS), each containing a portion of the files on your PC. Tom On Mon, Feb 2, 2009 at 3:24 PM, Mark Kerzner markkerz...@gmail.com wrote: Truly, I do not see any advantage to doing this, as opposed to writing (Java) code which will copy files to HDFS, because then tarring becomes my bottleneck. Unless I write code measure the file sizes and prepare pointers for multiple tarring tasks. It becomes pretty complex though, and I thought of something simple. I might as well accept that copying one hard drive to HDFS is not going to be parallelized. Mark On Sun, Feb 1, 2009 at 11:44 PM, Philip (flip) Kromer f...@infochimps.orgwrote: Could you tar.bz2 them up (setting up the tar so that it made a few dozen files), toss them onto the HDFS, and use http://stuartsierra.com/2008/04/24/a-million-little-files to go into SequenceFile? This lets you preserve the originals and do the sequence file conversion across the cluster. It's only really helpful, of course, if you also want to prepare a .tar.bz2 so you can clear out the sprawl flip On Sun, Feb 1, 2009 at 11:22 PM, Mark Kerzner markkerz...@gmail.com wrote: Hi, I am writing an application to copy all files from a regular PC to a SequenceFile. I can surely do this by simply recursing all directories on my PC, but I wonder if there is any way to parallelize this, a MapReduce task even. Tom White's books seems to imply that it will have to be a custom application. Thank you, Mark -- http://www.infochimps.org Connected Open Free Data
Re: Hadoop Streaming Semantics
Thanks for your response. I'm using version 0.19.0 of Hadoop. I tried your suggestion. Here is the line I use to invoke Hadoop hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.19.0-streaming.jar \\ -input /user/hadoop/hadoop-input/inputFile.txt \\ -output /user/hadoop/hadoop-output \\ -mapper map-script.sh \\ -file map-script.sh \\ -file additional-script.rb \\ # Called by map-script.sh -file utils.rb \\ -file env.sh \\ -file aws-s3-credentials-file \\# For permissions to use AWS::S3 -jobconf mapred.reduce.tasks=0 \\ -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat Everything works fine if the -inputformat switch is not included but when I include it I get the following message: ERROR streaming.StreamJob: Job not Successful! and a Runtime exception shows up in the jobtracker log: PipeMapRed.waitOutputThreads(): subprocess failed with code 1 My map functions read each line of the input file and create a directory (one for each line) on Hadoop (in our case S3 Native) in which corresponding data is produced and stored. The name of the created directories are based on the contents of the corresponding line. When I include the -inputformat line above I've noticed that instead of the directories I'm expecting (named after the data found in the input file), the directories are given seemingly arbitrary numeric names; e.g., when the input file contained four lines of data, the directories were named: 0, 273, 546 and 819. Any thoughts? John On Sun, Feb 1, 2009 at 11:00 PM, Amareshwari Sriramadasu amar...@yahoo-inc.com wrote: Which version of hadoop are you using? You can directly use -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat for your streaming job. You need not include it in your streaming jar. -Amareshwari S D wrote: Thanks for your response Amereshwari. I'm unclear on how to take advantage of NLineInputFormat with Hadoop Streaming. Is the idea that I modify the streaming jar file (contrib/streaming/hadoop-version-streaming.jar) to include the NLineInputFormat class and then pass a command line configuration param to indicate that NLineInputFormat should be used? If this is the proper approach, can you point me to an example of what kind of param should be specified? I appreciate your help. Thanks, SD On Thu, Jan 29, 2009 at 10:49 PM, Amareshwari Sriramadasu amar...@yahoo-inc.com wrote: You can use NLineInputFormat for this, which splits one line (N=1, by default) as one split. So, each map task processes one line. See http://hadoop.apache.org/core/docs/r0.19.0/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html -Amareshwari S D wrote: Hello, I have a clarifying question about Hadoop streaming. I'm new to the list and didn't see anything posted that covers my questions - my apologies if I overlooked a relevant post. I have an input file consisting of a list of files (one per line) that need to be processed independently of each other. The duration for processing each file is significant - perhaps an hour each. I'm using Hadoop streaming without a reduce function to process each file and save the results (back to S3 native in my case). To handle to long processing time of each file I've set mapred.task.timeout=0 and I have a pretty straight forward Ruby script reading from STDIN: STDIN.each_line do |line| # Get file from contents of line # Process file (long running) end Currently I'm using a cluster of 3 workers in which each worker can have up to 2 tasks running simultaneously. 
I've noticed that if I have a single input file with many lines (more than 6 given my cluster), then not all workers will be allocated tasks; I've noticed two workers being allocated one task each and the other worker sitting idly. If I split my input file into multiple files (at least 6) then all workers will be immediately allocated the maximum number of tasks that they can handle. My interpretation on this is fuzzy. It seems that Hadoop streaming will take separate input files and allocate a new task per file (up to the maximum constraint) but if given a single input file it is unclear as to whether a new task is allocated per file or line. My understanding of Hadoop Java is that (unlike Hadoop streaming) when given a single input file, the file will be broken up into separate lines and the maximum number of map tasks will automagically be allocated to handle the lines of the file (assuming the use of TextInputFormat). Can someone clarify this? Thanks, SD
RE: HDFS Appends in 0.19.0
Does that mean libhdfs doesn't support append functionality in 0.19.0, but if I write a Java test program to test hdfs append functionality then it should work? Do I need to apply all the patches given at https://issues.apache.org/jira/browse/HADOOP-1700 to test append functionality using TestFileAppend*.java? Or just having hadoop 0.19.0 should work? Thanks, Arifa. -Original Message- From: Craig Macdonald [mailto:cra...@dcs.gla.ac.uk] Sent: Monday, February 02, 2009 3:58 PM To: core-user@hadoop.apache.org Subject: Re: HDFS Appends in 0.19.0 Hi Arifa, The O_APPEND flag is the subject of https://issues.apache.org/jira/browse/HADOOP-4494 Craig Arifa Nisar wrote: Hello All, I am using hadoop 0.19.0, whose release notes includes HADOOP-1700 Introduced append operation for HDFS files. I am trying to test this new feature using my test program. I have experienced that O_APPEND flag added in hdfsopen() is ignored by libhdfs. Also, only WRONLY and RDONLY are defined in hdfs.h. Please let me know how to use append functionality in this release. Thanks, Arifa.
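As a side note, for a pure Java test of the HADOOP-1700 append operation (separate from the libhdfs O_APPEND question), the FileSystem API in 0.19 has an append() method. A minimal sketch follows; the path is made up, and this says nothing about how robust append actually is in 0.19.0:

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendTest {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode/"), conf);
    Path path = new Path("/tmp/append-test.txt");

    // Create the file with some initial content.
    FSDataOutputStream out = fs.create(path);
    out.writeBytes("first line\n");
    out.close();

    // Re-open the same file for append and add more data at the end.
    FSDataOutputStream appended = fs.append(path);
    appended.writeBytes("second line\n");
    appended.close();
  }
}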
Hadoop User Group UK Meetup - April 14th
I've started organizing the next Hadoop meetup in London, UK. The date is April 14th and the presentations so far include: Michael Stack (Powerset): Apache HBase Isabel Drost (Neofonie): Introducing Apache Mahout Iadh Ounis and Craig Macdonalt (University of Glasgow): Terrier Paolo Castagna (HP): Having Fun with PageRank and MapReduce Keep an eye on the blog for updates: http://huguk.org/ Help in the form of sponsoring (venue, beer etc) would be much appreciated. Also let me know if you want to present. Personally I'd love to see presentations from other Hadoop related projects (pig, hive, hama etc). /Johan
Hadoop's reduce tasks freeze at 0%.
I'm a newbie in Hadoop, and I'm trying to follow the Hadoop quick start guide on the Hadoop homepage, but there are some problems... Downloading and unzipping Hadoop is done, and ssh operates successfully without a passphrase. When I execute the grep example attached to Hadoop, the map task is OK (it reaches 100%), but the reduce task freezes at 0% without any error message. I've waited for more than 1 hour, but it still freezes... The same job in standalone mode works fine. I tried it with versions 0.18.3 and 0.17.2.1; all of them had the same problem. Could you help me solve this problem? Additionally, I'm working on cloud infrastructure from GoGrid (Redhat), so disk space health is OK, and I've installed JDK 1.6.11 for Linux successfully. - KKwams
hadoop to ftp files into hdfs
Does any one have a good suggestion on how to submit a hadoop job that will split the ftp retrieval of a number of files for insertion into hdfs? I have been searching google for suggestions on this matter. Steve
Re: A record version mismatch occured. Expecting v6, found v32
I tried to use SequenceFile.Writer to convert my binaries into Sequence Files, I read the binary data with FileInputStream, getting all bytes with reader.read(byte[]) , wrote it to a file with SequenceFile.Writer, with parameters NullWritable as key, BytesWritable as value. But the content changes, (I can see that by converting to Base64) Binary File: 73 65 65 65 81 65 65 65 65 65 81 81 65 119 84 81 65 111 67 81 65 52 57 81 65 103 54 81 65 65 97 81 65 65 65 81 ... Sequence File: 73 65 65 65 65 69 65 65 65 65 65 65 65 69 66 65 65 77 66 77 81 103 67 103 67 69 77 65 52 80 86 67 65 73 68 114 ... Thanks for any points.. Rasit 2009/2/2 Rasit OZDAS rasitoz...@gmail.com Hi, I tried to use SequenceFileInputFormat, for this I appended SEQ as first bytes of my binary files (with hex editor). but I get this exception: A record version mismatch occured. Expecting v6, found v32 at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1460) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1428) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1417) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1412) at org.apache.hadoop.mapred.SequenceFileRecordReader.init(SequenceFileRecordReader.java:43) at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:58) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321) at org.apache.hadoop.mapred.Child.main(Child.java:155) What could it be? Is it not enough just to add SEQ to binary files? I use Hadoop v.0.19.0 . Thanks in advance.. Rasit different *version* of *Hadoop* between your server and your client. -- M. Raşit ÖZDAŞ -- M. Raşit ÖZDAŞ
Scale Unlimited Professionals Program
Hey All Just wanted to let everyone know that Scale Unlimited will start offering many of its courses heavily discounted, if not free, to independent consultants and contractors. http://www.scaleunlimited.com/programs We are doing this because we receive a number of consulting/ contracting opportunities that we wish to delegate back to trusted consultants and developers. But more importantly, many consultants don't have time to learn Hadoop and related technologies, so Hadoop is often overlooked on new projects. We would like to get more developers comfortable knowing when and when not to use Hadoop on a project, ultimately leading to more projects using Hadoop and to Hadoop becoming more stable and feature rich in the process. We plan to offer our Hadoop Boot Camp for FREE in the Bay Area in the next few weeks. If interested in participating, email me directly. http://www.scaleunlimited.com/courses/hadoop-boot-camp Note this offer is limited to professional independent consultants, contractors, and small boutique contracting firms that are looking to expand their tool base. We only ask for an industry standard referral fee for any projects that result from a referral, if any. To be added to our referral list or if you have a project that might benefit from Hadoop or related technologies, please email me directly. This course will also be announced for open public enrollment in the coming days. cheers, chris -- Chris K Wensel ch...@wensel.net http://www.cascading.org/ http://www.scaleunlimited.com/
Re: Transferring data between different Hadoop clusters
Thanks for your prompt reply. When using the command ./bin/hadoop distcp hftp://cluster1:50070/path hdfs://cluster2/path - Should this command be given in cluster1? - What does port 50070 specify? Is it the one in fs.default.name, or dfs.http.address? /Taeho On Mon, Feb 2, 2009 at 12:40 PM, Mark Chadwick mchadw...@invitemedia.comwrote: Taeho, The distcp command is perfect for this. If you're copying between two clusters running the same version of Hadoop, you can do something like: ./bin/hadoop distcp hdfs://cluster1/path hdfs://cluster2/path If you're copying between 0.18 and 0.19, the command will look like: ./bin/hadoop distcp hftp://cluster1:50070/path hdfs://cluster2/path Hope that helps, -Mark On Sun, Feb 1, 2009 at 9:48 PM, Taeho Kang tka...@gmail.com wrote: Dear all, There have been times where I needed to transfer some big data from one version of Hadoop cluster to another. (e.g. from hadoop 0.18 to hadoop 0.19 cluster) Other than copying files from one cluster to a local file system and upload it to another, is there a tool that does it? Thanks in advance, Regards, /Taeho
My tasktrackers keep getting lost...
I hope someone can help me out. I'm getting started with Hadoop, have written the firt part of my project (a custom InputFormat), and am now using that to test out my cluster setup. I'm running 0.19.0. I have five dual-core Linux workstations with most of a 250GB disk available for playing, and am controlling things from my Mac Pro. (This is not the production cluster, that hasn't been assembled yet. This is just to get the code working and figure out the bumps.) My test data is about 18GB of web pages, and the test app at the moment just counts the number of web pages in each bundle file. The map jobs run just fine, but when it gets into the reduce, the TaskTrackers all get lost to the JobTracker. I can't see why, because the TaskTrackers are all still running on the slaves. Also, the jobdetails URL starts returning an HTTP 500 error, although other links from that page still work. I've tried going onto the slaves and manually restarting the tasktrackers with hadoop-daemon.sh, and also turning on job restarting in the site conf and then running stop-mapred/start-mapred. The trackers start up and try to clean up and get going again, but they then just get lost again. Here's some error output from the master jobtracker: 2009-02-02 13:39:40,904 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'attempt_200902021252_0002_r_05_1' from 'tracker_darling:localhost.localdomain/127.0.0.1:58336' 2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker: attempt_200902021252_0002_m_004592_1 is 796370 ms debug. 2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker: Launching task attempt_200902021252_0002_m_004592_1 timed out. 2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker: attempt_200902021252_0002_m_004582_1 is 794199 ms debug. 2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker: Launching task attempt_200902021252_0002_m_004582_1 timed out. 
2009-02-02 13:41:22,271 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_cheyenne:localhost.localdomain/127.0.0.1:52769'; resending the previous 'lost' response 2009-02-02 13:41:22,272 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_tigris:localhost.localdomain/127.0.0.1:52808'; resending the previous 'lost' response 2009-02-02 13:41:22,272 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_monocacy:localhost.localdomain/127.0.0.1:54464'; Resending the previous 'lost' response 2009-02-02 13:41:22,298 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_129.6.101.41:127.0.0.1/127.0.0.1:58744'; resending the previous 'lost' response 2009-02-02 13:41:22,421 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_rhone:localhost.localdomain/127.0.0.1:45749'; resending the previous 'lost' response 2009-02-02 13:41:22,421 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 54311 caught: java.lang.NullPointerException at org.apache.hadoop.mapred.MapTask.write(MapTask.java:123) at org.apache.hadoop.mapred.LaunchTaskAction.write(LaunchTaskAction.java :48) at org.apache.hadoop.mapred.HeartbeatResponse.write(HeartbeatResponse.ja va:101) at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:1 59) at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:70) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:907) 2009-02-02 13:41:27,275 WARN org.apache.hadoop.mapred.JobTracker: Status from unknown Tracker : tracker_monocacy:localhost.localdomain/127.0.0.1:54464 And from a slave: 2009-02-02 13:26:39,440 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 129.6.101.18:50060, dest: 129.6.101.12:37304, bytes: 6, op: MAPRED_SHUFFLE, cliID: attempt_200902021252_0002_m_000111_0 2009-02-02 13:41:40,165 ERROR org.apache.hadoop.mapred.TaskTracker: Caught exception: java.io.IOException: Call to rogue/129.6.101.41:54311 failed on local exception: null at org.apache.hadoop.ipc.Client.call(Client.java:699) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216) at org.apache.hadoop.mapred.$Proxy4.heartbeat(Unknown Source) at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1164) at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:997) at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1678) at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2698) Caused by: java.io.EOFException at java.io.DataInputStream.readFully(DataInputStream.java:180) at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63) at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101) at org.apache.hadoop.io.UTF8.readChars(UTF8.java:211) at
Re: HDFS issues in 0.17.2.1 and 0.19.0 versions
Are you sure you were using 0.19 and not 0.20? For 0.17, please check that the configuration file hadoop-site.xml exists in your configuration directory, is not empty, and points to HDFS rather than the local file system, which it uses by default. In 0.17 all config variables were in a common file. 0.19 was the same. 0.20 changed it, so now we have hdfs-site.xml, core-site.xml, mapred-site.xml. See https://issues.apache.org/jira/browse/HADOOP-4631 Hope this helps. --Konstantin Shyam Sarkar wrote: Hello, I am trying to understand the clustering inside 0.17.2.1 as opposed to 0.19.0 versions. I am trying to create a directory inside 0.17.2.1 HDFS but it creates it in the Linux FS. However, I can do that in 0.19.0 without any problem. Can someone suggest what I should do for 0.17.2.1 so that I can create a directory in HDFS? Thanks, shyam.s.sar...@gmail.com
Book: Hadoop-The Definitive Guide
Hi, I am going through examples in this book (which I have obtained as an early draft from Safari), and they all work, with occasional fixes. However, the SequenceFileWriteDemo, even though it works without an error, does not show the created file when I use this command: hadoop fs -ls / I remember reading somewhere that the file needs to be at least 64 MB to be seen, or something to that effect. How can I see the created file? If you want the code (with a few minor changes):

public class SequenceFileWriteDemo {
  private static final String[] DATA = {
    "One, two, buckle my shoe",
    "Three, four, shut the door",
    "Five, six, pick up sticks",
    "Seven, eight, lay them straight",
    "Nine, ten, a big fat hen"
  };

  public static void main(String[] args) throws IOException {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    Path path = new Path(uri);
    IntWritable key = new IntWritable();
    Text value = new Text();
    SequenceFile.Writer writer = null;
    try {
      writer = SequenceFile.createWriter(fs, conf, path, key.getClass(), value.getClass());
      int n = 1;
      for (int i = 0; i < n; i++) {
        key.set(n - i);
        value.set(DATA[i % DATA.length]);
        if (i % 100 == 0) {
          System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value);
        }
        writer.append(key, value);
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}

Thank you, Mark
Re: Hadoop Streaming Semantics
S D wrote: Thanks for your response. I'm using version 0.19.0 of Hadoop. I tried your suggestion. Here is the line I use to invoke Hadoop hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.19.0-streaming.jar \\ -input /user/hadoop/hadoop-input/inputFile.txt \\ -output /user/hadoop/hadoop-output \\ -mapper map-script.sh \\ -file map-script.sh \\ -file additional-script.rb \\ # Called by map-script.sh -file utils.rb \\ -file env.sh \\ -file aws-s3-credentials-file \\# For permissions to use AWS::S3 -jobconf mapred.reduce.tasks=0 \\ -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat Everything works fine if the -inputformat switch is not included but when I include it I get the following message: ERROR streaming.StreamJob: Job not Successful! and a Runtime exception shows up in the jobtracker log: PipeMapRed.waitOutputThreads(): subprocess failed with code 1 My map functions read each line of the input file and create a directory (one for each line) on Hadoop (in our case S3 Native) in which corresponding data is produced and stored. The name of the created directories are based on the contents of the corresponding line. When I include the -inputformat line above I've noticed that instead of the directories I'm expecting (named after the data found in the input file), the directories are given seemingly arbitrary numeric names; e.g., when the input file contained four lines of data, the directories were named: 0, 273, 546 and 819. LineRecordReader reads line as VALUE and the KEY is offset in the file. Looks like your directories are getting named with KEY. But I don't see any reason for that, because it is working fine with TextInputFormat (both TextInFormat and NLineInputFormat use LineRecordReader.) -Amareshwari Any thoughts? John On Sun, Feb 1, 2009 at 11:00 PM, Amareshwari Sriramadasu amar...@yahoo-inc.com wrote: Which version of hadoop are you using? You can directly use -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat for your streaming job. You need not include it in your streaming jar. -Amareshwari S D wrote: Thanks for your response Amereshwari. I'm unclear on how to take advantage of NLineInputFormat with Hadoop Streaming. Is the idea that I modify the streaming jar file (contrib/streaming/hadoop-version-streaming.jar) to include the NLineInputFormat class and then pass a command line configuration param to indicate that NLineInputFormat should be used? If this is the proper approach, can you point me to an example of what kind of param should be specified? I appreciate your help. Thanks, SD On Thu, Jan 29, 2009 at 10:49 PM, Amareshwari Sriramadasu amar...@yahoo-inc.com wrote: You can use NLineInputFormat for this, which splits one line (N=1, by default) as one split. So, each map task processes one line. See http://hadoop.apache.org/core/docs/r0.19.0/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html -Amareshwari S D wrote: Hello, I have a clarifying question about Hadoop streaming. I'm new to the list and didn't see anything posted that covers my questions - my apologies if I overlooked a relevant post. I have an input file consisting of a list of files (one per line) that need to be processed independently of each other. The duration for processing each file is significant - perhaps an hour each. I'm using Hadoop streaming without a reduce function to process each file and save the results (back to S3 native in my case). 
To handle to long processing time of each file I've set mapred.task.timeout=0 and I have a pretty straight forward Ruby script reading from STDIN: STDIN.each_line do |line| # Get file from contents of line # Process file (long running) end Currently I'm using a cluster of 3 workers in which each worker can have up to 2 tasks running simultaneously. I've noticed that if I have a single input file with many lines (more than 6 given my cluster), then not all workers will be allocated tasks; I've noticed two workers being allocated one task each and the other worker sitting idly. If I split my input file into multiple files (at least 6) then all workers will be immediately allocated the maximum number of tasks that they can handle. My interpretation on this is fuzzy. It seems that Hadoop streaming will take separate input files and allocate a new task per file (up to the maximum constraint) but if given a single input file it is unclear as to whether a new task is allocated per file or line. My understanding of Hadoop Java is that (unlike Hadoop streaming) when given a single input file, the file will be broken up into separate lines and the maximum number of map tasks will automagically be allocated to handle the lines of the file (assuming the use of TextInputFormat). Can someone clarify this? Thanks, SD
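For what it's worth, the key/value contract Amareshwari describes is easiest to see from a plain (non-streaming) old-API mapper: with NLineInputFormat, or any other format built on LineRecordReader, each map call receives the byte offset of the line as the key and the line text as the value, which is why values like 0, 273, 546 and 819 show up. The sketch below is only to make that contract concrete; it is not the poster's Ruby setup.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Echoes exactly what LineRecordReader hands to each map call:
// the key is the byte offset of the line, the value is the line itself.
public class EchoLineMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, LongWritable, Text> {

  public void map(LongWritable offset, Text line,
                  OutputCollector<LongWritable, Text> output, Reporter reporter)
      throws IOException {
    output.collect(offset, line);
  }
}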
Re: Hadoop's reduce tasks freeze at 0%.
A reduce stall at 0% implies that the map tasks are not outputting any records via the output collector. You need to go look at the task tracker and the task logs on all of your slave machines, to see if anything odd appears in the logs. On the tasktracker web interface detail screen for your job: Are all of the map tasks finished? Are any of the map tasks started? Are there any TaskTracker nodes to service your job? On Sun, Feb 1, 2009 at 11:41 PM, Kwang-Min Choi kmbest.c...@samsung.com wrote: I'm a newbie in Hadoop, and I'm trying to follow the Hadoop quick start guide on the Hadoop homepage, but there are some problems... Downloading and unzipping Hadoop is done, and ssh operates successfully without a passphrase. When I execute the grep example attached to Hadoop, the map task is OK (it reaches 100%), but the reduce task freezes at 0% without any error message. I've waited for more than 1 hour, but it still freezes... The same job in standalone mode works fine. I tried it with versions 0.18.3 and 0.17.2.1; all of them had the same problem. Could you help me solve this problem? Additionally, I'm working on cloud infrastructure from GoGrid (Redhat), so disk space health is OK, and I've installed JDK 1.6.11 for Linux successfully. - KKwams
Re: SequenceFiles, checkpoints, block size (Was: How to flush SequenceFile.Writer?)
If you have to do a time based solution, for now, simply close the file and stage it, then open a new file. Your reads will have to deal with the fact the file is in multiple parts. Warning: Datanodes get pokey if they have large numbers of blocks, and the quickest way to do this is to create a lot of small files. On Mon, Feb 2, 2009 at 9:54 AM, Brian Long br...@dotspots.com wrote: Let me rephrase this problem... as stated below, when I start writing to a SequenceFile from an HDFS client, nothing is visible in HDFS until I've written 64M of data. This presents three problems: fsck reports the file system as corrupt until the first block is finally written out, the presence of the file (without any data) seems to blow up my mapred jobs that try to make use of it under my input path, and finally, I want to basically flush every 15 minutes or so so I can mapred the latest data. I don't see any programmatic way to force the file to flush in 17.2. Additionally, dfs.checkpoint.period does not seem to be obeyed. Does that not do what I think it does? What controls the 64M limit, anyway? Is it dfs.checkpoint.size or dfs.block.size? Is the buffering happening on the client, or on data nodes? Or in the namenode? It seems really bad that a SequenceFile, upon creation, is in an unusable state from the perspective of a mapred job, and also leaves fsck in a corrupt state. Surely I must be doing something wrong... but what? How can I ensure that a SequenceFile is immediately usable (but empty) on creation, and how can I make things flush on some regular time interval? Thanks, Brian On Thu, Jan 29, 2009 at 4:17 PM, Brian Long br...@dotspots.com wrote: I have a SequenceFile.Writer that I obtained via SequenceFile.createWriter and write to using append(key, value). Because the writer volume is low, it's not uncommon for it to take over a day for my appends to finally be flushed to HDFS (e.g. the new file will sit at 0 bytes for over a day). Because I am running map/reduce tasks on this data multiple times a day, I want to flush the sequence file so the mapred jobs can pick it up when they run. What's the right way to do this? I'm assuming it's a fairly common use case. Also -- are writes to the sequence files atomic? (e.g. if I am actively appending to a sequence file, is it always safe to read from that same file in a mapred job?) To be clear, I want the flushing to be time based (controlled explicitly by the app), not size based. Will this create waste in HDFS somehow? Thanks, Brian
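A sketch of the close-and-roll approach Jason describes: append to a part file, and once a fixed interval has passed, close it (making its data visible in HDFS and to MapReduce) and start the next part. The class name, the 15 minute interval, the part naming and the Text key/value types are illustrative assumptions. The warning above still applies: keep the interval long enough that you are not producing a flood of small files.

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class RollingSequenceWriter {
  private static final long ROLL_INTERVAL_MS = 15 * 60 * 1000L;  // 15 minutes

  private final FileSystem fs;
  private final Configuration conf;
  private final Path dir;
  private SequenceFile.Writer writer;
  private long lastRoll;
  private int part;

  public RollingSequenceWriter(String hdfsUri, String dirName) throws IOException {
    this.conf = new Configuration();
    this.fs = FileSystem.get(URI.create(hdfsUri), conf);
    this.dir = new Path(dirName);
    roll();
  }

  public synchronized void append(Text key, Text value) throws IOException {
    if (System.currentTimeMillis() - lastRoll > ROLL_INTERVAL_MS) {
      roll();  // close the current part so its data becomes visible, then start a new one
    }
    writer.append(key, value);
  }

  public synchronized void close() throws IOException {
    writer.close();
  }

  private void roll() throws IOException {
    if (writer != null) {
      writer.close();
    }
    writer = SequenceFile.createWriter(fs, conf, new Path(dir, "part-" + part++),
        Text.class, Text.class);
    lastRoll = System.currentTimeMillis();
  }
}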
Re: problem with completion notification from block movement
This can be made significantly worse by your underlying host file system and the disks that support it. Disabling atime updates via noatime should buy you an immediate 10% gain on the block report time. Not using a RAID 5 controller under the covers should buy you a chunk too. I haven't tried it, but it may be that ext3 with file_type and dir_index might be faster. Disable journals on your file systems, or put the journal on a different device. The goal of the above set of operations is to make it faster for an operation to get file system metadata for each file in a large directory. Using the deadline IO scheduler might help, or otherwise tuning the OS level access to prioritize small reads over large writes. noatime reduces the number of writes generated by the scan; no RAID 5 reduces the number of reads needed for ancillary write operations. In general the collected wisdom says to use multiple individual drives for the block storage, with a comma separated list for the dfs.data.dir parameter, where each entry on the list is on a separate drive that preferably only does Datanode service. <property><name>dfs.data.dir</name><value>/drive1,/drive2,/drive3,/drive4</value></property> Before I left Attributor, there was a thought of running a continuous find on the dfs.data.dir to try to force the kernel to keep the inodes in memory, but I think they abandoned that strategy. On Mon, Feb 2, 2009 at 10:23 AM, Karl Kleinpaste k...@conviva.com wrote: On Sun, 2009-02-01 at 17:58 -0800, jason hadoop wrote: The Datanodes use multiple threads with locking, and one of the assumptions is that the block report (once per hour by default) takes little time. The datanode will pause while the block report is running, and if it happens to take a while weird things start to happen. Thank you for responding, this is very informative for us. Having looked through the source code with a co-worker regarding the periodic scan and then checking the logs once again, we find reports of this sort: BlockReport of 1158499 blocks got processed in 308860 msecs BlockReport of 1159840 blocks got processed in 237925 msecs BlockReport of 1161274 blocks got processed in 177853 msecs BlockReport of 1162408 blocks got processed in 285094 msecs BlockReport of 1164194 blocks got processed in 184478 msecs BlockReport of 1165673 blocks got processed in 226401 msecs The third of these exactly straddles the particular example timeline I discussed in my original email about this question. I suspect I'll find more of the same as I look through other related errors. --karl
Re: hadoop to ftp files into hdfs
If you have a large number of ftp urls spread across many sites, simply set that file to be your hadoop job input, and force the input split to be a size that gives you good distribution across your cluster. On Mon, Feb 2, 2009 at 3:23 PM, Steve Morin steve.mo...@gmail.com wrote: Does any one have a good suggestion on how to submit a hadoop job that will split the ftp retrieval of a number of files for insertion into hdfs? I have been searching google for suggestions on this matter. Steve
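One way to get that split behaviour, as an illustration rather than a prescription, is NLineInputFormat: put one FTP URL per line in the input file, and each map task receives a fixed number of lines to fetch. The sketch below uses the old (0.19-era) mapred API; the paths are made up and the mapper is only a stub, the actual FTP download into HDFS would go where the comment is.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class FtpFetchJob {

  // Stub mapper: each call receives one URL line; the real FTP fetch into HDFS
  // would replace the comment below. Here it only records the URL it saw.
  public static class FetchMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text url,
                    OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      // fetch 'url' via FTP and copy the data into HDFS here (omitted)
      output.collect(url, new Text("fetched"));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(FtpFetchJob.class);
    conf.setJobName("ftp-fetch");

    // One URL per line; each map task gets 10 lines, spreading fetches across the cluster.
    conf.setInputFormat(NLineInputFormat.class);
    conf.setInt("mapred.line.input.format.linespermap", 10);

    FileInputFormat.setInputPaths(conf, new Path("/urls.txt"));
    FileOutputFormat.setOutputPath(conf, new Path("/fetch-log"));

    conf.setMapperClass(FetchMapper.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    conf.setNumReduceTasks(0);

    JobClient.runJob(conf);
  }
}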
Re: My tasktrackers keep getting lost...
When I was at Attributor we experienced periodic odd XFS hangs that would freeze up the Hadoop Server processes resulting in them going away. Sometimes XFS would deadlock all writes to the log file and the server would freeze trying to log a message. Can't even JSTACK the jvm. We never had any traction on resolving the XFS deadlocks and simply reboot the machines when the problem occured. On Mon, Feb 2, 2009 at 7:09 PM, Ian Soboroff ian.sobor...@nist.gov wrote: I hope someone can help me out. I'm getting started with Hadoop, have written the firt part of my project (a custom InputFormat), and am now using that to test out my cluster setup. I'm running 0.19.0. I have five dual-core Linux workstations with most of a 250GB disk available for playing, and am controlling things from my Mac Pro. (This is not the production cluster, that hasn't been assembled yet. This is just to get the code working and figure out the bumps.) My test data is about 18GB of web pages, and the test app at the moment just counts the number of web pages in each bundle file. The map jobs run just fine, but when it gets into the reduce, the TaskTrackers all get lost to the JobTracker. I can't see why, because the TaskTrackers are all still running on the slaves. Also, the jobdetails URL starts returning an HTTP 500 error, although other links from that page still work. I've tried going onto the slaves and manually restarting the tasktrackers with hadoop-daemon.sh, and also turning on job restarting in the site conf and then running stop-mapred/start-mapred. The trackers start up and try to clean up and get going again, but they then just get lost again. Here's some error output from the master jobtracker: 2009-02-02 13:39:40,904 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'attempt_200902021252_0002_r_05_1' from 'tracker_darling:localhost.localdomain/127.0.0.1:58336' 2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker: attempt_200902021252_0002_m_004592_1 is 796370 ms debug. 2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker: Launching task attempt_200902021252_0002_m_004592_1 timed out. 2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker: attempt_200902021252_0002_m_004582_1 is 794199 ms debug. 2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker: Launching task attempt_200902021252_0002_m_004582_1 timed out. 
2009-02-02 13:41:22,271 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_cheyenne:localhost.localdomain/ 127.0.0.1:52769'; resending the previous 'lost' response 2009-02-02 13:41:22,272 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_tigris:localhost.localdomain/ 127.0.0.1:52808'; resending the previous 'lost' response 2009-02-02 13:41:22,272 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_monocacy:localhost.localdomain/ 127.0.0.1:54464'; Resending the previous 'lost' response 2009-02-02 13:41:22,298 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_129.6.101.41:127.0.0.1/127.0.0.1:58744'; resending the previous 'lost' response 2009-02-02 13:41:22,421 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_rhone:localhost.localdomain/ 127.0.0.1:45749'; resending the previous 'lost' response 2009-02-02 13:41:22,421 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 54311 caught: java.lang.NullPointerException at org.apache.hadoop.mapred.MapTask.write(MapTask.java:123) at org.apache.hadoop.mapred.LaunchTaskAction.write(LaunchTaskAction.java :48) at org.apache.hadoop.mapred.HeartbeatResponse.write(HeartbeatResponse.ja va:101) at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:1 59) at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:70) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:907) 2009-02-02 13:41:27,275 WARN org.apache.hadoop.mapred.JobTracker: Status from unknown Tracker : tracker_monocacy:localhost.localdomain/ 127.0.0.1:54464 And from a slave: 2009-02-02 13:26:39,440 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 129.6.101.18:50060, dest: 129.6.101.12:37304, bytes: 6, op: MAPRED_SHUFFLE, cliID: attempt_200902021252_0002_m_000111_0 2009-02-02 13:41:40,165 ERROR org.apache.hadoop.mapred.TaskTracker: Caught exception: java.io.IOException: Call to rogue/129.6.101.41:54311 failed on local exception: null at org.apache.hadoop.ipc.Client.call(Client.java:699) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216) at org.apache.hadoop.mapred.$Proxy4.heartbeat(Unknown Source) at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1164) at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:997)
Re: My tasktrackers keep getting lost...
Can you post the output from hadoop-argus-hostname-jobtracker.out? -Sagar
reading data from multiple output files into a single Map method.
Hi, I am implementing a chained M-R job in Java. If I use multiple reducers, the output gets dispersed among several files on the DFS. How can I read these files into the Map method of the next job? Another doubt I have: is it possible to keep appending to the same output file while implementing an iterative M-R job, i.e. have every M-R job append its result to the same output file? Thanks, Ketan
Re: reading data from multiple output files into a single Map method.
Do you really want a single task to process all of the reduce outputs? If you want all of your output processed by a set of map tasks, you can set the output directory of your previous job as the input directory of your next job, making sure the framework knows how to read the key/value pairs from your reduce output. Alternatively, you could set the number of reduces to 1 in your original job, and you will get a single output file from the single reduce task that is run.

On Mon, Feb 2, 2009 at 9:34 PM, some speed speed.s...@gmail.com wrote: Hi, I am implementing a chained M-R job in Java. If I use multiple reducers, the output gets dispersed among several files on the DFS. How can I read these files into the Map method of the next job? Another doubt I have: is it possible to keep appending to the same output file while implementing an iterative M-R job, i.e. have every M-R job append its result to the same output file? Thanks, Ketan
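A rough sketch of the first approach (not code from the thread; it assumes the old org.apache.hadoop.mapred API from the 0.19 era and uses IdentityMapper/IdentityReducer as stand-ins for real map and reduce classes): job 1 writes its reduce output as SequenceFiles into an intermediate directory, and job 2 simply points its input at that directory, so every part-* file produced by job 1's reducers becomes input to job 2's map tasks.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class ChainedJobs {
  public static void main(String[] args) throws Exception {
    Path input = new Path(args[0]);        // original input
    Path intermediate = new Path(args[1]); // output of job 1, input of job 2
    Path output = new Path(args[2]);       // final output

    // Job 1: identity map/reduce stands in for the real first job. Writing
    // the reduce output as SequenceFiles lets job 2 read the key/value
    // pairs back without any custom parsing.
    JobConf job1 = new JobConf(ChainedJobs.class);
    job1.setJobName("chain-step-1");
    job1.setMapperClass(IdentityMapper.class);
    job1.setReducerClass(IdentityReducer.class);
    job1.setOutputKeyClass(LongWritable.class);  // key type of the default TextInputFormat
    job1.setOutputValueClass(Text.class);
    job1.setOutputFormat(SequenceFileOutputFormat.class);
    FileInputFormat.setInputPaths(job1, input);
    FileOutputFormat.setOutputPath(job1, intermediate);
    JobClient.runJob(job1);  // blocks until job 1 completes

    // Job 2: its input directory is simply job 1's output directory, so
    // every part-* file written by job 1's reducers is read by job 2's maps.
    JobConf job2 = new JobConf(ChainedJobs.class);
    job2.setJobName("chain-step-2");
    job2.setMapperClass(IdentityMapper.class);
    job2.setReducerClass(IdentityReducer.class);
    job2.setOutputKeyClass(LongWritable.class);
    job2.setOutputValueClass(Text.class);
    job2.setInputFormat(SequenceFileInputFormat.class);
    FileInputFormat.setInputPaths(job2, intermediate);
    FileOutputFormat.setOutputPath(job2, output);
    JobClient.runJob(job2);
  }
}

Using SequenceFiles for the intermediate data (rather than plain text) keeps the key and value types intact between the two jobs, so job 2's mappers receive exactly the pairs that job 1's reducers emitted.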
Re: A record version mismatch occured. Expecting v6, found v32
Thanks, Tom. The reason the content looked different was that I converted one sample to Base64 byte by byte, and converted the other from byte array to byte array (strange that they produce different outputs). Thanks for the good points. Rasit

2009/2/2 Tom White t...@cloudera.com: The SequenceFile format is described here: http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/io/SequenceFile.html. The format of the keys and values depends on the serialization classes used. For example, BytesWritable writes out the length of its byte array followed by the actual bytes in the array (see the write() method in BytesWritable). Hope this helps. Tom

On Mon, Feb 2, 2009 at 3:21 PM, Rasit OZDAS rasitoz...@gmail.com wrote: I tried to use SequenceFile.Writer to convert my binaries into SequenceFiles. I read the binary data with FileInputStream, getting all bytes with reader.read(byte[]), and wrote them to a file with SequenceFile.Writer, with NullWritable as key and BytesWritable as value. But the content changes (I can see that by converting to Base64):

Binary File: 73 65 65 65 81 65 65 65 65 65 81 81 65 119 84 81 65 111 67 81 65 52 57 81 65 103 54 81 65 65 97 81 65 65 65 81 ...
Sequence File: 73 65 65 65 65 69 65 65 65 65 65 65 65 69 66 65 65 77 66 77 81 103 67 103 67 69 77 65 52 80 86 67 65 73 68 114 ...

Thanks for any pointers. Rasit

2009/2/2 Rasit OZDAS rasitoz...@gmail.com: Hi, I tried to use SequenceFileInputFormat; for this I appended SEQ as the first bytes of my binary files (with a hex editor), but I get this exception:

A record version mismatch occured. Expecting v6, found v32
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1460)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1428)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1417)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1412)
at org.apache.hadoop.mapred.SequenceFileRecordReader.init(SequenceFileRecordReader.java:43)
at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:58)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321)
at org.apache.hadoop.mapred.Child.main(Child.java:155)

What could it be? Is it not enough just to add SEQ to binary files? I use Hadoop v0.19.0. Thanks in advance.. Rasit

different *version* of *Hadoop* between your server and your client.

-- M. Raşit ÖZDAŞ
-- M. Raşit ÖZDAŞ
-- M. Raşit ÖZDAŞ
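As a rough illustration of the approach Rasit describes (not code from the thread; the class name and argument handling are made up for the example), the sketch below reads one local binary file and appends it to a SequenceFile as a single record with a NullWritable key and a BytesWritable value. As Tom notes, BytesWritable serializes an int length before the payload, and the SequenceFile adds its own SEQ header and record framing, so the stored bytes will not match the original file byte for byte; that is consistent with the differing dumps above.

import java.io.File;
import java.io.FileInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;

public class BinaryToSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    File local = new File(args[0]);    // source binary file on the local disk
    Path seqPath = new Path(args[1]);  // destination SequenceFile (e.g. on HDFS)

    // Read the whole binary file into memory; read() may return fewer bytes
    // than requested, so loop until the buffer is full or EOF is reached.
    byte[] bytes = new byte[(int) local.length()];
    FileInputStream in = new FileInputStream(local);
    try {
      int off = 0;
      while (off < bytes.length) {
        int n = in.read(bytes, off, bytes.length - off);
        if (n < 0) break;
        off += n;
      }
    } finally {
      in.close();
    }

    // One record per file: NullWritable key, the file contents as the value.
    // BytesWritable writes an int length before the bytes, and the SequenceFile
    // itself starts with a "SEQ" header plus a version byte and metadata, so
    // the file on HDFS is framed data rather than a raw copy of the input.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, seqPath, NullWritable.class, BytesWritable.class);
    try {
      writer.append(NullWritable.get(), new BytesWritable(bytes));
    } finally {
      writer.close();
    }
  }
}

Reading the result back with SequenceFileInputFormat then gives map tasks properly framed key/value records, which is what prepending SEQ to a raw binary file by hand cannot provide.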