null value output from map...
In writing a Map/Reduce job I ran across something I found a little strange. I have a situation where I don't need a value output from map. If I set the value passed to the OutputCollector to null I get the following exception:

  java.lang.NullPointerException
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:562)

Looking at the code in MapTask.java ( Hadoop 0.19.1 ) it makes sense why it would throw the exception:

  if (value.getClass() != valClass) {
    throw new IOException("Type mismatch in value from map: expected "
                          + valClass.getName() + ", recieved "
                          + value.getClass().getName());
  }

I guess my question is as follows: is it a bad idea/not normal to collect a null value in map? Outputting from reduce through TextOutputFormat with a null value works as I expect: if the value is null, only the key and a newline are output.

Any thoughts would be appreciated.
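One workaround, if emitting no value at all is the goal, is to make the map output value type NullWritable and emit NullWritable.get() instead of a literal null, so the collector always sees a real Writable. A minimal sketch against the 0.19 mapred API ( the class name and key/value choices are placeholders for illustration, not taken from the job above ):

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  // Hypothetical mapper that emits only keys.
  public class KeyOnlyMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, NullWritable> {

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, NullWritable> output,
                    Reporter reporter) throws IOException {
      // NullWritable.get() is a real (singleton) Writable, so the value-class
      // check in MapTask passes and no NullPointerException is thrown.
      output.collect(line, NullWritable.get());
    }
  }

The JobConf would also need setMapOutputValueClass(NullWritable.class) ( or setOutputValueClass, if the reduce output value type is the same ) so the declared value class matches what the mapper emits.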
Best practices on splitting an input line?
I have a question. I've dabbled with different ways of tokenizing an input file line for processing, and I've noticed in my somewhat limited tests that there seem to be some pretty reasonable performance differences between the tokenizing methods. For example, to split a line into tokens ( tab delimited in my case ), it seems roughly that Scanner is the slowest, followed by String.split, with StringTokenizer being the fastest.

StringTokenizer, for my application, has the unfortunate characteristic of not returning blank tokens ( i.e., parsing "a,b,c,,d" would return "a","b","c","d" instead of "a","b","c","","d" ). The WordCount example uses StringTokenizer, which makes sense to me, except I'm currently getting hung up on it not returning blank tokens. I did run across the com.Ostermiller.util StringTokenizer replacement that handles null/blank tokens ( http://ostermiller.org/utils/StringTokenizer.html ), which seems possible to use, but it sure seems like someone else has already solved this problem better than I have.

So, my question is: is there a "best practice" for splitting an input line, especially when NULL tokens are expected ( i.e., two consecutive delimiter characters )?

Any thoughts would be appreciated.

Thanks

Andy
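For reference, a small sketch of the split behavior described above ( the sample line is made up ). String.split does keep interior blank tokens, and passing a negative limit keeps trailing blanks as well; using a precompiled Pattern avoids recompiling the regex on every line, which may matter when tokenizing one line per record:

  import java.util.regex.Pattern;

  public class SplitSketch {
    // Compile the delimiter once rather than on every String.split() call.
    private static final Pattern TAB = Pattern.compile("\t");

    public static void main(String[] args) {
      String line = "a\tb\t\tc\t";              // blank field in the middle, trailing tab
      String[] dropTrailing = line.split("\t");  // ["a", "b", "", "c"]     - trailing blank dropped
      String[] keepAll = TAB.split(line, -1);    // ["a", "b", "", "c", ""] - everything kept
      System.out.println(dropTrailing.length + " vs " + keepAll.length);  // 4 vs 5
    }
  }

Whether the precompiled Pattern approach is actually faster than StringTokenizer for a given input is something I'd benchmark rather than assume.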
API Documentation question - WritableComparable
I have a question regarding the Hadoop API documentation for 0.19. The question is in regard to: http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/io/WritableComparable.html. The document shows the following for the compareTo method:

  public int compareTo(MyWritableComparable w) {
    int thisValue = this.value;
    int thatValue = ((IntWritable)o).value;
    return (thisValue < thatValue ? -1 : (thisValue==thatValue ? 0 : 1));
  }

Taken together with the full class example, this doesn't compile. What I _think_ would be right would be:

  public int compareTo(Object o) {
    int thisValue = this.value;
    int thatValue = ((MyWritableComparable)o).value;
    return (thisValue < thatValue ? -1 : (thisValue==thatValue ? 0 : 1));
  }

But even then it's unclear why the compareTo function is comparing value ( which isn't a member of the class in the example ) and not the counter and timestamp variables in the class. Am I understanding this right? Is there something amiss with the documentation?

Thanks

Andy
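For what it's worth, a sketch of what a compareTo over the example's own fields might look like ( this is only my guess at the intent, assuming the int counter and long timestamp members the example class declares; it is not what the documentation actually shows ):

  // Assuming the example class's fields:
  //   private int counter;
  //   private long timestamp;
  public int compareTo(Object o) {
    MyWritableComparable that = (MyWritableComparable) o;
    // Order by counter first, then break ties on timestamp.
    if (this.counter != that.counter) {
      return this.counter < that.counter ? -1 : 1;
    }
    return this.timestamp < that.timestamp ? -1
         : (this.timestamp == that.timestamp ? 0 : 1);
  }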
RE: internal/external interfaces for hadoop...
Ah. Thanks. That makes what I was trying to do sound rather ridiculous now, doesn't it. I appreciate the insight.

Thanks

Andy

-Original Message-
From: Taeho Kang [mailto:[EMAIL PROTECTED]
Sent: Monday, December 08, 2008 6:10 PM
To: core-user@hadoop.apache.org
Subject: Re: internal/external interfaces for hadoop...

When reading from or writing to a file on HDFS, data blocks never go through the namenode. They are handled/transferred directly between your client and the datanodes that contain the blocks. Hence, the datanodes must be accessible by your client. In this case, since your client is on an external network, your datanodes must be accessible to external networks.

On Tue, Dec 9, 2008 at 8:25 AM, Andy Sautins <[EMAIL PROTECTED]>wrote:

>
>
> I'm trying to set up what I think would be a common hadoop
> configuration. I have 4 data nodes on an internal 10.x network. Each
> of the data nodes only has access to the 10.x network. The name node
> has both an internal 10.x network interface and an external interface.
> I want the hdfs filesystem and job tracker to be available on the
> external network, but the communication within the cluster to be on the
> 10.x network. Is this possible to do? By changing the fs.default.name
> configuration parameter I can change the filesystem to listen on the
> external interface instead of the internal one; however, then the data
> nodes can't communicate with the name node. I also tried setting the
> fs.default.name IP address to 0.0.0.0 to see if it would bind to all
> interfaces, but that didn't seem to work.
>
> Is it possible to configure hadoop so that the datanodes communicate
> on an internal network, but access to hdfs and the job tracker are done
> through an external interface?
>
> Any help would be much appreciated.
>
> Thank you
>
> Andy
>
internal/external interfaces for hadoop...
I'm trying to set up what I think would be a common hadoop configuration. I have 4 data nodes on an internal 10.x network. Each of the data nodes only has access to the 10.x network. The name node has both an internal 10.x network interface and an external interface. I want the hdfs filesystem and job tracker to be available on the external network, but the communication within the cluster to be on the 10.x network. Is this possible to do?

By changing the fs.default.name configuration parameter I can change the filesystem to listen on the external interface instead of the internal one; however, then the data nodes can't communicate with the name node. I also tried setting the fs.default.name IP address to 0.0.0.0 to see if it would bind to all interfaces, but that didn't seem to work.

Is it possible to configure hadoop so that the datanodes communicate on an internal network, but access to hdfs and the job tracker are done through an external interface?

Any help would be much appreciated.

Thank you

Andy
RE: Can mapper get access to filename being processed?
Thanks. map.input.file is exactly what I need.

One more question. Is there a way to ignore a file in an input path? Say the data in hadoop is stored in one directory per date, with one .txt file per machine. For Dec 1, 2008, with a file from machine a and a file from machine b, I would have the following directory structure:

/20081201/a.txt
/20081201/b.txt

What I'd like to do is have a job that, depending on the configuration, would either process all files or only the files for a given machine ( say a, but not b ). Is that possible to do, or am I trying to use Hadoop in a way that it's not intended to be used? I looked briefly at MultipleInputs, which seems to be able to handle different input paths, but not handle a single input path in different ways depending on the filename.

Thanks again.

Andy

-Original Message-
From: Devaraj Das [mailto:[EMAIL PROTECTED]
Sent: Sunday, December 07, 2008 12:11 PM
To: core-user@hadoop.apache.org
Subject: Re: Can mapper get access to filename being processed?

On 12/7/08 11:32 PM, "Andy Sautins" <[EMAIL PROTECTED]> wrote:

>
> I'm having trouble finding a way to do what I want, so I'm wondering
> if I'm just not looking in the right place or if I'm thinking about the
> problem in the wrong way. Any insight would be appreciated.
>
> Let's say I have a directory of files that contains a combination of
> different file types. The MapReduce job needs to process all files in
> the directory but generates different key/value pairs depending on the
> file being processed. What I'd like to do is use the filename to
> identify the file type being processed and use that information in the
> map job. It seems like what I'd want is for the map job to have access
> to the filename of the input file split being processed. I haven't been
> able to find out if that is available to a derived class of
> MapReduceBase.
>

That's map.input.file, available in the map via JobConf. The mapper class has to override the implementation of configure in MapReduceBase and get the filename via JobConf.get("map.input.file"). Store that in some field variable of your mapper class. You can then inspect that in your map method.

>
> Does what I'm trying to do make sense or is there a better way of
> processing a job like the one I'm describing?
>

Look at the MultipleInputs class (in the mapred.lib directory). That could prove useful.

>
> Thank you
>
> Andy
>
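To make the suggestion concrete, below is a minimal sketch of the configure()/map.input.file approach against the 0.19 mapred API. The "target.machine" property, the class name and the field names are made up for illustration; only map.input.file itself comes from the framework:

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class FilenameAwareMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, LongWritable> {

    private String inputFile;      // path of the file backing the current split
    private String targetMachine;  // e.g. "a"; empty means process everything

    @Override
    public void configure(JobConf conf) {
      // map.input.file is set by the framework for each input split.
      inputFile = conf.get("map.input.file", "");
      targetMachine = conf.get("target.machine", "");
    }

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, LongWritable> output,
                    Reporter reporter) throws IOException {
      // Drop records from files that don't belong to the configured machine.
      if (targetMachine.length() > 0
          && !inputFile.endsWith("/" + targetMachine + ".txt")) {
        return;
      }
      output.collect(line, new LongWritable(1));
    }
  }

Note that filtering this way still reads every split and simply drops the unwanted records; only adding the desired files to the job's input paths in the first place would avoid reading the other machines' files at all.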
Can mapper get access to filename being processed?
I'm having trouble finding a way to do what I want, so I'm wondering if I'm just not looking in the right place or if I'm thinking about the problem in the wrong way. Any insight would be appreciated.

Let's say I have a directory of files that contains a combination of different file types. The MapReduce job needs to process all files in the directory but generates different key/value pairs depending on the file being processed. What I'd like to do is use the filename to identify the file type being processed and use that information in the map job. It seems like what I'd want is for the map job to have access to the filename of the input file split being processed. I haven't been able to find out if that is available to a derived class of MapReduceBase.

Does what I'm trying to do make sense, or is there a better way of processing a job like the one I'm describing?

Thank you

Andy
RE: Strange behavior with bzip2 input files w/release 0.19.0
Abdul,

Please note that I applied patch 4012 version 4 to release 0.19.0 and re-ran my tests, with mixed results. My simple test ( 20 million simple records ) for both pbzip2/bzip2 generated the same correct results, which is great. However, a larger test case ( described in more detail below ) had a discrepancy in the results when compared to gzip and plain text files. bzip2/gzip/text all produced the same results pre-patch. The bzip2 run had 3 additional records compared to the text/gzip runs post patch.

The following are timings and results for a sample dataset running a simple MapReduce job ( a MapReduce version of unix 'wc' ). Note the dataset consists of 11 files that are a total of 27G uncompressed, 4.5G gzip compressed and 3.1G bzip2 compressed. All 3 datasets are identical and produce the same md5sum. Also, the bzip2 files in this test were compressed using bzip2, not pbzip2.

Release 0.19.0, pre-patch:

  Type    Timing    MapReduce Result
  ----    ------    ----------------
  Gzip    4m55s     323,234,098
  Bzip2   16m14s    323,234,098
  Txt     6m23s     323,234,098

Release 0.19.0, post patch 4012 version 4 ( w/ results ):

  Type    Timing    MapReduce Result
  ----    ------    ----------------
  Gzip    5m14s     332,234,098
  Bzip2   9m36s     332,234,101
  Txt     6m28s     332,234,098

Both Gzip/Txt timings were about the same between runs. Bzip2 elapsed time was reduced significantly. So, generally positive, although it looks like there might be an edge case causing slightly different results. I'll work on putting together a test case of manageable size that reproduces the result discrepancy.

Thanks again for the help.

Andy

-Original Message-
From: Andy Sautins [mailto:[EMAIL PROTECTED]
Sent: Thursday, December 04, 2008 2:29 PM
To: core-user@hadoop.apache.org
Subject: RE: Strange behavior with bzip2 input files w/release 0.19.0

Thanks Abdul. Very exciting that hadoop will soon be able to handle not only pbzip2 files but also be able to split bzip2 files. I will apply the patch and report back.

Thank you

Andy

-Original Message-
From: Abdul Qadeer [mailto:[EMAIL PROTECTED]
Sent: Thursday, December 04, 2008 1:49 PM
To: core-user@hadoop.apache.org
Subject: Re: Strange behavior with bzip2 input files w/release 0.19.0

Andy,

As was mentioned earlier, splitting support is being added for bzip2 files and the patch is actually under review now. I think pbzip2 generated files should work fine with that, because the split algorithm finds the next start of block marker and does not use the end of stream marker. We rather use the physical end of file to know when the stream ends. So if you look at https://issues.apache.org/jira/browse/HADOOP-4012 you can download the version 4 patch and apply it on the Hadoop code and see if it is working for you, or you can wait for the review process to complete so that the code becomes a part of standard Hadoop. You can add yourself as a watcher there at JIRA 4012, so that you know when it is done. Please let me know if pbzip2 generated files do not work even with that code.

Thank you,
Abdul Qadeer

On Thu, Dec 4, 2008 at 11:46 AM, Andy Sautins <[EMAIL PROTECTED]>wrote:

>
> Thanks for the response Abdul.
>
> So, the bzip2 file in question is _kindof_ a concatenation of
> multiple bzip2 files. It's not concatenated using cat a.bz2 b.bz2 >
> yourFile.bz2, but it is created using pbzip2 ( pbzip2 v1.0.2 running on
> CentOS 5.2, installed from the EPEL repository ). My understanding is
> that pbzip2 does roughly what you're saying and concatenates in some
> manner.
>
> I created a simple test case that reproduces the behavior. I created
> a file using the following perl script:
>
> for($i=0;$i<20000000;$i++) {
>   print "Line $i\n";
> }
>
> I then created two different bzip2 files, one with bzip2 and one
> with pbzip2. They do have different sizes:
>
> 21994233 simple.bzip2.txt.bz2
> 21999416 simple.pbzip2.txt.bz2
>
> They do decompress to give the same output file:
>
> bunzip2 -c simple.bzip2.txt.bz2 | md5sum
> 581ad242e6cf22650072edd44d6a2d38 -
>
> bunzip2 -c simple.pbzip2.txt.bz2 | md5sum
> 581ad242e6cf22650072edd44d6a2d38 -
>
> Running both through the simple line count MapReduce job I get the
> same behavior where bzip2 correctly calculates 20,000,000 records, but
> the pbzip2 generated file only processes the first block ( 82,829
> records ).
>
> So, it sounds like what you're saying about having multiple end of
> stream markers makes sense. I will say it would be very beneficial to
> be able to use pbzip2 generated files to compress hadoop input files.
> Using pbzip2 can greatly reduce the amount of time required to bzip2
> compress files and seems to generate a valid bzip2 file ( at least
> bunzip2 decompresses it correctly ).
RE: Strange behavior with bzip2 input files w/release 0.19.0
Thanks Abdul. Very exciting that hadoop will soon be able to handle not only pbzip2 files but also be able to split bzip2 files. I will apply the patch and report back. Thank you Andy -Original Message- From: Abdul Qadeer [mailto:[EMAIL PROTECTED] Sent: Thursday, December 04, 2008 1:49 PM To: core-user@hadoop.apache.org Subject: Re: Strange behavior with bzip2 input files w/release 0.19.0 Andy, As was mentioned earlier that splitting support is being added for bzip2 files and actually patch is under review now. I think, pbzip2 generated files should work fine with that because the split algorithm finds the next start of block marker and does not use end of stream marker. We rather use physical end of file to know when stream ends. So if you see at https://issues.apache.org/jira/browse/HADOOP-4012 you can download version 4 patch and apply it on Hadoop code and see if its working for you or you can wait for the review process to complete so that code becomes a part of standard Hadoop. You can add yourself as a watcher there at JIRA 4012, so that you know when its done. Please let me know, if pbzip2 generated files does not work even on that code. Thank you, Abdul Qadeer On Thu, Dec 4, 2008 at 11:46 AM, Andy Sautins <[EMAIL PROTECTED]>wrote: > > Thanks for the response Abdul. > > So, the bzip2 file in question is _kindof_ a concatenation of > multiple bzip2 files. It's not concatenated using cat a.bz2 b.bz2 > > yourFile.bz2, but it is created using pbzip2 ( pbzip2 v1.0.2 running on > CentOS 5.2 installed from the EPEL repository ). My understanding is > that pbzip does roughly what you're saying and concatenates in some > manner. > > I created a simple test case that reproduces the behavior. I created > a file using the following perl script: > > for($i=0;$i<2000;$i++) { > print "Line $i\n"; > } > >I then created two different bzip2 files. One with bzip2 and one > with pbzip2. The do have different sizes: > > 21994233 simple.bzip2.txt.bz2 > 21999416 simple.pbzip2.txt.bz2 > >They do decompress to give the same output file > bunzip2 -c simple.bzip2.txt.bz2 | md5sum > 581ad242e6cf22650072edd44d6a2d38 - > > bunzip2 -c simple.pbzip2.txt.bz2 | md5sum > 581ad242e6cf22650072edd44d6a2d38 - > > Running both through the simple line count MapReduce job I get the > same behavior where bzip2 correctly calculates 20,000,000 records, but > the pbzip2 generated file only processes the first block ( 82,829 > records ). > > So, it sounds like what you're saying of having multiple end of > stream markers makes sense. I will say it would be very beneficial to > be able to use pbzip2 generated files to compress hadoop input files. > Using pbzip2 can greatly reduce the amount of time required to bzip2 > compress files and seems to generate a valid bzip2 file ( at least it > bunzip2 decompresses correctly ). > > Thank you > > Andy > > -Original Message- > From: Abdul Qadeer [mailto:[EMAIL PROTECTED] > Sent: Thursday, December 04, 2008 12:07 PM > To: core-user@hadoop.apache.org > Subject: Re: Strange behavior with bzip2 input files w/release 0.19.0 > > Andy, > > As you said, you suspect that only one bzip2 block is being decompressed > and used; is you bzip2 file the concatenation of multiple bzip2 files > (i.e. > are > you doing something like cat a.bz2 b.bz2 c.bz2 > yourFile.bz2 ?) In > such > a case, there will be many bzip2 end of stream markers in a single file > and > bzip2 decomprssor will stop on encountering the first end of block > marker > when in fact, the stream has more data in it. 
> > If this is not the case, then bzip2 should work as gzip or plaintext are > working. > Currently only one mapper gets the whole file (just like gzip and > splitting > support > for bzip is being added in HADOOP-4012, as Alex mentioned). The > LineRecordReader > get the uncompressed data and does rest of the things same as in the > case > of gzip or plaintext. So can you provide your bzip2 compressed file? > (May > be > uploading it somewhere and sending in the link) I will look into this > issue. > > > Abdul Qadeer > > On Thu, Dec 4, 2008 at 9:11 AM, Andy Sautins > <[EMAIL PROTECTED]>wrote: > > > > > > >I'm seeing some strange behavior with bzip2 files and release > > 0.19.0. I'm wondering if anyone can shed some light on what I'm > seeing. > > Basically it _looks_ like the processing of a particular bzip2 input > > file is stopping after the first bzip2 block. Below is a comparison > of > > tests between a .gz file which seems to do what I expect, and the > same > &
RE: Strange behavior with bzip2 input files w/release 0.19.0
Thanks for the response Abdul.

So, the bzip2 file in question is _kindof_ a concatenation of multiple bzip2 files. It's not concatenated using cat a.bz2 b.bz2 > yourFile.bz2, but it is created using pbzip2 ( pbzip2 v1.0.2 running on CentOS 5.2, installed from the EPEL repository ). My understanding is that pbzip2 does roughly what you're saying and concatenates in some manner.

I created a simple test case that reproduces the behavior. I created a file using the following perl script:

for($i=0;$i<20000000;$i++) {
  print "Line $i\n";
}

I then created two different bzip2 files, one with bzip2 and one with pbzip2. They do have different sizes:

21994233 simple.bzip2.txt.bz2
21999416 simple.pbzip2.txt.bz2

They do decompress to give the same output file:

bunzip2 -c simple.bzip2.txt.bz2 | md5sum
581ad242e6cf22650072edd44d6a2d38 -

bunzip2 -c simple.pbzip2.txt.bz2 | md5sum
581ad242e6cf22650072edd44d6a2d38 -

Running both through the simple line count MapReduce job I get the same behavior where bzip2 correctly calculates 20,000,000 records, but the pbzip2 generated file only processes the first block ( 82,829 records ).

So, it sounds like what you're saying about having multiple end of stream markers makes sense. I will say it would be very beneficial to be able to use pbzip2 generated files to compress hadoop input files. Using pbzip2 can greatly reduce the amount of time required to bzip2 compress files and seems to generate a valid bzip2 file ( at least bunzip2 decompresses it correctly ).

Thank you

Andy

-Original Message-
From: Abdul Qadeer [mailto:[EMAIL PROTECTED]
Sent: Thursday, December 04, 2008 12:07 PM
To: core-user@hadoop.apache.org
Subject: Re: Strange behavior with bzip2 input files w/release 0.19.0

Andy,

As you said, you suspect that only one bzip2 block is being decompressed and used; is your bzip2 file the concatenation of multiple bzip2 files (i.e. are you doing something like cat a.bz2 b.bz2 c.bz2 > yourFile.bz2 ?) In such a case, there will be many bzip2 end of stream markers in a single file and the bzip2 decompressor will stop on encountering the first end of block marker when in fact the stream has more data in it.

If this is not the case, then bzip2 should work as gzip or plaintext are working. Currently only one mapper gets the whole file (just like gzip; splitting support for bzip2 is being added in HADOOP-4012, as Alex mentioned). The LineRecordReader gets the uncompressed data and does the rest of the things the same as in the case of gzip or plaintext. So can you provide your bzip2 compressed file? (Maybe upload it somewhere and send the link) I will look into this issue.

Abdul Qadeer

On Thu, Dec 4, 2008 at 9:11 AM, Andy Sautins <[EMAIL PROTECTED]>wrote:

>
> I'm seeing some strange behavior with bzip2 files and release
> 0.19.0. I'm wondering if anyone can shed some light on what I'm seeing.
> Basically it _looks_ like the processing of a particular bzip2 input
> file is stopping after the first bzip2 block. Below is a comparison of
> tests between a .gz file, which seems to do what I expect, and the same
> file as .bz2, which doesn't behave as I expect.
>
> I have the same file stored in hadoop compressed as both bzip2 and
> gz formats. The uncompressed file size is 660,841,894 bytes. Comparing
> the files, they both seem to be valid archives of the exact same file:
> /usr/local/hadoop/bin/hadoop dfs -cat bzip2.example/data.bz2/file.txt.bz2 | bunzip2 -c | md5sum
> 2c82901170f44245fb04d24ad4746e38 -
>
> /usr/local/hadoop/bin/hadoop dfs -cat bzip2.example/data.gz/file.txt.gz | gunzip -c | md5sum
> 2c82901170f44245fb04d24ad4746e38 -
>
> Given the md5 sums match, it seems like the files are the same and
> uncompress correctly.
>
> Now when I run a simple Map/Reduce application that just counts
> lines in the file I get different results.
>
> Expected Results:
>
> /usr/local/hadoop/bin/hadoop dfs -cat bzip2.bug.example/data.gz/file.txt.gz | gunzip -c | wc -l
> 6884024
>
> Gzip input file Results: 6,884,024
> Bzip2 input file Results: 9,420
>
> Looking at the task log files, the MAP_INPUT_BYTES of the .gz file
> looks correct ( [(MAP_INPUT_BYTES)(Map input bytes)(660,841,894)] ) and
> matches the size of the uncompressed file. However, looking at
> MAP_INPUT_BYTES for the .bz2 file it's 900,000 ( [(MAP_INPUT_BYTES)(Map
> input bytes)(900,000)] ), which matches the block size of the bzip2
> compressed file. So that makes me think that for some reason only the
> first bzip2 block of the bzip2 compressed file is being processed.
>
> So I'm wondering if my analysis is correct and if there could be an
> issue with the processing of bzip2 input files.
>
> Andy
>
Strange behavior with bzip2 input files w/release 0.19.0
I'm seeing some strange behavior with bzip2 files and release 0.19.0. I'm wondering if anyone can shed some light on what I'm seeing. Basically it _looks_ like the processing of a particular bzip2 input file is stopping after the first bzip2 block. Below is a comparison of tests between a .gz file, which seems to do what I expect, and the same file as .bz2, which doesn't behave as I expect.

I have the same file stored in hadoop compressed as both bzip2 and gz formats. The uncompressed file size is 660,841,894 bytes. Comparing the files, they both seem to be valid archives of the exact same file:

/usr/local/hadoop/bin/hadoop dfs -cat bzip2.example/data.bz2/file.txt.bz2 | bunzip2 -c | md5sum
2c82901170f44245fb04d24ad4746e38 -

/usr/local/hadoop/bin/hadoop dfs -cat bzip2.example/data.gz/file.txt.gz | gunzip -c | md5sum
2c82901170f44245fb04d24ad4746e38 -

Given the md5 sums match, it seems like the files are the same and uncompress correctly.

Now when I run a simple Map/Reduce application that just counts lines in the file I get different results.

Expected Results:

/usr/local/hadoop/bin/hadoop dfs -cat bzip2.bug.example/data.gz/file.txt.gz | gunzip -c | wc -l
6884024

Gzip input file Results: 6,884,024
Bzip2 input file Results: 9,420

Looking at the task log files, the MAP_INPUT_BYTES of the .gz file looks correct ( [(MAP_INPUT_BYTES)(Map input bytes)(660,841,894)] ) and matches the size of the uncompressed file. However, looking at MAP_INPUT_BYTES for the .bz2 file it's 900,000 ( [(MAP_INPUT_BYTES)(Map input bytes)(900,000)] ), which matches the block size of the bzip2 compressed file. So that makes me think that for some reason only the first bzip2 block of the bzip2 compressed file is being processed.

So I'm wondering if my analysis is correct and if there could be an issue with the processing of bzip2 input files.

Andy
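The line-count job itself isn't shown above; for reference, a minimal sketch of what such a job typically looks like against the 0.19 mapred API ( class names and the input/output paths are placeholders, and this is not necessarily the exact job used for the numbers in this thread ):

  import java.io.IOException;
  import java.util.Iterator;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;

  public class LineCount {

    public static class CountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {
      private static final Text LINES = new Text("lines");
      private static final LongWritable ONE = new LongWritable(1);

      public void map(LongWritable offset, Text line,
                      OutputCollector<Text, LongWritable> output,
                      Reporter reporter) throws IOException {
        output.collect(LINES, ONE);   // one count per input record (line)
      }
    }

    public static class SumReducer extends MapReduceBase
        implements Reducer<Text, LongWritable, Text, LongWritable> {
      public void reduce(Text key, Iterator<LongWritable> values,
                         OutputCollector<Text, LongWritable> output,
                         Reporter reporter) throws IOException {
        long sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new LongWritable(sum));
      }
    }

    public static void main(String[] args) throws IOException {
      JobConf conf = new JobConf(LineCount.class);
      conf.setJobName("linecount");
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(LongWritable.class);
      conf.setMapperClass(CountMapper.class);
      conf.setCombinerClass(SumReducer.class);
      conf.setReducerClass(SumReducer.class);
      // The default TextInputFormat picks a decompression codec from the file
      // extension ( e.g. .gz, .bz2 ) when one is configured, which is why the
      // same job can be pointed at plain text, gzip or bzip2 input.
      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));
      JobClient.runJob(conf);
    }
  }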