mrbench fails when run on a pseudo-distributed cluster
Hi guys,

Executed command:

bin/hadoop jar build/hadoop-0.20.3-dev-test.jar mrbench -numRuns 5

Error message:

[hadoop@localhost hadoop]$ bin/hadoop jar build/hadoop-0.20.3-dev-test.jar mrbench -numRuns 5
MRBenchmark.0.0.2
12/02/17 16:45:11 INFO mapred.MRBench: creating control file: 1 numLines, ASCENDING sortOrder
12/02/17 16:45:11 INFO mapred.MRBench: created control file: /benchmarks/MRBench/mr_input/input_-650014558.txt
java.lang.NullPointerException
        at java.util.Hashtable.put(Hashtable.java:394)
        at java.util.Properties.setProperty(Properties.java:143)
        at org.apache.hadoop.conf.Configuration.set(Configuration.java:404)
        at org.apache.hadoop.mapred.JobConf.setJar(JobConf.java:243)
        at org.apache.hadoop.mapred.MRBench.runJobInSequence(MRBench.java:177)
        at org.apache.hadoop.mapred.MRBench.main(MRBench.java:280)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.hadoop.test.AllTestDriver.main(AllTestDriver.java:81)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
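The trace bottoms out in java.util.Hashtable.put(), which throws NullPointerException when handed a null value; Configuration.set() in this Hadoop version is backed by java.util.Properties, so the trace suggests JobConf.setJar() was called with a null jar path. A minimal sketch (hypothetical, not the actual MRBench source) that reproduces the same exception:

```java
import org.apache.hadoop.mapred.JobConf;

public class SetJarNpe {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // JobConf.setJar() stores the path via Configuration.set(), which is
        // backed by java.util.Properties; Properties.setProperty(key, null)
        // throws NullPointerException from Hashtable.put().
        String jar = null; // stand-in for whatever jar path MRBench failed to resolve
        conf.setJar(jar);  // -> java.lang.NullPointerException, matching the trace above
    }
}
```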
Hadoop Example in Java
Hi all,

I am looking for Java examples for Hadoop. I have done a lot of searching but have only found word count. Are there any other examples?

--
View this message in context: http://old.nabble.com/Hadoop-Example-in-java-tp33341353p33341353.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: Hadoop Example in Java
For more framework-provided examples, also take a look at the src/examples directory of your downloaded distribution. I also suggest getting 'Hadoop: The Definitive Guide' by Tom White (O'Reilly) to get started with; it carries examples and all the other information useful for using, deploying, and developing with Apache Hadoop.

On Fri, Feb 17, 2012 at 2:30 PM, vikas jain mailvkj...@gmail.com wrote:

Hi all, I am looking for Java examples for Hadoop. I have done a lot of searching but have only found word count. Are there any other examples?

--
Harsh J
Customer Ops. Engineer
Cloudera | http://tiny.cloudera.com/about
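For a concrete taste of something beyond word count, here is a small self-contained sketch of a "max value per key" job over tab-separated key/number lines, written against the org.apache.hadoop.mapreduce API. The class name and input layout are invented for illustration; this is not taken from the Hadoop examples tree.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxValuePerKey {
    // Emits (key, number) for each well-formed "key<TAB>number" line.
    public static class MaxMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split("\t");
            if (parts.length == 2) {
                ctx.write(new Text(parts[0]), new LongWritable(Long.parseLong(parts[1])));
            }
        }
    }

    // Keeps the maximum value seen for each key; also reused as a combiner.
    public static class MaxReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
                throws IOException, InterruptedException {
            long max = Long.MIN_VALUE;
            for (LongWritable v : values) {
                max = Math.max(max, v.get());
            }
            ctx.write(key, new LongWritable(max));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "max value per key");
        job.setJarByClass(MaxValuePerKey.class);
        job.setMapperClass(MaxMapper.class);
        job.setCombinerClass(MaxReducer.class); // max is associative, so combining is safe
        job.setReducerClass(MaxReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

You would run it the same way as word count: bin/hadoop jar yourjob.jar MaxValuePerKey input output.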
Building Hadoop UI
Hello everyone, in order to provide our clients with a custom UI for their MapReduce jobs and HDFS files, what is the best solution for creating a web-based UI for Hadoop? We are not going to use Cloudera Hue; we need something more user-friendly and shaped to our clients' needs.

Thanks,
Fabio Pitzolu
Re: Building Hadoop UI
You could use something like Spring, but you'll need to figure out ways to connect and integrate, and it'll be a homegrown solution.

Peter Jamack

On 2/17/12 5:52 AM, fabio.pitz...@gmail.com wrote:

Hello everyone, in order to provide our clients with a custom UI for their MapReduce jobs and HDFS files, what is the best solution for creating a web-based UI for Hadoop? We are not going to use Cloudera Hue; we need something more user-friendly and shaped to our clients' needs. Thanks, Fabio Pitzolu
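Whatever web framework you pick, the backend will talk to the cluster through Hadoop's Java client API. A minimal sketch of an HDFS directory listing such a UI could build on; the NameNode address is a placeholder for your own cluster:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsListing {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode:9000"); // placeholder address
        FileSystem fs = FileSystem.get(conf);
        // List the directory given on the command line, or the HDFS root.
        for (FileStatus status : fs.listStatus(new Path(args.length > 0 ? args[0] : "/"))) {
            System.out.printf("%s\t%d bytes%n", status.getPath(), status.getLen());
        }
        fs.close();
    }
}
```

The same FileSystem handle covers uploads, downloads, and deletes, so one thin service class can back most of the file-browser part of a custom UI.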
Re: Building Hadoop UI
Thanks Peter, I'll give Spring a try, also because there is a .NET version of the framework, which is basically what my company is looking for.

Fabio Pitzolu
fabio.pitz...@gmail.com

On 17 Feb 2012, at 15:31, Jamack, Peter wrote:

You could use something like Spring, but you'll need to figure out ways to connect and integrate, and it'll be a homegrown solution. Peter Jamack [...]
Re: Building Hadoop UI
In our company we have integrated Liferay, an open-source Java portal, with HBase using its Java API, by creating an extension to Liferay's Document Library portlet to show content. We index content separately on a Solr server, where you can search, and the content itself (images, docs, etc.) is pulled from HBase/Hadoop.

Regards,
Neil

Sent on my BlackBerry® from Vodafone

-----Original Message-----
From: fabio.pitz...@gmail.com
Date: Fri, 17 Feb 2012 17:02:57
To: common-user@hadoop.apache.org
Reply-To: common-user@hadoop.apache.org
Subject: Re: Building Hadoop UI

Thanks Peter, I'll give Spring a try, also because there is a .NET version of the framework, which is basically what my company is looking for. Fabio Pitzolu [...]
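As a rough illustration of the read path Neil describes (pulling stored content out of HBase by row key with the Java API of that era), here is a minimal sketch; the table, column family, and qualifier names are invented for the example:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class FetchDocument {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "documents");             // hypothetical table
        // Fetch one row by the key passed on the command line.
        Result result = table.get(new Get(Bytes.toBytes(args[0])));
        byte[] content = result.getValue(Bytes.toBytes("content"), // hypothetical family
                                         Bytes.toBytes("data"));   // hypothetical qualifier
        System.out.println(content == null ? "not found" : content.length + " bytes");
        table.close();
    }
}
```

In the architecture Neil outlines, Solr answers the search query and returns row keys, and a lookup like this fetches the actual document bytes from HBase.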
Re: Building Hadoop UI
Thanks Neil, I'll get some documentation on that. Could we maybe get in touch so you can explain that integration to me in a little more detail?

Fabio

2012/2/17 neil.harw...@gmail.com:

In our company we have integrated Liferay, an open-source Java portal, with HBase using its Java API, by creating an extension to Liferay's Document Library portlet to show content. We index content separately on a Solr server, where you can search, and the content itself (images, docs, etc.) is pulled from HBase/Hadoop. Regards, Neil [...]
Re: Hadoop Example in Java
On Fri, Feb 17, 2012 at 1:00 AM, vikas jain mailvkj...@gmail.com wrote:

Hi all, I am looking for Java examples for Hadoop. I have done a lot of searching but have only found word count. Are there any other examples?

If you want to find them on the web, you can look in Subversion: http://svn.apache.org/repos/asf/hadoop/common/branches/branch-1/src/examples/org/apache/hadoop/examples/

-- Owen
Re: location of HDFS directory on my Local system
Hi Sujit,

I'd recommend you look at this tutorial on interacting with HDFS: http://developer.yahoo.com/hadoop/tutorial/module2.html#interacting

When you create a folder '/foodir' in HDFS, that folder will be at the top level of HDFS. If you were to create a directory 'foodir' (without the '/'), that directory would go into your user's folder on HDFS: '/user/username/foodir'.

You can view HDFS directory listings via:

bin/hadoop dfs -lsr /

Rohit Bakhshi
www.hortonworks.com

On Friday, February 17, 2012 at 12:18 PM, Sujit Dhamale wrote:

Hi, I am new to Hadoop. I installed a Hadoop cluster on my Ubuntu machine and created one folder, foodir, using the command below: bin/hadoop dfs -mkdir /foodir. What will the location of the folder foodir be in my file system?
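The same absolute-versus-relative behavior can be seen from the Java API; a minimal sketch:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MkdirDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        fs.mkdirs(new Path("/foodir")); // absolute: created at the HDFS root
        fs.mkdirs(new Path("foodir"));  // relative: resolves to /user/<username>/foodir
        fs.close();
    }
}
```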
Re: location of HDFS directory on my Local system
To add to Rohit's great response on HDFS interaction, and to clear up your confusion here: HDFS does not exactly make itself available as a physical filesystem like, say, ext3/4 mount points. The usual way of interacting with it is by communicating with the services you run, via RPC calls (over a network).

That said, you can still use the FUSE-DFS connector from HDFS to mount HDFS as a local file system, and run most of your regular *nix programs (ls, cat, etc.) on its files after it has been mounted. To mount HDFS via the fuse-dfs driver, follow http://wiki.apache.org/hadoop/MountableHDFS

On Sat, Feb 18, 2012 at 2:06 AM, Rohit ro...@hortonworks.com wrote:

Hi Sujit, I'd recommend you look at this tutorial on interacting with HDFS: http://developer.yahoo.com/hadoop/tutorial/module2.html#interacting [...]

--
Harsh J
Customer Ops. Engineer
Cloudera | http://tiny.cloudera.com/about
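Most shell operations have a short client-API equivalent that goes over that same RPC path. For instance, a minimal 'cat' sketch that streams an HDFS file without any local mount:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CatOverRpc {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // fs.open() talks to the NameNode for block locations, then streams
        // the data from DataNodes; no local filesystem mount is involved.
        BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(new Path(args[0]))));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line); // the *nix 'cat' equivalent, over RPC
        }
        in.close();
        fs.close();
    }
}
```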
Re: Addendum to Hypertable vs. HBase Performance Test (w/ mslab enabled)
I would almost agree with that perspective. But there is a problem with the 'Java is slow' theory: in a 100 percent write workload, GC might be a factor, but in the real world people have to read data, and reads become disk-bound as your data gets larger than memory. Unless C++ can make your disk spin faster than Java, it is a wash. Making the claim that you're going to need more servers for Java/HBase is bogus. To put it in perspective: if the workload is 5% write and 95% read, you are probably going to need just the same amount of hardware. You might get some win on the read side because your custom caching could be more efficient in terms of object size in memory and other GC issues, but it is not two or three to one.

If a million writes fall into a Hypertable forest but it takes a billion years to read them back, did the writes ever sync? :)

On Monday, February 13, 2012, Doug Judd d...@hypertable.com wrote:

Hey Todd, Bulk loading isn't always an option when data is streaming in from a live application. Many big-data use cases involve massive amounts of smaller items in the size range of 10-100 bytes, for example URLs, sensor readings, genome sequence reads, network traffic logs, etc. If HBase requires 2-3 times the amount of hardware to avoid *concurrent mode failures*, then that makes HBase 2-3 times more expensive from the standpoint of hardware, power consumption, and datacenter real estate. What takes the most time is getting the core database mechanics right (we're going on 5 years now). Once the core database is stable, integrations with applications such as Solr and others are short-term projects. I believe that sooner or later, most engineers working in this space will come to the conclusion that Java is the wrong language for this kind of database application. At that point, folks on the HBase project will realize that they are five years behind. - Doug

On Mon, Feb 13, 2012 at 11:33 AM, Todd Lipcon t...@cloudera.com wrote:

Hey Doug, Want to also run a comparison test with inter-cluster replication turned on? How about Kerberos-based security on secure HDFS? How about ACLs or other table permissions even without strong authentication? Can you run a test comparing performance running on top of Hadoop 0.23? How about running other ecosystem products like Solbase, Havrobase, and Lily, or commercial products like Digital Reasoning's Synthesys, etc.? For those unfamiliar, the answer to all of the above is that those comparisons can't be run because Hypertable is years behind HBase in terms of features, adoption, etc. They've found a set of benchmarks they win at, but bulk loading either database through the put API is the wrong way to go about it anyway. Anyone loading 5 TB of data like this would use the bulk-load APIs, which are one to two orders of magnitude more efficient. Just ask the Yahoo crawl cache team, who have ~1 PB stored in HBase, or Facebook, or eBay, or many others who store hundreds to thousands of TBs in HBase today. Thanks, -Todd

On Mon, Feb 13, 2012 at 9:07 AM, Doug Judd d...@hypertable.com wrote:

In our original test, we mistakenly ran the HBase test with the hbase.hregion.memstore.mslab.enabled property set to false. We re-ran the test with the property set to true and have reported the results in the following addendum: Addendum to Hypertable vs. HBase Performance Test http://www.hypertable.com/why_hypertable/hypertable_vs_hbase_2/addendum/ Synopsis: it slowed performance on the 10KB and 1KB tests and still failed the 100-byte and 10-byte tests with *concurrent mode failure*. - Doug

--
Todd Lipcon
Software Engineer, Cloudera
Re: Addendum to Hypertable vs. HBase Performance Test (w/ mslab enabled)
Hi Edward,

The problem is that even if the workload is 5% write and 95% read, if you can't load the data, you need more machines. In the 167-billion-insert test, HBase failed with *concurrent mode failure* after 20% of the data was loaded. One of our customers has loaded half a trillion records of historical financial market data on 16 machines. If you do the back-of-the-envelope calculation, it would take about 180 machines for HBase to load half a trillion cells. That makes HBase 10X more expensive in terms of hardware, power consumption, and data center real estate.

- Doug

On Fri, Feb 17, 2012 at 3:58 PM, Edward Capriolo edlinuxg...@gmail.com wrote:

I would almost agree with that perspective. But there is a problem with the 'Java is slow' theory: in a 100 percent write workload, GC might be a factor, but in the real world people have to read data, and reads become disk-bound as your data gets larger than memory. [...]
Re: Addendum to Hypertable vs. HBase Performance Test (w/ mslab enabled)
As your numbers show:

Dataset Size   Hypertable Queries/s   HBase Queries/s   Hypertable Latency (ms)   HBase Latency (ms)
0.5 TB         3256.42                2969.52           157.221                   172.351
5 TB           2450.01                2066.52           208.972                   247.680

Raw data size goes up, read performance goes down, latency goes up.

You mentioned you loaded half a trillion records of historical financial data. The operative word is historical. You're not doing half a trillion writes every day. Most of the systems that use structured log formats can write very fast (I am guessing that is what Hypertable uses, by the way). dd writes very fast as well, but if you want acceptable read latency you are going to need a good RAM/disk ratio. Even at 0.5 TB, 157.221 ms is not a great read latency, so your ability to write fast has already outstripped your ability to read at a rate that could support, say, a web application. (I come from a world of 1-5 ms latency, by the way.) What application can you support with numbers like that? An email compliance system where you want to store a ton of data but only plan on doing one search a day to make an auditor happy? :)

This is why I say you're going to end up needing about the same number of nodes: when it comes time to read this data, a machine with 4 TB of data and 24 GB of RAM is not going to cut it.

You are right on a couple of fronts: 1) being able to load data fast is good (can't argue with that); 2) if HBase can't load X entries, that is bad. I really can't imagine that HBase blows up and just stops accepting inserts at some point. You seem to say it's happening, and I don't have time to verify. But if you are at the point where you are getting 175 ms random and 85 ms Zipfian latency, what are you proving? That is already more data than a server can handle.

From http://en.wikipedia.org/wiki/Network_performance: "Users browsing the Internet feel that responses are instant when delays are less than 100 ms from click to response. Latency and throughput together affect the perceived speed of a connection. However, the perceived performance of a connection can still vary widely, depending in part on the type of information transmitted and how it is used."

On Fri, Feb 17, 2012 at 7:25 PM, Doug Judd d...@hypertable.com wrote:

Hi Edward, The problem is that even if the workload is 5% write and 95% read, if you can't load the data, you need more machines. [...]
Re: Processing small xml files
On Tue, Feb 14, 2012 at 10:56 AM, W.P. McNeill bill...@gmail.com wrote:

I'm not sure what you mean by flat format here. In my scenario, I have a file input.xml that looks like this:

<myfile>
  <section>
    <value>value1</value>
  </section>
  <section>
    <value>value2</value>
  </section>
</myfile>

input.xml is a plain text file, not a sequence file. If I read it with the XMLInputFormat, my mapper gets called with (key, value) pairs that look like this:

(<offset>, <section><value>value1</value></section>)
(<offset>, <section><value>value2</value></section>)

where the keys are numerical offsets into the file. I then use this information to write a sequence file with these (key, value) pairs. So my Hadoop job that uses XMLInputFormat takes a text file as input and produces a sequence file as output. I don't know a rule of thumb for how many small files is too many; maybe someone else on the list can chime in. I just know that when your throughput gets slow, that's one possible cause to investigate.

I need to install Hadoop. Does this XML input format come as part of the install? Can you please give me some pointers to help me install Hadoop, and XmlInputFormat if necessary?
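XmlInputFormat is not part of the stock Hadoop install; copies of it circulate, for example with Apache Mahout, and you would ship that jar alongside your job. Assuming such a copy is on the classpath, the text-XML-in, sequence-file-out job described above might look roughly like this sketch (the import path and the xmlinput.start/xmlinput.end keys follow the Mahout copy's conventions; adjust them to wherever your copy lives):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.mahout.classifier.bayes.XmlInputFormat; // third-party class, not stock Hadoop

public class XmlToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Tell the input format which tags delimit one record.
        conf.set("xmlinput.start", "<section>");
        conf.set("xmlinput.end", "</section>");
        Job job = new Job(conf, "xml to sequence file");
        job.setJarByClass(XmlToSequenceFile.class);
        job.setInputFormatClass(XmlInputFormat.class);
        // No mapper/reducer set: the identity defaults pass the
        // (byte offset, xml fragment) pairs straight through.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```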
Re: Processing small xml files
Hi Mohit,

You can use Pig for processing XML files. PiggyBank has a built-in load function for loading XML files. You can also set pig.maxCombinedSplitSize and pig.splitCombination for efficient processing.

On Sat, Feb 18, 2012 at 1:18 AM, Mohit Anchlia mohitanch...@gmail.com wrote:

[...] I need to install Hadoop. Does this XML input format come as part of the install? Can you please give me some pointers to help me install Hadoop, and XmlInputFormat if necessary?

--
Srinivas
srini...@cloudwick.com