I am already using XmlInputFormat; the input into the Map phase is not the problem. The problem lies between the Map and Reduce phases.
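If the Text.set() analysis at the bottom of this thread is right, the fix is simply to bound the byte-to-String conversion: Text.getBytes() returns the reused backing array, and only the first getLength() bytes are valid for the current record. A minimal sketch of the safe conversions (standard org.apache.hadoop.io.Text API; valueText stands for the reused Value instance handed to map()):

    // getBytes() returns Text's reused backing array; only the first
    // getLength() bytes belong to the current record.
    String munged = new String(valueText.getBytes(), "UTF-8");  // may include a stale tail
    String safe   = new String(valueText.getBytes(), 0, valueText.getLength(), "UTF-8");
    String same   = valueText.toString();  // also decodes only the valid range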
BTW - The article is correct: DO NOT USE StreamXmlRecordReader. XmlInputFormat is a lot faster. In my testing, StreamXmlRecordReader took 8 minutes to read a 1 GB XML document, whereas XmlInputFormat took under 2 minutes (on 2-core, 8 GB machines).

-----Original Message-----
From: Ted Yu [mailto:[email protected]]
Sent: Friday, July 16, 2010 9:44 PM
To: [email protected]
Subject: Re: Hadoop and XML

From an earlier post: http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html

On Fri, Jul 16, 2010 at 3:07 PM, Peter Minearo <[email protected]> wrote:

> Moving the variable to a local variable did not seem to work; the output is still munged:
>
> </PrivateRateSet>vateRateSet>
>
> public void map(Object key, Object value, OutputCollector output,
>                 Reporter reporter) throws IOException {
>     Text valueText = (Text) value;
>     String valueString = new String(valueText.getBytes(), "UTF-8");
>     String keyString = getXmlKey(valueString);
>     Text returnKeyText = new Text();
>     Text returnValueText = new Text();
>     returnKeyText.set(keyString);
>     returnValueText.set(valueString);
>     output.collect(returnKeyText, returnValueText);
> }
>
> -----Original Message-----
> From: Peter Minearo [mailto:[email protected]]
> Sent: Fri 7/16/2010 2:51 PM
> To: [email protected]
> Subject: RE: Hadoop and XML
>
> Whoops... right after I sent it, someone else made a suggestion and I realized what question 2 was about. I can try that, but wouldn't that cause object bloat? During the Hadoop training I went through, it was mentioned that you should reuse the returned Key and Value objects to keep the number of objects created to a minimum. Is that not really a valid point?
>
> -----Original Message-----
> From: Peter Minearo [mailto:[email protected]]
> Sent: Friday, July 16, 2010 2:44 PM
> To: [email protected]
> Subject: RE: Hadoop and XML
>
> I am not using multi-threaded Map tasks. Also, if I understand your second question correctly:
>
> "Also can you try creating the output key and values in the map method (method local)?"
>
> In the first code snippet I am doing exactly that.
>
> Below is the class that runs the job:
>
> public class HadoopJobClient {
>
>     private static final Log LOGGER = LogFactory.getLog(Prds.class.getName());
>
>     public static void main(String[] args) {
>         JobConf conf = new JobConf(Prds.class);
>
>         conf.set("xmlinput.start", "<PrivateRateSet>");
>         conf.set("xmlinput.end", "</PrivateRateSet>");
>
>         conf.setJobName("PRDS Parse");
>
>         conf.setOutputKeyClass(Text.class);
>         conf.setOutputValueClass(Text.class);
>
>         conf.setMapperClass(PrdsMapper.class);
>         conf.setReducerClass(PrdsReducer.class);
>
>         conf.setInputFormat(XmlInputFormat.class);
>         conf.setOutputFormat(TextOutputFormat.class);
>
>         FileInputFormat.setInputPaths(conf, new Path(args[0]));
>         FileOutputFormat.setOutputPath(conf, new Path(args[1]));
>
>         // Run the job
>         try {
>             JobClient.runJob(conf);
>         } catch (IOException e) {
>             LOGGER.error(e.getMessage(), e);
>         }
>     }
> }
>
> -----Original Message-----
> From: Soumya Banerjee [mailto:[email protected]]
> Sent: Fri 7/16/2010 2:29 PM
> To: [email protected]
> Subject: Re: Hadoop and XML
>
> Hi,
>
> Can you please share the code of the job submission client?
>
> Also, can you try creating the output key and value objects in the map method (method local)? Make sure you are not using a multi-threaded map task configuration.
>
> map() {
>     Text keyText = new Text();
>     Text valueText = new Text();
>
>     // rest of the code
> }
>
> Soumya.
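For reference, a runnable version of Soumya's suggestion might look like the sketch below, using the old org.apache.hadoop.mapred API that the rest of this thread uses; the getXmlKey() body and the <Key> element it looks for are made-up placeholders. The important detail is bounding the byte-to-String conversion by getLength(), because the Text instance handed to map() is reused between records:

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class XmlMapper extends MapReduceBase implements Mapper {

        @SuppressWarnings("unchecked")
        public void map(Object key, Object value, OutputCollector output,
                        Reporter reporter) throws IOException {
            Text valueText = (Text) value;
            // Only the first getLength() bytes are valid; getBytes() returns
            // the reused backing array, which may hold stale data from a
            // previous, longer record.
            String valueString =
                new String(valueText.getBytes(), 0, valueText.getLength(), "UTF-8");
            String keyString = getXmlKey(valueString);
            // Method-local output objects, per Soumya's suggestion; collect()
            // serializes them immediately, so per-call allocation is safe here.
            output.collect(new Text(keyString), new Text(valueString));
        }

        // Hypothetical stand-in for the key extraction; real code would parse
        // the XML snippet. The <Key> element is assumed, not from the thread.
        private String getXmlKey(String value) {
            int start = value.indexOf("<Key>") + "<Key>".length();
            return value.substring(start, value.indexOf("</Key>"));
        }
    }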
> > On Sat, Jul 17, 2010 at 2:30 AM, Peter Minearo <[email protected]> wrote:
> >
> > I have an XML file that has sparse data in it. I am running a MapReduce job that reads in an XML file, pulls a Key out of each XML snippet, and then hands the Key and the XML snippet (as the Value) back to the OutputCollector. The reason is to sort the file back into order. Below is the snippet of code:
> >
> > public class XmlMapper extends MapReduceBase implements Mapper {
> >
> >     private Text keyText = new Text();
> >     private Text valueText = new Text();
> >
> >     @SuppressWarnings("unchecked")
> >     public void map(Object key, Object value, OutputCollector output,
> >                     Reporter reporter) throws IOException {
> >         Text valueText = (Text) value;
> >         String valueString = new String(valueText.getBytes(), "UTF-8");
> >         String keyString = getXmlKey(valueString);
> >         getKeyText().set(keyString);
> >         getValueText().set(valueString);
> >         output.collect(getKeyText(), getValueText());
> >     }
> >
> >     public Text getKeyText() {
> >         return keyText;
> >     }
> >
> >     public void setKeyText(Text keyText) {
> >         this.keyText = keyText;
> >     }
> >
> >     public Text getValueText() {
> >         return valueText;
> >     }
> >
> >     public void setValueText(Text valueText) {
> >         this.valueText = valueText;
> >     }
> >
> >     private String getXmlKey(String value) {
> >         // Get the Key from the XML in the value.
> >     }
> > }
> >
> > The XML snippet from the Value is fine when it is passed into the map() method. I am not changing any data either, just pulling out information for the key. The problem I am seeing is that between the Map phase and the Reduce phase the XML is getting munged. For example:
> >
> > </PrivateRate>
> > </PrivateRateSet>te>
> >
> > It is my understanding that Hadoop uses the same instances of the Key and Value objects when calling the map() method; what changes is the data within those instances. So I ran an experiment where I do not create separate Key or Value Text objects but reuse the ones passed into the method, like below:
> >
> > public class XmlMapper extends MapReduceBase implements Mapper {
> >
> >     @SuppressWarnings("unchecked")
> >     public void map(Object key, Object value, OutputCollector output,
> >                     Reporter reporter) throws IOException {
> >         Text keyText = (Text) key;
> >         Text valueText = (Text) value;
> >         String valueString = new String(valueText.getBytes(), "UTF-8");
> >         String keyString = getXmlKey(valueString);
> >         keyText.set(keyString);
> >         valueText.set(valueString);
> >         output.collect(keyText, valueText);
> >     }
> >
> >     private String getXmlKey(String value) {
> >         // Get the Key from the XML in the value.
> >     }
> > }
> >
> > What was interesting is that with this version the XML was getting munged within the Map phase. When I changed over to the code at the top, the Map phase was fine; however, the Reduce phase picks up the munged XML. Trying to debug the problem, I came across this method in the Text object:
> >
> > public void set(byte[] utf8, int start, int len) {
> >     setCapacity(len, false);
> >     System.arraycopy(utf8, start, bytes, 0, len);
> >     this.length = len;
> > }
> >
> > If the "bytes" array had a length of 1000 and the "utf8" array has a length of 500, doing a System.arraycopy() would only copy the first 500 bytes from "utf8" into "bytes" but leave the last 500 bytes of "bytes" alone. Could this be the cause of the XML munging? (See the demonstration after this message.)
> > All of this leads me to a few questions:
> >
> > 1) Has anyone successfully used XML snippets as the data format within a MapReduce job; not just reading from the file, but also during the shuffle?
> > 2) Is anyone seeing this problem with XML or any other format?
> > 3) Does anyone know what is going on?
> > 4) Is this a bug?
> >
> > Thanks,
> >
> > Peter
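On questions 3 and 4: the set() method quoted above is the likely explanation, and it is documented Text behavior rather than a bug; getBytes() exposes the whole reused backing array, and only the first getLength() bytes belong to the current record. A small self-contained demonstration (the tag strings are hypothetical, chosen to mimic the munged output seen in this thread):

    import org.apache.hadoop.io.Text;

    public class StaleBytesDemo {
        public static void main(String[] args) throws Exception {
            Text t = new Text();
            byte[] longer  = "</PrivateRateSet>".getBytes("UTF-8");  // 17 bytes
            byte[] shorter = "</Private>".getBytes("UTF-8");         // 10 bytes

            t.set(longer, 0, longer.length);    // backing array now holds 17 bytes
            t.set(shorter, 0, shorter.length);  // only the first 10 bytes are overwritten

            // Reading the whole backing array picks up the stale tail of the
            // longer record, printing something like "</Private>ateSet>".
            System.out.println(new String(t.getBytes(), "UTF-8"));

            // Bounding the conversion by getLength() prints "</Private>".
            System.out.println(new String(t.getBytes(), 0, t.getLength(), "UTF-8"));
            System.out.println(t);  // toString() also decodes only the valid range
        }
    }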
