I also added Peter's comment to the JIRA I logged: https://issues.apache.org/jira/browse/HADOOP-6868
On Tue, Jul 20, 2010 at 9:38 AM, Ted Yu <[email protected]> wrote:

So the correct call should be:

    String valueString = new String(valueText.getBytes(), 0,
            valueText.getLength(), "UTF-8");

Cheers

On Tue, Jul 20, 2010 at 9:23 AM, Jeff Bean <[email protected]> wrote:

data.length is the length of the byte array.

Text.getLength() most likely returns a different value than
getBytes().length.

Hadoop reuses box class objects like Text, so what it's probably doing is
writing over the byte array, lengthening it as necessary, and just updating
a separate length attribute.

Jeff
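Jeff's theory can be demonstrated directly. Below is a minimal sketch (mine,
not from the thread; it assumes only org.apache.hadoop.io.Text on the
classpath): when a reused Text is set to a shorter value, the backing array
keeps stale bytes past getLength(), so decoding the whole array, as the
original mapper does, leaks the tail of the previous record, while Ted's
corrected call does not.

    import org.apache.hadoop.io.Text;

    public class TextReuseSketch {
        public static void main(String[] args) throws Exception {
            Text t = new Text();
            t.set("<rate></PrivateRateSet>");  // a longer record arrives first
            t.set("</PrivateRateSet>");        // a shorter one reuses the backing array

            // Decoding the whole backing array picks up stale trailing bytes:
            String bad = new String(t.getBytes(), "UTF-8");
            // Decoding only the valid prefix is correct:
            String good = new String(t.getBytes(), 0, t.getLength(), "UTF-8");

            System.out.println(bad);   // may print "</PrivateRateSet>teSet>" -- munged
            System.out.println(good);  // prints "</PrivateRateSet>"
        }
    }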
On Tue, Jul 20, 2010 at 8:56 AM, Ted Yu <[email protected]> wrote:

Interesting. The String class is able to handle this scenario:

    public String(byte[] data, String encoding) throws
            UnsupportedEncodingException {
        this(data, 0, data.length, encoding);
    }

On Tue, Jul 20, 2010 at 6:01 AM, Jeff Bean <[email protected]> wrote:

I think the problem is here:

    String valueString = new String(valueText.getBytes(), "UTF-8");

The Javadoc for Text
(http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/Text.html#getBytes%28%29)
says:

    getBytes() - Returns the raw bytes; however, only data up to getLength() is valid.

So try getting the length, truncating the byte array at the value returned
by getLength(), and THEN converting it to a String.

Jeff
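Jeff's truncate-then-convert suggestion as a sketch (the helper name is
mine): copy the valid region out of the backing array, then decode the copy.
Note that Text.toString() already does the equivalent internally, and Ted's
one-liner above avoids the intermediate copy, so either of those is
preferable in a hot loop.

    import java.io.UnsupportedEncodingException;
    import java.util.Arrays;

    import org.apache.hadoop.io.Text;

    public final class TextUtil {
        // Hypothetical helper: truncate to the valid region, then decode.
        public static String toStringSafe(Text t) throws UnsupportedEncodingException {
            byte[] valid = Arrays.copyOf(t.getBytes(), t.getLength());
            return new String(valid, "UTF-8");
        }
    }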
On Mon, Jul 19, 2010 at 9:08 AM, Ted Yu <[email protected]> wrote:

For your initial question on Text.set():
Text.setCapacity() allocates a new byte array. Since keepData is false, the
old data wouldn't be copied over.

On Mon, Jul 19, 2010 at 8:01 AM, Peter Minearo <[email protected]> wrote:

I am already using XmlInputFormat. The input into the Map phase is not the
problem. The problem lies between the Map and Reduce phases.

BTW - the article is correct. DO NOT USE StreamXmlRecordReader.
XmlInputFormat is a lot faster. From my testing, StreamXmlRecordReader took
8 minutes to read a 1 GB XML document, whereas XmlInputFormat was under 2
minutes (using 2-core, 8 GB machines).

-----Original Message-----
From: Ted Yu [mailto:[email protected]]
Sent: Friday, July 16, 2010 9:44 PM
To: [email protected]
Subject: Re: Hadoop and XML

From an earlier post:
http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html

On Fri, Jul 16, 2010 at 3:07 PM, Peter Minearo <[email protected]> wrote:

Moving the variable to a local variable did not seem to work:

    </PrivateRateSet>vateRateSet>

    public void map(Object key, Object value, OutputCollector output,
                    Reporter reporter) throws IOException {
        Text valueText = (Text) value;
        String valueString = new String(valueText.getBytes(), "UTF-8");
        String keyString = getXmlKey(valueString);
        Text returnKeyText = new Text();
        Text returnValueText = new Text();
        returnKeyText.set(keyString);
        returnValueText.set(valueString);
        output.collect(returnKeyText, returnValueText);
    }
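With hindsight from the top of the thread, the local-variable change could
not have helped: the corruption enters at the decode line, which reads the
entire reused backing array, not at the output-object allocation. Peter's
method with only the decode changed to Ted's form would be (a sketch;
getXmlKey elided as in the original):

    public void map(Object key, Object value, OutputCollector output,
                    Reporter reporter) throws IOException {
        Text valueText = (Text) value;
        // Decode only the valid bytes; the backing array may still hold
        // the tail of a previous, longer record.
        String valueString = new String(valueText.getBytes(), 0,
                valueText.getLength(), "UTF-8");
        String keyString = getXmlKey(valueString);
        Text returnKeyText = new Text();
        Text returnValueText = new Text();
        returnKeyText.set(keyString);
        returnValueText.set(valueString);
        output.collect(returnKeyText, returnValueText);
    }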
-----Original Message-----
From: Peter Minearo [mailto:[email protected]]
Sent: Fri 7/16/2010 2:51 PM
To: [email protected]
Subject: RE: Hadoop and XML

Whoops... right after I sent it, someone else made a suggestion and I
realized what question 2 was about. I can try that, but wouldn't that cause
object bloat? During the Hadoop training I went through, it was mentioned
to reuse the returning Key and Value objects to keep the number of objects
created down to a minimum. Is this not really a valid point?

-----Original Message-----
From: Peter Minearo [mailto:[email protected]]
Sent: Friday, July 16, 2010 2:44 PM
To: [email protected]
Subject: RE: Hadoop and XML

I am not using multi-threaded Map tasks. Also, if I understand your second
question correctly: "Also can you try creating the output key and values in
the map method (method local)?" In the first code snippet I am doing
exactly that.

Below is the class that runs the Job.

    import java.io.IOException;

    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextOutputFormat;

    public class HadoopJobClient {

        private static final Log LOGGER =
                LogFactory.getLog(HadoopJobClient.class.getName());

        public static void main(String[] args) {
            JobConf conf = new JobConf(HadoopJobClient.class);

            conf.set("xmlinput.start", "<PrivateRateSet>");
            conf.set("xmlinput.end", "</PrivateRateSet>");

            conf.setJobName("PRDS Parse");

            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(Text.class);

            conf.setMapperClass(PrdsMapper.class);
            conf.setReducerClass(PrdsReducer.class);

            conf.setInputFormat(XmlInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            // Run the job
            try {
                JobClient.runJob(conf);
            } catch (IOException e) {
                LOGGER.error(e.getMessage(), e);
            }
        }
    }

-----Original Message-----
From: Soumya Banerjee [mailto:[email protected]]
Sent: Fri 7/16/2010 2:29 PM
To: [email protected]
Subject: Re: Hadoop and XML

Hi,

Can you please share the code of the job submission client?

Also, can you try creating the output key and values in the map method
(method local)? Make sure you are not using a multi-threaded map task
configuration.

    map() {
        Text keyText = new Text();
        Text valueText = new Text();

        // rest of the code
    }

Soumya.
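On Peter's object-bloat question above: reusing your own output Text objects
across map() calls is safe, because collect() serializes the pair at call
time; what is unsafe is decoding the framework's input object without
respecting getLength(). A sketch combining both points (class name and key
extraction are mine, for illustration):

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class ReuseSafeXmlMapper extends MapReduceBase
            implements Mapper<Object, Text, Text, Text> {

        // Reused across calls to keep object churn down; safe because
        // collect() serializes the pair before the next map() call mutates them.
        private final Text outKey = new Text();
        private final Text outValue = new Text();

        public void map(Object key, Text value, OutputCollector<Text, Text> output,
                        Reporter reporter) throws IOException {
            // Decode only the valid region of the reused input buffer.
            String xml = new String(value.getBytes(), 0, value.getLength(), "UTF-8");
            outKey.set(getXmlKey(xml));
            outValue.set(xml);
            output.collect(outKey, outValue);
        }

        private String getXmlKey(String xml) {
            return xml; // placeholder -- real key extraction elided, as in Peter's code
        }
    }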
On Sat, Jul 17, 2010 at 2:30 AM, Peter Minearo <[email protected]> wrote:

I have an XML file that has sparse data in it. I am running a MapReduce job
that reads in an XML file, pulls out a Key from within the XML snippet, and
then hands back the Key and the XML snippet (as the Value) to the
OutputCollector. The reason is to sort the file back into order. Below is
the snippet of code.

    public class XmlMapper extends MapReduceBase implements Mapper {

        private Text keyText = new Text();
        private Text valueText = new Text();

        @SuppressWarnings("unchecked")
        public void map(Object key, Object value, OutputCollector output,
                        Reporter reporter) throws IOException {
            Text valueText = (Text) value;
            String valueString = new String(valueText.getBytes(), "UTF-8");
            String keyString = getXmlKey(valueString);
            getKeyText().set(keyString);
            getValueText().set(valueString);
            output.collect(getKeyText(), getValueText());
        }

        public Text getKeyText() {
            return keyText;
        }

        public void setKeyText(Text keyText) {
            this.keyText = keyText;
        }

        public Text getValueText() {
            return valueText;
        }

        public void setValueText(Text valueText) {
            this.valueText = valueText;
        }

        private String getXmlKey(String value) {
            // Get the Key from the XML in the value.
        }
    }

The XML snippet from the Value is fine when it is passed into the map()
method. I am not changing any data either, just pulling out information for
the key. The problem I am seeing is that between the Map phase and the
Reduce phase, the XML is getting munged. For example:

    </PrivateRate>
    </PrivateRateSet>te>

It is my understanding that Hadoop uses the same instance of the Key and
Value object when calling the map method; what changes is the data within
those instances. So I ran an experiment where I do not have different Key
or Value Text objects. I reuse the ones passed into the method, like below:

    public class XmlMapper extends MapReduceBase implements Mapper {

        @SuppressWarnings("unchecked")
        public void map(Object key, Object value, OutputCollector output,
                        Reporter reporter) throws IOException {
            Text keyText = (Text) key;
            Text valueText = (Text) value;
            String valueString = new String(valueText.getBytes(), "UTF-8");
            String keyString = getXmlKey(valueString);
            keyText.set(keyString);
            valueText.set(valueString);
            output.collect(keyText, valueText);
        }

        private String getXmlKey(String value) {
            // Get the Key from the XML in the value.
        }
    }

What was interesting about this is the fact that the XML was getting munged
within the Map phase. When I changed over to the code at the top, the Map
phase was fine; however, the Reduce phase picks up the munged XML.

Trying to debug the problem, I came across this method in the Text class:

    public void set(byte[] utf8, int start, int len) {
        setCapacity(len, false);
        System.arraycopy(utf8, start, bytes, 0, len);
        this.length = len;
    }

If the "bytes" array had a length of 1000 and the "utf8" array has a length
of 500, doing a System.arraycopy() would only copy the first 500 bytes from
"utf8" to "bytes" but leave the last 500 in "bytes" alone. Could this be
the cause of the XML munging?

All of this leads me to a few questions:

1) Has anyone successfully used XML snippets as the data format within a
MapReduce job; not just reading from the file, but used during the shuffle?
2) Is anyone seeing this problem with XML or any other format?
3) Does anyone know what is going on?
4) Is this a bug?

Thanks,

Peter
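Peter's reading of set() is right, and this exchange is what the JIRA linked
at the top of the thread came out of. A quick check (a sketch; exact
capacities may vary by Hadoop version, since setCapacity may over-allocate):

    import org.apache.hadoop.io.Text;

    public class CapacityCheck {
        public static void main(String[] args) {
            Text t = new Text();
            t.set(new byte[1000], 0, 1000);  // grows the backing array to >= 1000 bytes
            t.set(new byte[500], 0, 500);    // copies 500 bytes; capacity is not shrunk

            System.out.println(t.getLength());        // 500 -- the only valid region
            System.out.println(t.getBytes().length);  // >= 1000 -- stale bytes remain
        }
    }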
