That is exactly what is happening. This is the code from the Text class:
public void set(String string) {
    try {
        ByteBuffer bb = encode(string, true);
        bytes = bb.array();
        length = bb.limit();
    } catch (CharacterCodingException e) {
        throw new RuntimeException("Should not have happened " + e.toString());
    }
}
This sounds like a bug.
Let's say you create a Text object and drop in a String that sets the byte
array length to 200, then drop in a second String that sets the byte array
length to 500. Since the new length is greater than the previous length, the
byte array is reallocated to the longer length. Now, if you drop in a third
String that would set the byte array length to 350, the Text object does not
replace the byte array with a new one of length 350; it keeps the larger
500-byte array and uses an extra variable to track the "real" length.
So: Text.getBytes().length != Text.getLength()
This does two things:
1. Passes around more data than is needed
2. Makes the Text object confusing to work with
Text.getBytes().length == Text.getLength() should be the correct behavior.
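
For example, a minimal sketch of the difference (this assumes Hadoop's
org.apache.hadoop.io.Text is on the classpath; the literal values are made up):

import org.apache.hadoop.io.Text;

public class TextLengthDemo {
    public static void main(String[] args) throws Exception {
        byte[] longValue = "<PrivateRateSet>...</PrivateRateSet>".getBytes("UTF-8");
        byte[] shortValue = "<PrivateRate/>".getBytes("UTF-8");

        Text t = new Text();
        t.set(longValue, 0, longValue.length);   // backing array grows to fit
        t.set(shortValue, 0, shortValue.length); // backing array is reused, not shrunk

        // Capacity vs. valid bytes: these two numbers no longer agree.
        System.out.println(t.getBytes().length + " vs " + t.getLength());

        // Unsafe: decodes stale bytes left over from the previous value.
        System.out.println(new String(t.getBytes(), "UTF-8"));
        // Safe: only the first getLength() bytes are valid.
        System.out.println(new String(t.getBytes(), 0, t.getLength(), "UTF-8"));
        // toString() applies the same truncation internally.
        System.out.println(t.toString());
    }
}
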
-----Original Message-----
From: Jeff Bean [mailto:[email protected]]
Sent: Tue 7/20/2010 9:23 AM
To: [email protected]
Subject: Re: Hadoop and XML
data.length is the length of the byte array.
Text.getLength() most likely returns a different value than getBytes().length.
Hadoop reuses box class objects like Text, so what it's probably doing is
writing over the byte array, lengthening it as necessary, and just updating
a separate length attribute.
Jeff
On Tue, Jul 20, 2010 at 8:56 AM, Ted Yu <[email protected]> wrote:
> Interesting.
> String class is able to handle this scenario:
>
> public String(byte[] data, String encoding) throws UnsupportedEncodingException {
>     this(data, 0, data.length, encoding);
> }
>
>
>
> On Tue, Jul 20, 2010 at 6:01 AM, Jeff Bean <[email protected]> wrote:
>
> > I think the problem is here:
> >
> > String valueString = new String(valueText.getBytes(), "UTF-8");
> >
> > Javadoc for Text says:
> >
> > getBytes()
> > <http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/Text.html#getBytes%28%29>
> > Returns the raw bytes; however, only data up to getLength()
> > <http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/Text.html#getLength%28%29>
> > is valid.
> >
> > So try getting the length, truncating the byte array at the value
> > returned by getLength(), and THEN converting it to a String.
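> >
> > For example (a sketch, assuming this runs inside the map() shown further
> > down the thread):
> >
> >     Text valueText = (Text) value;
> >     String valueString =
> >         new String(valueText.getBytes(), 0, valueText.getLength(), "UTF-8");
> >     // or simply: String valueString = valueText.toString();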
> >
> > Jeff
> >
> > On Mon, Jul 19, 2010 at 9:08 AM, Ted Yu <[email protected]> wrote:
> >
> > > For your initial question on Text.set().
> > > Text.setCapacity() allocates a new byte array. Since keepData is false,
> > > old data wouldn't be copied over.
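> > >
> > > A rough sketch of that behavior (paraphrased from memory, not the exact
> > > Hadoop source):
> > >
> > >     private void setCapacity(int len, boolean keepData) {
> > >         // a new array is allocated only when the current one is too small;
> > >         // otherwise the existing (possibly longer) array is reused as-is,
> > >         // so bytes past the new length keep their old contents
> > >         if (bytes == null || bytes.length < len) {
> > >             byte[] newBytes = new byte[len];
> > >             if (bytes != null && keepData) {
> > >                 System.arraycopy(bytes, 0, newBytes, 0, length);
> > >             }
> > >             bytes = newBytes;
> > >         }
> > >     }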
> > >
> > > On Mon, Jul 19, 2010 at 8:01 AM, Peter Minearo <
> > > [email protected]> wrote:
> > >
> > > > I am already using XmlInputFormat. The input into the Map phase is not
> > > > the problem. The problem lies between the Map and Reduce phases.
> > > >
> > > > BTW - The article is correct. DO NOT USE StreamXmlRecordReader.
> > > > XmlInputFormat is a lot faster. From my testing, StreamXmlRecordReader
> > > > took 8 minutes to read a 1 GB XML document, whereas XmlInputFormat was
> > > > under 2 minutes. (Using 2-core, 8GB machines)
> > > >
> > > >
> > > > -----Original Message-----
> > > > From: Ted Yu [mailto:[email protected]]
> > > > Sent: Friday, July 16, 2010 9:44 PM
> > > > To: [email protected]
> > > > Subject: Re: Hadoop and XML
> > > >
> > > > From an earlier post:
> > > >
> > > > http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html
> > > >
> > > > On Fri, Jul 16, 2010 at 3:07 PM, Peter Minearo <
> > > > [email protected]> wrote:
> > > >
> > > > > Moving the variable to a local variable did not seem to work:
> > > > >
> > > > >
> > > > > </PrivateRateSet>vateRateSet>
> > > > >
> > > > >
> > > > >
> > > > > public void map(Object key, Object value, OutputCollector output,
> > > > >                 Reporter reporter) throws IOException {
> > > > >     Text valueText = (Text) value;
> > > > >     String valueString = new String(valueText.getBytes(), "UTF-8");
> > > > >     String keyString = getXmlKey(valueString);
> > > > >     Text returnKeyText = new Text();
> > > > >     Text returnValueText = new Text();
> > > > >     returnKeyText.set(keyString);
> > > > >     returnValueText.set(valueString);
> > > > >     output.collect(returnKeyText, returnValueText);
> > > > > }
> > > > >
> > > > > -----Original Message-----
> > > > > From: Peter Minearo [mailto:[email protected]]
> > > > > Sent: Fri 7/16/2010 2:51 PM
> > > > > To: [email protected]
> > > > > Subject: RE: Hadoop and XML
> > > > >
> > > > > Whoops... right after I sent it, someone else made a suggestion and I
> > > > > realized what question 2 was about. I can try that, but wouldn't that
> > > > > cause Object bloat? During the Hadoop training I went through, it was
> > > > > mentioned to reuse the returned Key and Value objects to keep the
> > > > > number of Objects created down to a minimum. Is this not really a
> > > > > valid point?
> > > > >
> > > > >
> > > > >
> > > > > -----Original Message-----
> > > > > From: Peter Minearo [mailto:[email protected]]
> > > > > Sent: Friday, July 16, 2010 2:44 PM
> > > > > To: [email protected]
> > > > > Subject: RE: Hadoop and XML
> > > > >
> > > > >
> > > > > I am not using multi-threaded Map tasks. Also, if I understand your
> > > > > second question correctly:
> > > > > "Also can you try creating the output key and values in the map
> > > > > method (method local)?"
> > > > > In the first code snippet I am doing exactly that.
> > > > >
> > > > > Below is the class that runs the Job.
> > > > >
> > > > > public class HadoopJobClient {
> > > > >
> > > > >     private static final Log LOGGER =
> > > > >             LogFactory.getLog(Prds.class.getName());
> > > > >
> > > > >     public static void main(String[] args) {
> > > > >         JobConf conf = new JobConf(Prds.class);
> > > > >
> > > > >         conf.set("xmlinput.start", "<PrivateRateSet>");
> > > > >         conf.set("xmlinput.end", "</PrivateRateSet>");
> > > > >
> > > > >         conf.setJobName("PRDS Parse");
> > > > >
> > > > >         conf.setOutputKeyClass(Text.class);
> > > > >         conf.setOutputValueClass(Text.class);
> > > > >
> > > > >         conf.setMapperClass(PrdsMapper.class);
> > > > >         conf.setReducerClass(PrdsReducer.class);
> > > > >
> > > > >         conf.setInputFormat(XmlInputFormat.class);
> > > > >         conf.setOutputFormat(TextOutputFormat.class);
> > > > >
> > > > >         FileInputFormat.setInputPaths(conf, new Path(args[0]));
> > > > >         FileOutputFormat.setOutputPath(conf, new Path(args[1]));
> > > > >
> > > > >         // Run the job
> > > > >         try {
> > > > >             JobClient.runJob(conf);
> > > > >         } catch (IOException e) {
> > > > >             LOGGER.error(e.getMessage(), e);
> > > > >         }
> > > > >     }
> > > > > }
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > -----Original Message-----
> > > > > From: Soumya Banerjee [mailto:[email protected]]
> > > > > Sent: Fri 7/16/2010 2:29 PM
> > > > > To: [email protected]
> > > > > Subject: Re: Hadoop and XML
> > > > >
> > > > > Hi,
> > > > >
> > > > > Can you please share the code of the job submission client?
> > > > >
> > > > > Also, can you try creating the output key and values in the map
> > > > > method (method local)?
> > > > > Make sure you are not using a multi-threaded map task configuration.
> > > > >
> > > > > map() {
> > > > >     Text keyText = new Text();
> > > > >     Text valueText = new Text();
> > > > >
> > > > >     // rest of the code
> > > > > }
> > > > >
> > > > > Soumya.
> > > > >
> > > > > On Sat, Jul 17, 2010 at 2:30 AM, Peter Minearo <
> > > > > [email protected]> wrote:
> > > > >
> > > > > > I have an XML file that has sparse data in it. I am running a
> > > > > > MapReduce Job that reads in an XML file, pulls out a Key from
> > > > > > within the XML snippet and then hands back the Key and the XML
> > > > > > snippet (as the Value) to the OutputCollector. The reason is to
> > > > > > sort the file back into order.
> > > > > > Below is the snippet of code.
> > > > > >
> > > > > > public class XmlMapper extends MapReduceBase implements Mapper {
> > > > > >
> > > > > >     private Text keyText = new Text();
> > > > > >     private Text valueText = new Text();
> > > > > >
> > > > > >     @SuppressWarnings("unchecked")
> > > > > >     public void map(Object key, Object value, OutputCollector output,
> > > > > >                     Reporter reporter) throws IOException {
> > > > > >         Text valueText = (Text) value;
> > > > > >         String valueString = new String(valueText.getBytes(), "UTF-8");
> > > > > >         String keyString = getXmlKey(valueString);
> > > > > >         getKeyText().set(keyString);
> > > > > >         getValueText().set(valueString);
> > > > > >         output.collect(getKeyText(), getValueText());
> > > > > >     }
> > > > > >
> > > > > >
> > > > > >     public Text getKeyText() {
> > > > > >         return keyText;
> > > > > >     }
> > > > > >
> > > > > >     public void setKeyText(Text keyText) {
> > > > > >         this.keyText = keyText;
> > > > > >     }
> > > > > >
> > > > > >     public Text getValueText() {
> > > > > >         return valueText;
> > > > > >     }
> > > > > >
> > > > > >     public void setValueText(Text valueText) {
> > > > > >         this.valueText = valueText;
> > > > > >     }
> > > > > >
> > > > > >     private String getXmlKey(String value) {
> > > > > >         // Get the Key from the XML in the value.
> > > > > >     }
> > > > > > }
> > > > > >
> > > > > > The XML snippet from the Value is fine when it is passed into the
> > > > > > map() method. I am not changing any data either, just pulling out
> > > > > > information for the key. The problem I am seeing is that between the
> > > > > > Map phase and the Reduce phase, the XML is getting munged. For
> > > > > > example:
> > > > > >
> > > > > > </PrivateRate>
> > > > > > </PrivateRateSet>te>
> > > > > >
> > > > > > It is my understanding that Hadoop uses the same instance of the Key
> > > > > > and Value object when calling the Map method. What changes is the
> > > > > > data within those instances. So, I ran an experiment where I do not
> > > > > > have different Key or Value Text Objects. I reuse the ones passed
> > > > > > into the method, like below:
> > > > > >
> > > > > > public class XmlMapper extends MapReduceBase implements Mapper {
> > > > > >
> > > > > >     @SuppressWarnings("unchecked")
> > > > > >     public void map(Object key, Object value, OutputCollector output,
> > > > > >                     Reporter reporter) throws IOException {
> > > > > >         Text keyText = (Text) key;
> > > > > >         Text valueText = (Text) value;
> > > > > >         String valueString = new String(valueText.getBytes(), "UTF-8");
> > > > > >         String keyString = getXmlKey(valueString);
> > > > > >         keyText.set(keyString);
> > > > > >         valueText.set(valueString);
> > > > > >         output.collect(keyText, valueText);
> > > > > >     }
> > > > > >
> > > > > >     private String getXmlKey(String value) {
> > > > > >         // Get the Key from the XML in the value.
> > > > > >     }
> > > > > > }
> > > > > >
> > > > > > What was interesting about this is the fact that the XML was getting
> > > > > > munged within the Map Phase. When I changed over to the code at the
> > > > > > top, the Map phase was fine. However, the Reduce phase picks up the
> > > > > > munged XML. Trying to debug the problem, I came across this method in
> > > > > > the Text Object:
> > > > > >
> > > > > > public void set(byte[] utf8, int start, int len) {
> > > > > >     setCapacity(len, false);
> > > > > >     System.arraycopy(utf8, start, bytes, 0, len);
> > > > > >     this.length = len;
> > > > > > }
> > > > > >
> > > > > > If the "bytes" array had a length of 1000 and the "utf8" array
> has
> > a
> > > >
> > > > > > length of 500; doing a System.arraycopy() would only copy the
> first
> > > > > > 500 from "utf8" to "bytes" but leave the last 500 in "bytes"
> alone.
> > > > > > Could this be the cause of the XML munging?
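> > > > > >
> > > > > > A quick plain-Java illustration of that effect (hypothetical values,
> > > > > > not the actual job data):
> > > > > >
> > > > > >     byte[] bytes = "</PrivateRate></PrivateRateSet>".getBytes("UTF-8");
> > > > > >     byte[] utf8 = "</PrivateRateSet>".getBytes("UTF-8");
> > > > > >     // copy the shorter value over the start of the longer, reused buffer
> > > > > >     System.arraycopy(utf8, 0, bytes, 0, utf8.length);
> > > > > >     // decoding the whole buffer keeps the stale tail -> munged XML
> > > > > >     System.out.println(new String(bytes, "UTF-8"));
> > > > > >     // decoding only the first utf8.length bytes gives the clean value
> > > > > >     System.out.println(new String(bytes, 0, utf8.length, "UTF-8"));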
> > > > > >
> > > > > > All of this leads me to a few questions:
> > > > > >
> > > > > > 1) Has anyone successfully used XML snippets as the data format
> > > > > >    within a MapReduce job; not just reading from the file but used
> > > > > >    during the shuffle?
> > > > > > 2) Is anyone seeing this problem with XML or any other format?
> > > > > > 3) Does anyone know what is going on?
> > > > > > 4) Is this a bug?
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Peter
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
>