I also added Peter's comment to the JIRA I logged: https://issues.apache.org/jira/browse/HADOOP-6868
On Tue, Jul 20, 2010 at 9:38 AM, Ted Yu <[email protected]> wrote:

So the correct call should be:

    String valueString = new String(valueText.getBytes(), 0,
            valueText.getLength(), "UTF-8");

Cheers

On Tue, Jul 20, 2010 at 9:23 AM, Jeff Bean <[email protected]> wrote:

data.length is the length of the byte array.

Text.getLength() most likely returns a different value than
getBytes().length.

Hadoop reuses box class objects like Text, so what it's probably doing is
writing over the byte array, lengthening it as necessary, and just updating
a separate length attribute.

Jeff
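Jeff's theory can be demonstrated directly. Below is a minimal sketch (mine,
not from the thread; it assumes only org.apache.hadoop.io.Text on the
classpath): when a reused Text is set to a shorter value, the backing array
keeps stale bytes past getLength(), so decoding the whole array, as the
original mapper does, leaks the tail of the previous record, while Ted's
corrected call does not.

    import org.apache.hadoop.io.Text;

    public class TextReuseSketch {
        public static void main(String[] args) throws Exception {
            Text t = new Text();
            t.set("<rate></PrivateRateSet>");  // a longer record arrives first
            t.set("</PrivateRateSet>");        // a shorter one reuses the backing array

            // Decoding the whole backing array picks up stale trailing bytes:
            String bad = new String(t.getBytes(), "UTF-8");
            // Decoding only the valid prefix is correct:
            String good = new String(t.getBytes(), 0, t.getLength(), "UTF-8");

            System.out.println(bad);   // may print "</PrivateRateSet>teSet>" -- munged
            System.out.println(good);  // prints "</PrivateRateSet>"
        }
    }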
On Tue, Jul 20, 2010 at 8:56 AM, Ted Yu <[email protected]> wrote:

Interesting. The String class is able to handle this scenario:

    public String(byte[] data, String encoding) throws
            UnsupportedEncodingException {
        this(data, 0, data.length, encoding);
    }

On Tue, Jul 20, 2010 at 6:01 AM, Jeff Bean <[email protected]> wrote:

I think the problem is here:

    String valueString = new String(valueText.getBytes(), "UTF-8");

The Javadoc for Text
(http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/Text.html#getBytes%28%29)
says:

    getBytes() - Returns the raw bytes; however, only data up to getLength() is valid.

So try getting the length, truncating the byte array at the value returned
by getLength(), and THEN converting it to a String.

Jeff
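Jeff's truncate-then-convert suggestion as a sketch (the helper name is
mine): copy the valid region out of the backing array, then decode the copy.
Note that Text.toString() already does the equivalent internally, and Ted's
one-liner above avoids the intermediate copy, so either of those is
preferable in a hot loop.

    import java.io.UnsupportedEncodingException;
    import java.util.Arrays;

    import org.apache.hadoop.io.Text;

    public final class TextUtil {
        // Hypothetical helper: truncate to the valid region, then decode.
        public static String toStringSafe(Text t) throws UnsupportedEncodingException {
            byte[] valid = Arrays.copyOf(t.getBytes(), t.getLength());
            return new String(valid, "UTF-8");
        }
    }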
On Mon, Jul 19, 2010 at 9:08 AM, Ted Yu <[email protected]> wrote:

For your initial question on Text.set():
Text.setCapacity() allocates a new byte array. Since keepData is false, the
old data wouldn't be copied over.

On Mon, Jul 19, 2010 at 8:01 AM, Peter Minearo <[email protected]> wrote:

I am already using XmlInputFormat. The input into the Map phase is not the
problem. The problem lies between the Map and Reduce phases.

BTW - the article is correct. DO NOT USE StreamXmlRecordReader.
XmlInputFormat is a lot faster. From my testing, StreamXmlRecordReader took
8 minutes to read a 1 GB XML document, whereas XmlInputFormat was under 2
minutes (using 2-core, 8 GB machines).

-----Original Message-----
From: Ted Yu [mailto:[email protected]]
Sent: Friday, July 16, 2010 9:44 PM
To: [email protected]
Subject: Re: Hadoop and XML

From an earlier post:
http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html

On Fri, Jul 16, 2010 at 3:07 PM, Peter Minearo <[email protected]> wrote:

Moving the variable to a local variable did not seem to work:

    </PrivateRateSet>vateRateSet>

    public void map(Object key, Object value, OutputCollector output,
                    Reporter reporter) throws IOException {
        Text valueText = (Text) value;
        String valueString = new String(valueText.getBytes(), "UTF-8");
        String keyString = getXmlKey(valueString);
        Text returnKeyText = new Text();
        Text returnValueText = new Text();
        returnKeyText.set(keyString);
        returnValueText.set(valueString);
        output.collect(returnKeyText, returnValueText);
    }
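With hindsight from the top of the thread, the local-variable change could
not have helped: the corruption enters at the decode line, which reads the
entire reused backing array, not at the output-object allocation. Peter's
method with only the decode changed to Ted's form would be (a sketch;
getXmlKey elided as in the original):

    public void map(Object key, Object value, OutputCollector output,
                    Reporter reporter) throws IOException {
        Text valueText = (Text) value;
        // Decode only the valid bytes; the backing array may still hold
        // the tail of a previous, longer record.
        String valueString = new String(valueText.getBytes(), 0,
                valueText.getLength(), "UTF-8");
        String keyString = getXmlKey(valueString);
        Text returnKeyText = new Text();
        Text returnValueText = new Text();
        returnKeyText.set(keyString);
        returnValueText.set(valueString);
        output.collect(returnKeyText, returnValueText);
    }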
-----Original Message-----
From: Peter Minearo [mailto:[email protected]]
Sent: Fri 7/16/2010 2:51 PM
To: [email protected]
Subject: RE: Hadoop and XML

Whoops... right after I sent it, someone else made a suggestion and I
realized what question 2 was about. I can try that, but wouldn't that cause
object bloat? During the Hadoop training I went through, it was mentioned
to reuse the returning Key and Value objects to keep the number of objects
created down to a minimum. Is this not really a valid point?

-----Original Message-----
From: Peter Minearo [mailto:[email protected]]
Sent: Friday, July 16, 2010 2:44 PM
To: [email protected]
Subject: RE: Hadoop and XML

I am not using multi-threaded Map tasks. Also, if I understand your second
question correctly: "Also can you try creating the output key and values in
the map method (method local)?" In the first code snippet I am doing
exactly that.

Below is the class that runs the Job.

    import java.io.IOException;

    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextOutputFormat;

    public class HadoopJobClient {

        private static final Log LOGGER =
                LogFactory.getLog(HadoopJobClient.class.getName());

        public static void main(String[] args) {
            JobConf conf = new JobConf(HadoopJobClient.class);

            conf.set("xmlinput.start", "<PrivateRateSet>");
            conf.set("xmlinput.end", "</PrivateRateSet>");

            conf.setJobName("PRDS Parse");

            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(Text.class);

            conf.setMapperClass(PrdsMapper.class);
            conf.setReducerClass(PrdsReducer.class);

            conf.setInputFormat(XmlInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            // Run the job
            try {
                JobClient.runJob(conf);
            } catch (IOException e) {
                LOGGER.error(e.getMessage(), e);
            }
        }
    }

-----Original Message-----
From: Soumya Banerjee [mailto:[email protected]]
Sent: Fri 7/16/2010 2:29 PM
To: [email protected]
Subject: Re: Hadoop and XML

Hi,

Can you please share the code of the job submission client?

Also, can you try creating the output key and values in the map method
(method local)? Make sure you are not using a multi-threaded map task
configuration.

    map() {
        Text keyText = new Text();
        Text valueText = new Text();

        // rest of the code
    }

Soumya.
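On Peter's object-bloat question above: reusing your own output Text objects
across map() calls is safe, because collect() serializes the pair at call
time; what is unsafe is decoding the framework's input object without
respecting getLength(). A sketch combining both points (class name and key
extraction are mine, for illustration):

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class ReuseSafeXmlMapper extends MapReduceBase
            implements Mapper<Object, Text, Text, Text> {

        // Reused across calls to keep object churn down; safe because
        // collect() serializes the pair before the next map() call mutates them.
        private final Text outKey = new Text();
        private final Text outValue = new Text();

        public void map(Object key, Text value, OutputCollector<Text, Text> output,
                        Reporter reporter) throws IOException {
            // Decode only the valid region of the reused input buffer.
            String xml = new String(value.getBytes(), 0, value.getLength(), "UTF-8");
            outKey.set(getXmlKey(xml));
            outValue.set(xml);
            output.collect(outKey, outValue);
        }

        private String getXmlKey(String xml) {
            return xml; // placeholder -- real key extraction elided, as in Peter's code
        }
    }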
On Sat, Jul 17, 2010 at 2:30 AM, Peter Minearo <[email protected]> wrote:

I have an XML file that has sparse data in it. I am running a MapReduce job
that reads in an XML file, pulls out a Key from within the XML snippet, and
then hands back the Key and the XML snippet (as the Value) to the
OutputCollector. The reason is to sort the file back into order. Below is
the snippet of code.

    public class XmlMapper extends MapReduceBase implements Mapper {

        private Text keyText = new Text();
        private Text valueText = new Text();

        @SuppressWarnings("unchecked")
        public void map(Object key, Object value, OutputCollector output,
                        Reporter reporter) throws IOException {
            Text valueText = (Text) value;
            String valueString = new String(valueText.getBytes(), "UTF-8");
            String keyString = getXmlKey(valueString);
            getKeyText().set(keyString);
            getValueText().set(valueString);
            output.collect(getKeyText(), getValueText());
        }

        public Text getKeyText() {
            return keyText;
        }

        public void setKeyText(Text keyText) {
            this.keyText = keyText;
        }

        public Text getValueText() {
            return valueText;
        }

        public void setValueText(Text valueText) {
            this.valueText = valueText;
        }

        private String getXmlKey(String value) {
            // Get the Key from the XML in the value.
        }
    }

The XML snippet from the Value is fine when it is passed into the map()
method. I am not changing any data either, just pulling out information for
the key. The problem I am seeing is that between the Map phase and the
Reduce phase, the XML is getting munged. For example:

    </PrivateRate>
    </PrivateRateSet>te>

It is my understanding that Hadoop uses the same instance of the Key and
Value object when calling the map method; what changes is the data within
those instances. So I ran an experiment where I do not have different Key
or Value Text objects. I reuse the ones passed into the method, like below:

    public class XmlMapper extends MapReduceBase implements Mapper {

        @SuppressWarnings("unchecked")
        public void map(Object key, Object value, OutputCollector output,
                        Reporter reporter) throws IOException {
            Text keyText = (Text) key;
            Text valueText = (Text) value;
            String valueString = new String(valueText.getBytes(), "UTF-8");
            String keyString = getXmlKey(valueString);
            keyText.set(keyString);
            valueText.set(valueString);
            output.collect(keyText, valueText);
        }

        private String getXmlKey(String value) {
            // Get the Key from the XML in the value.
        }
    }

What was interesting about this is the fact that the XML was getting munged
within the Map phase. When I changed over to the code at the top, the Map
phase was fine; however, the Reduce phase picks up the munged XML.

Trying to debug the problem, I came across this method in the Text class:

    public void set(byte[] utf8, int start, int len) {
        setCapacity(len, false);
        System.arraycopy(utf8, start, bytes, 0, len);
        this.length = len;
    }

If the "bytes" array had a length of 1000 and the "utf8" array has a length
of 500, doing a System.arraycopy() would only copy the first 500 bytes from
"utf8" to "bytes" but leave the last 500 in "bytes" alone. Could this be
the cause of the XML munging?

All of this leads me to a few questions:

1) Has anyone successfully used XML snippets as the data format within a
MapReduce job; not just reading from the file, but used during the shuffle?
2) Is anyone seeing this problem with XML or any other format?
3) Does anyone know what is going on?
4) Is this a bug?

Thanks,

Peter
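Peter's reading of set() is right, and this exchange is what the JIRA linked
at the top of the thread came out of. A quick check (a sketch; exact
capacities may vary by Hadoop version, since setCapacity may over-allocate):

    import org.apache.hadoop.io.Text;

    public class CapacityCheck {
        public static void main(String[] args) {
            Text t = new Text();
            t.set(new byte[1000], 0, 1000);  // grows the backing array to >= 1000 bytes
            t.set(new byte[500], 0, 500);    // copies 500 bytes; capacity is not shrunk

            System.out.println(t.getLength());        // 500 -- the only valid region
            System.out.println(t.getBytes().length);  // >= 1000 -- stale bytes remain
        }
    }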
