I am already using XmlInputFormat; the input into the Map phase is not the problem. The problem lies between the Map and Reduce phases.
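If the Text.set() analysis at the bottom of this thread is right, the fix is simply to bound the byte-to-String conversion: Text.getBytes() returns the reused backing array, and only the first getLength() bytes are valid for the current record. A minimal sketch of the safe conversions (standard org.apache.hadoop.io.Text API; valueText stands for the reused Value instance handed to map()):

    // getBytes() returns Text's reused backing array; only the first
    // getLength() bytes belong to the current record.
    String munged = new String(valueText.getBytes(), "UTF-8");  // may include a stale tail
    String safe   = new String(valueText.getBytes(), 0, valueText.getLength(), "UTF-8");
    String same   = valueText.toString();  // also decodes only the valid range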
BTW - The article is correct: DO NOT USE StreamXmlRecordReader. XmlInputFormat is a lot faster. In my testing, StreamXmlRecordReader took 8 minutes to read a 1 GB XML document, whereas XmlInputFormat took under 2 minutes (on 2-core, 8 GB machines).

-----Original Message-----
From: Ted Yu [mailto:[email protected]]
Sent: Friday, July 16, 2010 9:44 PM
To: [email protected]
Subject: Re: Hadoop and XML

From an earlier post: http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html

On Fri, Jul 16, 2010 at 3:07 PM, Peter Minearo <[email protected]> wrote:

> Moving the variable to a local variable did not seem to work; the output is still munged:
>
> </PrivateRateSet>vateRateSet>
>
> public void map(Object key, Object value, OutputCollector output,
>                 Reporter reporter) throws IOException {
>     Text valueText = (Text) value;
>     String valueString = new String(valueText.getBytes(), "UTF-8");
>     String keyString = getXmlKey(valueString);
>     Text returnKeyText = new Text();
>     Text returnValueText = new Text();
>     returnKeyText.set(keyString);
>     returnValueText.set(valueString);
>     output.collect(returnKeyText, returnValueText);
> }
>
> -----Original Message-----
> From: Peter Minearo [mailto:[email protected]]
> Sent: Fri 7/16/2010 2:51 PM
> To: [email protected]
> Subject: RE: Hadoop and XML
>
> Whoops... right after I sent it, someone else made a suggestion and I realized what question 2 was about. I can try that, but wouldn't that cause object bloat? During the Hadoop training I went through, it was mentioned that you should reuse the returned Key and Value objects to keep the number of objects created to a minimum. Is that not really a valid point?
>
> -----Original Message-----
> From: Peter Minearo [mailto:[email protected]]
> Sent: Friday, July 16, 2010 2:44 PM
> To: [email protected]
> Subject: RE: Hadoop and XML
>
> I am not using multi-threaded Map tasks. Also, if I understand your second question correctly:
>
> "Also can you try creating the output key and values in the map method (method local)?"
>
> In the first code snippet I am doing exactly that.
>
> Below is the class that runs the job:
>
> public class HadoopJobClient {
>
>     private static final Log LOGGER = LogFactory.getLog(Prds.class.getName());
>
>     public static void main(String[] args) {
>         JobConf conf = new JobConf(Prds.class);
>
>         conf.set("xmlinput.start", "<PrivateRateSet>");
>         conf.set("xmlinput.end", "</PrivateRateSet>");
>
>         conf.setJobName("PRDS Parse");
>
>         conf.setOutputKeyClass(Text.class);
>         conf.setOutputValueClass(Text.class);
>
>         conf.setMapperClass(PrdsMapper.class);
>         conf.setReducerClass(PrdsReducer.class);
>
>         conf.setInputFormat(XmlInputFormat.class);
>         conf.setOutputFormat(TextOutputFormat.class);
>
>         FileInputFormat.setInputPaths(conf, new Path(args[0]));
>         FileOutputFormat.setOutputPath(conf, new Path(args[1]));
>
>         // Run the job
>         try {
>             JobClient.runJob(conf);
>         } catch (IOException e) {
>             LOGGER.error(e.getMessage(), e);
>         }
>     }
> }
>
> -----Original Message-----
> From: Soumya Banerjee [mailto:[email protected]]
> Sent: Fri 7/16/2010 2:29 PM
> To: [email protected]
> Subject: Re: Hadoop and XML
>
> Hi,
>
> Can you please share the code of the job submission client?
>
> Also, can you try creating the output key and value objects in the map method (method local)? Make sure you are not using a multi-threaded map task configuration.
>
> map() {
>     Text keyText = new Text();
>     Text valueText = new Text();
>
>     // rest of the code
> }
>
> Soumya.
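For reference, a runnable version of Soumya's suggestion might look like the sketch below, using the old org.apache.hadoop.mapred API that the rest of this thread uses; the getXmlKey() body and the <Key> element it looks for are made-up placeholders. The important detail is bounding the byte-to-String conversion by getLength(), because the Text instance handed to map() is reused between records:

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class XmlMapper extends MapReduceBase implements Mapper {

        @SuppressWarnings("unchecked")
        public void map(Object key, Object value, OutputCollector output,
                        Reporter reporter) throws IOException {
            Text valueText = (Text) value;
            // Only the first getLength() bytes are valid; getBytes() returns
            // the reused backing array, which may hold stale data from a
            // previous, longer record.
            String valueString =
                new String(valueText.getBytes(), 0, valueText.getLength(), "UTF-8");
            String keyString = getXmlKey(valueString);
            // Method-local output objects, per Soumya's suggestion; collect()
            // serializes them immediately, so per-call allocation is safe here.
            output.collect(new Text(keyString), new Text(valueString));
        }

        // Hypothetical stand-in for the key extraction; real code would parse
        // the XML snippet. The <Key> element is assumed, not from the thread.
        private String getXmlKey(String value) {
            int start = value.indexOf("<Key>") + "<Key>".length();
            return value.substring(start, value.indexOf("</Key>"));
        }
    }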
> > On Sat, Jul 17, 2010 at 2:30 AM, Peter Minearo <[email protected]> wrote:
> >
> > I have an XML file that has sparse data in it. I am running a MapReduce job that reads in an XML file, pulls a Key out of each XML snippet, and then hands the Key and the XML snippet (as the Value) back to the OutputCollector. The reason is to sort the file back into order. Below is the snippet of code:
> >
> > public class XmlMapper extends MapReduceBase implements Mapper {
> >
> >     private Text keyText = new Text();
> >     private Text valueText = new Text();
> >
> >     @SuppressWarnings("unchecked")
> >     public void map(Object key, Object value, OutputCollector output,
> >                     Reporter reporter) throws IOException {
> >         Text valueText = (Text) value;
> >         String valueString = new String(valueText.getBytes(), "UTF-8");
> >         String keyString = getXmlKey(valueString);
> >         getKeyText().set(keyString);
> >         getValueText().set(valueString);
> >         output.collect(getKeyText(), getValueText());
> >     }
> >
> >     public Text getKeyText() {
> >         return keyText;
> >     }
> >
> >     public void setKeyText(Text keyText) {
> >         this.keyText = keyText;
> >     }
> >
> >     public Text getValueText() {
> >         return valueText;
> >     }
> >
> >     public void setValueText(Text valueText) {
> >         this.valueText = valueText;
> >     }
> >
> >     private String getXmlKey(String value) {
> >         // Get the Key from the XML in the value.
> >     }
> > }
> >
> > The XML snippet from the Value is fine when it is passed into the map() method. I am not changing any data either, just pulling out information for the key. The problem I am seeing is that between the Map phase and the Reduce phase the XML is getting munged. For example:
> >
> > </PrivateRate>
> > </PrivateRateSet>te>
> >
> > It is my understanding that Hadoop uses the same instances of the Key and Value objects when calling the map() method; what changes is the data within those instances. So I ran an experiment where I do not create separate Key or Value Text objects but reuse the ones passed into the method, like below:
> >
> > public class XmlMapper extends MapReduceBase implements Mapper {
> >
> >     @SuppressWarnings("unchecked")
> >     public void map(Object key, Object value, OutputCollector output,
> >                     Reporter reporter) throws IOException {
> >         Text keyText = (Text) key;
> >         Text valueText = (Text) value;
> >         String valueString = new String(valueText.getBytes(), "UTF-8");
> >         String keyString = getXmlKey(valueString);
> >         keyText.set(keyString);
> >         valueText.set(valueString);
> >         output.collect(keyText, valueText);
> >     }
> >
> >     private String getXmlKey(String value) {
> >         // Get the Key from the XML in the value.
> >     }
> > }
> >
> > What was interesting is that with this version the XML was getting munged within the Map phase. When I changed over to the code at the top, the Map phase was fine; however, the Reduce phase picks up the munged XML. Trying to debug the problem, I came across this method in the Text object:
> >
> > public void set(byte[] utf8, int start, int len) {
> >     setCapacity(len, false);
> >     System.arraycopy(utf8, start, bytes, 0, len);
> >     this.length = len;
> > }
> >
> > If the "bytes" array had a length of 1000 and the "utf8" array has a length of 500, doing a System.arraycopy() would only copy the first 500 bytes from "utf8" into "bytes" but leave the last 500 bytes of "bytes" alone. Could this be the cause of the XML munging? (See the demonstration after this message.)
> > All of this leads me to a few questions:
> >
> > 1) Has anyone successfully used XML snippets as the data format within a MapReduce job; not just reading from the file, but also during the shuffle?
> > 2) Is anyone seeing this problem with XML or any other format?
> > 3) Does anyone know what is going on?
> > 4) Is this a bug?
> >
> > Thanks,
> >
> > Peter
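On questions 3 and 4: the set() method quoted above is the likely explanation, and it is documented Text behavior rather than a bug; getBytes() exposes the whole reused backing array, and only the first getLength() bytes belong to the current record. A small self-contained demonstration (the tag strings are hypothetical, chosen to mimic the munged output seen in this thread):

    import org.apache.hadoop.io.Text;

    public class StaleBytesDemo {
        public static void main(String[] args) throws Exception {
            Text t = new Text();
            byte[] longer  = "</PrivateRateSet>".getBytes("UTF-8");  // 17 bytes
            byte[] shorter = "</Private>".getBytes("UTF-8");         // 10 bytes

            t.set(longer, 0, longer.length);    // backing array now holds 17 bytes
            t.set(shorter, 0, shorter.length);  // only the first 10 bytes are overwritten

            // Reading the whole backing array picks up the stale tail of the
            // longer record, printing something like "</Private>ateSet>".
            System.out.println(new String(t.getBytes(), "UTF-8"));

            // Bounding the conversion by getLength() prints "</Private>".
            System.out.println(new String(t.getBytes(), 0, t.getLength(), "UTF-8"));
            System.out.println(t);  // toString() also decodes only the valid range
        }
    }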
