Moving the variables to local variables did not seem to work either; the output is still munged:
</PrivateRateSet>vateRateSet>
public void map(Object key, Object value, OutputCollector output,
        Reporter reporter) throws IOException {
    Text valueText = (Text) value;
    String valueString = new String(valueText.getBytes(), "UTF-8");
    String keyString = getXmlKey(valueString);
    Text returnKeyText = new Text();
    Text returnValueText = new Text();
    returnKeyText.set(keyString);
    returnValueText.set(valueString);
    output.collect(returnKeyText, returnValueText);
}
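For what it's worth, the decode itself may be the culprit rather than object reuse: as far as I know, Text.getBytes() returns the whole backing buffer, which can be longer than the valid data, so new String(valueText.getBytes(), "UTF-8") can pull in stale bytes left over from an earlier, longer record. A sketch of the length-aware idiom, with plain byte arrays standing in for Text and the record contents invented for illustration:

```java
import java.nio.charset.StandardCharsets;

public class LengthAwareDecode {

    // Decode only the first len bytes, the way Text.getLength() is meant to be used.
    static String decode(byte[] backingBuffer, int len) {
        return new String(backingBuffer, 0, len, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A reused buffer still holding a longer, earlier record:
        byte[] buf = "</PrivateRate></PrivateRateSet>".getBytes(StandardCharsets.UTF_8);

        // A shorter record is copied over the front, as Text.set() does:
        byte[] cur = "</PrivateRateSet>".getBytes(StandardCharsets.UTF_8);
        System.arraycopy(cur, 0, buf, 0, cur.length);

        // Decoding the whole buffer drags in the stale tail -- the kind of munging shown above:
        System.out.println(new String(buf, StandardCharsets.UTF_8)); // </PrivateRateSet>rivateRateSet>

        // Decoding exactly cur.length bytes is clean:
        System.out.println(decode(buf, cur.length)); // </PrivateRateSet>
    }
}
```

If that is what is happening here, the corresponding fix in the mapper would be new String(valueText.getBytes(), 0, valueText.getLength(), "UTF-8").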
-----Original Message-----
From: Peter Minearo [mailto:[email protected]]
Sent: Fri 7/16/2010 2:51 PM
To: [email protected]
Subject: RE: Hadoop and XML
Whoops... right after I sent it, and after someone else made a suggestion, I
realized what question 2 was about. I can try that, but wouldn't that
cause object bloat? During the Hadoop training I went through, it was
mentioned that you should reuse the returned Key and Value objects to keep
the number of objects created to a minimum. Is this not really a valid
point?
-----Original Message-----
From: Peter Minearo [mailto:[email protected]]
Sent: Friday, July 16, 2010 2:44 PM
To: [email protected]
Subject: RE: Hadoop and XML
I am not using multi-threaded Map tasks. Also, if I understand your
second question correctly:
"Also can you try creating the output key and values in the map
method (method local)?"
In the first code snippet I am doing exactly that.
Below is the class that runs the Job.
public class HadoopJobClient {

    private static final Log LOGGER = LogFactory.getLog(Prds.class.getName());

    public static void main(String[] args) {
        JobConf conf = new JobConf(Prds.class);
        conf.set("xmlinput.start", "<PrivateRateSet>");
        conf.set("xmlinput.end", "</PrivateRateSet>");
        conf.setJobName("PRDS Parse");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        conf.setMapperClass(PrdsMapper.class);
        conf.setReducerClass(PrdsReducer.class);
        conf.setInputFormat(XmlInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Run the job
        try {
            JobClient.runJob(conf);
        } catch (IOException e) {
            LOGGER.error(e.getMessage(), e);
        }
    }
}
-----Original Message-----
From: Soumya Banerjee [mailto:[email protected]]
Sent: Fri 7/16/2010 2:29 PM
To: [email protected]
Subject: Re: Hadoop and XML
Hi,
Can you please share the code of the job submission client ?
Also, can you try creating the output key and values in the map
method (method local)?
Make sure you are not using a multi-threaded map task configuration.
public void map(...) {
    Text keyText = new Text();
    Text valueText = new Text();
    // rest of the code
}
Soumya.
On Sat, Jul 17, 2010 at 2:30 AM, Peter Minearo <
[email protected]> wrote:
> I have an XML file that has sparse data in it. I am running a
> MapReduce Job that reads in an XML file, pulls out a Key from within
> the XML snippet and then hands back the Key and the XML snippet (as
> the Value) to the OutputCollector. The reason is to sort the file
back into order.
> Below is the snippet of code.
>
> public class XmlMapper extends MapReduceBase implements Mapper {
>
>     private Text keyText = new Text();
>     private Text valueText = new Text();
>
>     @SuppressWarnings("unchecked")
>     public void map(Object key, Object value, OutputCollector output,
>             Reporter reporter) throws IOException {
>         Text valueText = (Text) value;
>         String valueString = new String(valueText.getBytes(), "UTF-8");
>         String keyString = getXmlKey(valueString);
>         getKeyText().set(keyString);
>         getValueText().set(valueString);
>         output.collect(getKeyText(), getValueText());
>     }
>
>     public Text getKeyText() {
>         return keyText;
>     }
>
>     public void setKeyText(Text keyText) {
>         this.keyText = keyText;
>     }
>
>     public Text getValueText() {
>         return valueText;
>     }
>
>     public void setValueText(Text valueText) {
>         this.valueText = valueText;
>     }
>
>     private String getXmlKey(String value) {
>         // Get the Key from the XML in the value.
>     }
>
> }
>
> The XML snippet from the Value is fine when it is passed into the
> map() method. I am not changing any data either, just pulling out
> information for the key. The problem I am seeing is between the Map
> phase and the Reduce phase, the XML is getting munged. For Example:
>
> </PrivateRate>
> </PrivateRateSet>te>
>
> It is my understanding that Hadoop uses the same instance of the Key
> and Value object when calling the Map method. What changes is the
> data within those instances. So, I ran an experiment where I do not
> have different Key or Value Text Objects. I reuse the ones passed
> into the method, like below:
>
> public class XmlMapper extends MapReduceBase implements Mapper {
>
>     @SuppressWarnings("unchecked")
>     public void map(Object key, Object value, OutputCollector output,
>             Reporter reporter) throws IOException {
>         Text keyText = (Text) key;
>         Text valueText = (Text) value;
>         String valueString = new String(valueText.getBytes(), "UTF-8");
>         String keyString = getXmlKey(valueString);
>         keyText.set(keyString);
>         valueText.set(valueString);
>         output.collect(keyText, valueText);
>     }
>
>     private String getXmlKey(String value) {
>         // Get the Key from the XML in the value.
>     }
>
> }
>
> What was interesting about this is the fact that the XML was getting
> munged within the Map Phase. When I changed over to the code at the
> top, the Map phase was fine. However, the Reduce phase picks up the
> munged XML. Trying to debug the problem, I came across this method in
> the Text Object:
>
> public void set(byte[] utf8, int start, int len) {
>     setCapacity(len, false);
>     System.arraycopy(utf8, start, bytes, 0, len);
>     this.length = len;
> }
>
> If the "bytes" array had a length of 1000 and the "utf8" array has a
> length of 500; doing a System.arraycopy() would only copy the first
> 500 from "utf8" to "bytes" but leave the last 500 in "bytes" alone.
> Could this be the cause of the XML munging?
>
> All of this leads me to a few questions:
>
> 1) Has anyone successfully used XML snippets as the data format within
> a MapReduce job; not just reading from the file but used during the
> shuffle?
> 2) Is anyone seeing this problem with XML or any other format?
> 3) Does anyone know what is going on?
> 4) Is this a bug?
>
>
> Thanks,
>
> Peter
>
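The set() method quoted in the original message does seem consistent with that theory. A toy stand-in for Text below (only the quoted set() logic plus a grow-only setCapacity; not the real Hadoop class) reproduces the stale tail:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Minimal stand-in for org.apache.hadoop.io.Text, modelling only the quoted set() logic. */
public class ToyText {
    private byte[] bytes = new byte[0];
    private int length;

    public void set(byte[] utf8, int start, int len) {
        setCapacity(len, false);
        System.arraycopy(utf8, start, bytes, 0, len);
        this.length = len;
    }

    // Grows the buffer when needed but never shrinks it.
    private void setCapacity(int len, boolean keepData) {
        if (bytes.length < len) {
            bytes = Arrays.copyOf(keepData ? bytes : new byte[0], len);
        }
    }

    public byte[] getBytes() { return bytes; }
    public int getLength() { return length; }

    public static void main(String[] args) {
        ToyText t = new ToyText();
        byte[] longer = "</PrivateRate></PrivateRateSet>".getBytes(StandardCharsets.UTF_8);
        t.set(longer, 0, longer.length);
        byte[] shorter = "</PrivateRateSet>".getBytes(StandardCharsets.UTF_8);
        t.set(shorter, 0, shorter.length);

        // getBytes() still returns the longer buffer; only getLength() shrank:
        System.out.println(new String(t.getBytes(), StandardCharsets.UTF_8));
        System.out.println(new String(t.getBytes(), 0, t.getLength(), StandardCharsets.UTF_8));
    }
}
```

So getBytes() on its own is not safe to hand to new String(...); pairing it with getLength() recovers exactly the current record.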