On Tue, Feb 14, 2012 at 10:56 AM, W.P. McNeill <bill...@gmail.com> wrote:

> I'm not sure what you mean by "flat format" here.
>
> In my scenario, I have a file input.xml that looks like this.
>
> <myfile>
>   <section>
>      <value>1</value>
>   </section>
>   <section>
>      <value>2</value>
>   </section>
> </myfile>
>
> input.xml is a plain text file, not a sequence file. If I read it with the
> XMLInputFormat my mapper gets called with (key, value) pairs that look like
> this:
>
> (nnnn, <section><value>1</value></section>)
> (nnnn, <section><value>2</value></section>)
>
> Where the keys are numerical offsets into the file. I then use this
> information to write a sequence file with these (key, value) pairs. So my
> Hadoop job that uses XMLInputFormat takes a text file as input and produces
> a sequence file as output.
>
> I don't know a rule of thumb for how many small files is too many. Maybe
> someone else on the list can chime in. I just know that when your
> throughput gets slow that's one possible cause to investigate.
>
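
For anyone following along, the record splitting described in the quoted
message can be sketched in plain Python, without Hadoop. This is only an
illustration under stated assumptions, not the actual XMLInputFormat code:
the function name split_xml_records is hypothetical, and it reports the
character offset at which each record starts, whereas the real input format
works with byte positions in the file.

```python
def split_xml_records(text, start_tag="<section>", end_tag="</section>"):
    """Yield (offset, record) pairs, one per start_tag ... end_tag span.

    Hypothetical helper: it mimics the (key, value) pairs described in the
    quoted message, using character offsets as keys.
    """
    pos = 0
    while True:
        start = text.find(start_tag, pos)
        if start == -1:
            return  # no further records
        end = text.find(end_tag, start + len(start_tag))
        if end == -1:
            return  # unterminated record: nothing complete left to emit
        end += len(end_tag)
        yield start, text[start:end]
        pos = end  # continue scanning after the record just emitted


if __name__ == "__main__":
    sample = (
        "<myfile>"
        "<section><value>1</value></section>"
        "<section><value>2</value></section>"
        "</myfile>"
    )
    for offset, record in split_xml_records(sample):
        print(offset, record)
```

In a real job each of these pairs would reach the mapper, which could then
write it to a sequence file as the quoted message describes.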

I need to install Hadoop. Does XMLInputFormat come as part of the install?
Could you please give me some pointers on installing Hadoop and, if
necessary, XMLInputFormat?
