I'm not sure what you mean by "flat format" here.

In my scenario, I have a file input.xml that looks like this:

<myfile>
   <section>
      <value>1</value>
   </section>
   <section>
      <value>2</value>
   </section>
</myfile>

input.xml is a plain text file, not a sequence file. If I read it with
XMLInputFormat, my mapper gets called with (key, value) pairs that look like
this:

(nnnn, <section><value>1</value></section>)
(nnnn, <section><value>2</value></section>)

where the keys are byte offsets into the file. I then use this
information to write a sequence file with these (key, value) pairs, so my
Hadoop job that uses XMLInputFormat takes a text file as input and produces
a sequence file as output.
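For what it's worth, the record-splitting behavior can be sketched in a few
lines of Python. This is only an illustration of the (offset, fragment)
pairing, not the real XMLInputFormat (which works on raw byte streams and
handles split boundaries); the helper name xml_records is my own:

```python
def xml_records(text, start_tag="<section>", end_tag="</section>"):
    """Yield (offset, fragment) pairs, one per start_tag..end_tag span,
    mimicking how an XML input format turns a file into records."""
    pos = 0
    while True:
        start = text.find(start_tag, pos)
        if start == -1:
            return
        end = text.find(end_tag, start)
        if end == -1:
            return
        end += len(end_tag)
        yield start, text[start:end]
        pos = end

doc = """<myfile>
   <section>
      <value>1</value>
   </section>
   <section>
      <value>2</value>
   </section>
</myfile>"""

records = list(xml_records(doc))
# Two records, each keyed by the offset of its <section> element.
```

Note that a fragment produced this way keeps the whitespace between the
tags; the flattened fragments above are just shorthand.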

I don't know of a rule of thumb for how many small files is too many; maybe
someone else on the list can chime in. I just know that when your
throughput gets slow, that's one possible cause to investigate.
