It has to do with the HDFS data block size.

I had many small files, and performance became much better when I merged them.

The default block size is 64 MB, so repack your files so each is <= 64 MB (what I did and recommend),
or reconfigure your Hadoop:

<property>
 <name>dfs.block.size</name>
 <value>67108864</value>
 <description>The default block size for new files.</description>
</property>

Do something like:

cat * | rotatelogs ./merged/m 64M

It will merge the files and chop the data into 64 MB pieces for you.
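If rotatelogs isn't at hand, the same merge-and-chop idea can be sketched with coreutils `split`. This is just an illustration: the `input/` and `merged/` directory names and the tiny 150-byte chunk size are made up for the demo; for real HDFS-sized output you'd use `-b 64m`.

```shell
# Merge many small files, then chop the stream into fixed-size pieces.
# Directory names and sizes here are illustrative only.
mkdir -p input merged
head -c 100 /dev/zero | tr '\0' 'a' > input/f1   # two small demo files
head -c 100 /dev/zero | tr '\0' 'b' > input/f2
# Concatenate everything and split into 150-byte chunks
# (use `-b 64m` to match the default HDFS block size).
cat input/* | split -b 150 - merged/m-
ls merged/    # m-aa (150 bytes), m-ab (50 bytes)
```

The point is the same as with rotatelogs: a few block-sized files give the NameNode far fewer objects to track than thousands of tiny ones.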

yoav.morag wrote:
hi all -
can anyone comment on the performance cost of merging many small files into
an increasingly large MapFile ? will that cost be dependent on the size of
the larger MapFile (since I have to rewrite it) or is there a built-in
strategy to split it into smaller parts, affecting only those which were
touched ? thanks -
Yoav.
