The DFSClient API has methods to set the block size when a file is created.
From DFSClient.java:

    FSOutputStream create(UTF8 src, boolean overwrite, short replication,
                          long blockSize) throws IOException

and

    FSOutputStream create(UTF8 src, boolean overwrite, short replication,
                          long blockSize, Progressable progress)
        throws IOException
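
For illustration, a minimal sketch of using the first overload to keep a whole index file in a single block. The package locations, the already-constructed DFSClient, and the path are assumptions; only the create() signatures above come from DFSClient.java:

    import java.io.IOException;
    import org.apache.hadoop.dfs.DFSClient;      // package location assumed
    import org.apache.hadoop.fs.FSOutputStream;  // package location assumed
    import org.apache.hadoop.io.UTF8;

    public class LargeBlockWrite {
        // Write an index file with a per-file block size larger than the
        // file itself, so the whole file should land on a single datanode.
        static void writeIndex(DFSClient dfs, byte[] indexBytes)
                throws IOException {
            long blockSize = 2L * 1024 * 1024 * 1024; // 2 GB, > index size
            FSOutputStream out = dfs.create(
                    new UTF8("/index/part-0000"), // hypothetical path
                    true,                         // overwrite
                    (short) 2,                    // replication
                    blockSize);
            try {
                out.write(indexBytes);
            } finally {
                out.close();
            }
        }
    }

Since the requested block size exceeds the file size, DFS should allocate only one block, which is what keeps the file on one node.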
Andrzej Bialecki wrote:
Eric Baldeschwieler wrote:
You might try setting the block size for these files to be "very
large". This should guarantee that the entire file ends up on one node.
If an index is composed of many files, you could "tar" them together
so each index is exactly one file.
Might work... Of course, as indexes get really large, this approach
might have side effects.
Sorry to be so obstinate, but this won't work either. First, when
segments are created they use whatever default block size is configured
(64 MB?). Is there a per-file setBlockSize in the API? I couldn't find
it - if there isn't, then the cluster would have to be shut down,
reconfigured, and restarted, and the segment data would have to be
copied just to change its block size ... yuck.
The index cannot be tar-ed, because Lucene needs direct access to
several of the files included in the index.
Index sizes are several gigabytes, with ~30 files per segment. Segment
data is several tens of gigabytes, in 4 MapFiles per segment.
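
Regarding the copy step Andrzej mentions: until there is a per-file setBlockSize, re-blocking existing data means streaming it into a new file created with the larger block size. A hedged sketch, reusing the assumptions of the example above (the shape of DFSClient.open() here is an assumption; only create() is confirmed by the quoted code):

    import java.io.IOException;
    import org.apache.hadoop.dfs.DFSClient;      // package location assumed
    import org.apache.hadoop.fs.FSInputStream;   // package location assumed
    import org.apache.hadoop.fs.FSOutputStream;  // package location assumed
    import org.apache.hadoop.io.UTF8;

    public class ReBlockCopy {
        // Hypothetical: rewrite an existing DFS file under a new block
        // size by copying it into a freshly created file.
        static void copyWithBlockSize(DFSClient dfs, String src, String dst,
                                      long newBlockSize) throws IOException {
            FSInputStream in = dfs.open(new UTF8(src)); // open() shape assumed
            FSOutputStream out = dfs.create(new UTF8(dst), true,
                                            (short) 2, newBlockSize);
            try {
                byte[] buf = new byte[64 * 1024];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
            } finally {
                in.close();
                out.close();
            }
        }
    }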