Github user yinxusen commented on the pull request:
https://github.com/apache/spark/pull/252#issuecomment-39047305
Hi @mateiz, here is my explanation:
* Hadoop has no such input format, but Mahout does. It is called
`org.apache.mahout.text.SequenceFilesFromDirectory`. However, it is hard to
use, and I do not think it is suitable to call from Spark directly because it
would pull in some heavy dependencies.
* For HDFS, it is not good practice to hold so many small files: they occupy
many NameNode entries and leave many partially filled blocks on the DataNodes.
So I am not sure whether this pattern is common in other programs, but it is
useful in machine learning algorithms such as Latent Dirichlet Allocation.
Indeed, this is the precursor pull request for my LDA implementation.
* A single 100MB file is usually held in two blocks, with 2 replicas each.
The `CombineFileInputFormat` class in `mapred`, i.e. the `hadoopFile` API
in `SparkContext`, cannot handle the split problem, because it allocates blocks
to splits without whole-file semantics. But the `CombineFileInputFormat`
class in `mapreduce` can: if we make `isSplitable()` return false, it
puts each file into a single split, no matter whether the file exceeds the
block size or not.
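The difference between the two behaviors can be illustrated with a small
stdlib-only sketch (plain Java, no Hadoop dependency; the class and method
names are mine, and the block/split arithmetic is a simplification for
illustration, not the real Hadoop allocation code):

```java
public class SplitDemo {
    // Number of blocks a file of `size` bytes occupies at the given block size.
    static long blocks(long size, long blockSize) {
        return (size + blockSize - 1) / blockSize;
    }

    // Block-level splitting: each block may land in a different split,
    // so a single file can be scattered across several splits.
    static long splitsPerFileSplittable(long size, long blockSize) {
        return blocks(size, blockSize);
    }

    // isSplitable() == false: the file is never divided, so it always
    // occupies exactly one split regardless of its size.
    static long splitsPerFileNotSplittable(long size, long blockSize) {
        return 1;
    }

    public static void main(String[] args) {
        long blockSize = 512;                 // block size used in the test suite below
        long[] fileSizes = {100, 512, 1300};  // hypothetical sizes around one block

        for (long s : fileSizes) {
            System.out.println(s + " bytes -> "
                + splitsPerFileSplittable(s, blockSize) + " split(s) if splittable, "
                + splitsPerFileNotSplittable(s, blockSize) + " if not");
        }
    }
}
```

For example, a 1300-byte file spans three 512-byte blocks, so block-level
splitting may place it in up to three splits, while the non-splittable path
always yields one.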
* I have tested the split problem in an earlier [test
suite](https://github.com/yinxusen/spark/commit/78c0f259a848aadc168edd76f9992ed4404bc510#diff-3f8bae96199c64e746098bd7a6d143e1R72)
with `fs.create(new Path(inputDir, fileName), true, 4096, 2, 512, null)`. Here
the block size is set to 512, and I use three different file sizes to test it.
@mengxr Sorry, I forgot to test the split problem when using the local disk as
the input source. I will add it ASAP. I think it should also be possible to
adjust the block size when reading from the local disk; otherwise I will have
to write a file larger than 32MB (the default local block size in Hadoop).
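The local-disk case follows the same arithmetic (a minimal sketch; the 32MB
figure is the default mentioned above, and the helper name is mine):

```java
public class LocalBlockDemo {
    static final long MB = 1L << 20;

    // Blocks occupied by a file of `size` bytes at the given block size.
    static long blocks(long size, long blockSize) {
        return (size + blockSize - 1) / blockSize;
    }

    public static void main(String[] args) {
        long blockSize = 32 * MB;  // default local block size cited in the comment
        System.out.println(blocks(10 * MB, blockSize));  // fits in one block: 1
        System.out.println(blocks(33 * MB, blockSize));  // exceeds 32MB: 2
    }
}
```

So with the default block size, only a file larger than 32MB spans more than
one block, which is why the test must either shrink the block size or write a
file of at least that size.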