Github user yinxusen commented on the pull request:
https://github.com/apache/spark/pull/252#issuecomment-39047305
Hi @mateiz, here is my explanation:
* Hadoop has no such input format, but Mahout does. It is called
`org.apache.mahout.text.SequenceFilesFromDirectory`. However, it is hard to
use, and I do not think it is suitable to call from Spark directly because it
would pull in some heavy dependencies.
* For HDFS, it is not good practice to hold so many small files: they occupy
many NameNode entries and leave many partially filled blocks on the DataNodes.
So I am not sure whether this pattern is common in other programs, but it is
useful in machine learning algorithms such as Latent Dirichlet Allocation.
Indeed, this is the precursor pull request for my LDA implementation.
* A single 100MB file is usually held in two blocks, with 2 replicas each.
The `CombineFileInputFormat` class in `mapred`, i.e. the `hadoopFile` API
in `SparkContext`, cannot handle the split problem, because it allocates blocks
to splits without whole-file semantics. But the `CombineFileInputFormat`
class in `mapreduce` can: if we make `isSplitable()` return false, it
puts each file into a single split, no matter whether the file exceeds the
block size or not.
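The difference between the two behaviors can be illustrated with a small
stdlib-only sketch (plain Java, no Hadoop dependency; the class and method
names are mine, and the block/split arithmetic is a simplification for
illustration, not the real Hadoop allocation code):

```java
public class SplitDemo {
    // Number of blocks a file of `size` bytes occupies at the given block size.
    static long blocks(long size, long blockSize) {
        return (size + blockSize - 1) / blockSize;
    }

    // Block-level splitting: each block may land in a different split,
    // so a single file can be scattered across several splits.
    static long splitsPerFileSplittable(long size, long blockSize) {
        return blocks(size, blockSize);
    }

    // isSplitable() == false: the file is never divided, so it always
    // occupies exactly one split regardless of its size.
    static long splitsPerFileNotSplittable(long size, long blockSize) {
        return 1;
    }

    public static void main(String[] args) {
        long blockSize = 512;                 // block size used in the test suite below
        long[] fileSizes = {100, 512, 1300};  // hypothetical sizes around one block

        for (long s : fileSizes) {
            System.out.println(s + " bytes -> "
                + splitsPerFileSplittable(s, blockSize) + " split(s) if splittable, "
                + splitsPerFileNotSplittable(s, blockSize) + " if not");
        }
    }
}
```

For example, a 1300-byte file spans three 512-byte blocks, so block-level
splitting may place it in up to three splits, while the non-splittable path
always yields one.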
* I have tested the split problem in an earlier [test
suite](https://github.com/yinxusen/spark/commit/78c0f259a848aadc168edd76f9992ed4404bc510#diff-3f8bae96199c64e746098bd7a6d143e1R72)
with `fs.create(new Path(inputDir, fileName), true, 4096, 2, 512, null)`. Here
the block size is set to 512, and I use three different file sizes to test it.
@mengxr Sorry, I forgot to test the split problem when using the local disk as
the input source. I will add it ASAP. I think it should also be possible to
adjust the block size when reading from the local disk; otherwise I will have
to write a file larger than 32MB (the default local block size in Hadoop).
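The local-disk case follows the same arithmetic (a minimal sketch; the 32MB
figure is the default mentioned above, and the helper name is mine):

```java
public class LocalBlockDemo {
    static final long MB = 1L << 20;

    // Blocks occupied by a file of `size` bytes at the given block size.
    static long blocks(long size, long blockSize) {
        return (size + blockSize - 1) / blockSize;
    }

    public static void main(String[] args) {
        long blockSize = 32 * MB;  // default local block size cited in the comment
        System.out.println(blocks(10 * MB, blockSize));  // fits in one block: 1
        System.out.println(blocks(33 * MB, blockSize));  // exceeds 32MB: 2
    }
}
```

So with the default block size, only a file larger than 32MB spans more than
one block, which is why the test must either shrink the block size or write a
file of at least that size.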