[ 
https://issues.apache.org/jira/browse/SPARK-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-1133.
----------------------------------

       Resolution: Fixed
    Fix Version/s: 1.0.0

> Add a new small files input for MLlib, which will return an RDD[(fileName, 
> content)]
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-1133
>                 URL: https://issues.apache.org/jira/browse/SPARK-1133
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>    Affects Versions: 1.0.0
>            Reporter: Xusen Yin
>            Assignee: Xusen Yin
>            Priority: Minor
>              Labels: IO, MLLib,, hadoop
>             Fix For: 1.0.0
>
>
> While writing an LDA (Latent Dirichlet Allocation) implementation for Spark 
> MLlib, I found that a small-files input API would be useful, so I wrote 
> smallTextFiles() to support it.
> smallTextFiles() reads a directory of text files and returns an 
> RDD[(String, String)], where the first String is the file name and the 
> second is the contents of that file.
> smallTextFiles() works with both local disk and HDFS, just like textFile() 
> in SparkContext. For LDA, there are two common uses:
> 1. Preprocess files on local disk: combine the small files into one large 
> file, then transfer it to HDFS for further processing, such as LDA 
> clustering.
> 2. Transfer the raw directory of small files to HDFS and cluster it directly 
> with LDA. This is not recommended, because it consumes too many namenode 
> entries.
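
For illustration only, the (file name, content) pairing and the first use case (combining many small files into one large file) can be sketched locally in plain Python. This is not the Spark API and not the code attached to this issue: small_text_files(), combine(), and demo() are hypothetical names, and the one-line-per-file, tab-separated output format is an assumption made for the example.

```python
import os
import tempfile


def small_text_files(directory):
    """Return sorted (file_name, content) pairs for each regular file in `directory`.

    Hypothetical local stand-in for the pairs described above; it returns a
    plain list, not an RDD.
    """
    pairs = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            with open(path, encoding="utf-8") as handle:
                pairs.append((name, handle.read()))
    return pairs


def combine(pairs, out_path):
    """Write one line per input file: file name, a tab, then the content with
    whitespace runs (including newlines) collapsed to single spaces.

    The tab-separated layout is an assumption chosen for this sketch.
    """
    with open(out_path, "w", encoding="utf-8") as out:
        for name, content in pairs:
            out.write(name + "\t" + " ".join(content.split()) + "\n")


def demo():
    """Create two small files in a temp directory, read them, and combine them."""
    with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
        for name, text in [("a.txt", "first doc\nline two"), ("b.txt", "second doc")]:
            with open(os.path.join(src, name), "w", encoding="utf-8") as handle:
                handle.write(text)
        pairs = small_text_files(src)
        combined = os.path.join(dst, "combined.tsv")
        combine(pairs, combined)
        with open(combined, encoding="utf-8") as handle:
            return pairs, handle.read()


if __name__ == "__main__":
    pairs, merged = demo()
    print(pairs)
    print(merged)
```

Keeping the file name next to each document's content means the single combined file can still be mapped back to the original documents after it is transferred to HDFS.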



--
This message was sent by Atlassian JIRA
(v6.2#6252)
