[
https://issues.apache.org/jira/browse/SPARK-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matei Zaharia resolved SPARK-1133.
----------------------------------
Resolution: Fixed
Fix Version/s: 1.0.0
> Add a new small files input for MLlib, which will return an RDD[(fileName,
> content)]
> ------------------------------------------------------------------------------------
>
> Key: SPARK-1133
> URL: https://issues.apache.org/jira/browse/SPARK-1133
> Project: Spark
> Issue Type: Improvement
> Components: Input/Output
> Affects Versions: 1.0.0
> Reporter: Xusen Yin
> Assignee: Xusen Yin
> Priority: Minor
> Labels: IO, MLLib, hadoop
> Fix For: 1.0.0
>
>
> While writing an LDA (Latent Dirichlet Allocation) implementation for Spark
> MLlib, I found that an input API for small files would be useful, so I wrote
> smallTextFiles() to support it.
> smallTextFiles() reads a directory of text files and returns an
> RDD[(String, String)], where the first String is the file name and the
> second is the contents of that file.
> smallTextFiles() can be used for local disk I/O or HDFS I/O, just like
> textFile() in SparkContext. In the LDA scenario, there are 2 common uses:
> 1. smallTextFiles() is used to preprocess local disk files, i.e. to combine
> those files into one large file, which is then transferred onto HDFS for
> further processing, such as LDA clustering.
> 2. It can also be used after the raw directory of small files has been
> transferred onto HDFS (though this is not recommended, because it consumes
> too many namenode entries), to cluster the files directly with LDA.
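
The description above does not show the implementation. As a rough illustration of the (file name, content) pairing and of use case 1, here is a minimal sketch in plain Python that uses only the standard library and no Spark. The names small_text_files and combine, and the tab-separated output format, are assumptions made for this example and are not part of the proposed API.

```python
import os
import tempfile


def small_text_files(directory):
    """Return (file_name, content) pairs for the regular files in a directory."""
    pairs = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            with open(path, encoding="utf-8") as handle:
                pairs.append((name, handle.read()))
    return pairs


def combine(pairs, output_path):
    """Write one 'file_name<TAB>content' record per line (hypothetical format)."""
    with open(output_path, "w", encoding="utf-8") as out:
        for name, content in pairs:
            # Collapse all whitespace so that each document fits on one line.
            out.write(name + "\t" + " ".join(content.split()) + "\n")


# Small demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as workdir:
    docs = os.path.join(workdir, "docs")
    os.mkdir(docs)
    for name, text in [("a.txt", "first doc\nline two"), ("b.txt", "second doc")]:
        with open(os.path.join(docs, name), "w", encoding="utf-8") as handle:
            handle.write(text)
    pairs = small_text_files(docs)
    combine(pairs, os.path.join(workdir, "combined.tsv"))
    print(pairs)
```

The combined file holds one document per line, so it can be stored as a single large file instead of many small ones, which is the point of use case 1.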
--
This message was sent by Atlassian JIRA
(v6.2#6252)