Hi community,

As I move forward to add LDA (Latent Dirichlet Allocation) to Spark MLlib, I find that a small-files input API would be useful, so I am writing a smallTextFiles() method to support it.
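To make the proposal concrete, here is a rough usage sketch. Only the smallTextFiles() method itself is what I propose to add to SparkContext; the path, the tokenization step, and everything else here are purely illustrative:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // pair RDD functions such as mapValues

object SmallTextFilesExample {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "smallTextFilesExample")

    // Proposed API: each element is (fileName, fileContents) for one small
    // file under the given directory (local path or HDFS path).
    val docs = sc.smallTextFiles("hdfs://namenode:9000/user/xusen/corpus")

    // Illustrative first step toward LDA: tokenize each document's contents.
    val tokenized = docs.mapValues(_.toLowerCase.split("""\s+""").toSeq)

    tokenized.collect().foreach { case (name, tokens) =>
      println(name + " -> " + tokens.take(10).mkString(" "))
    }

    sc.stop()
  }
}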
smallTextFiles() will digest a directory of text files and return an RDD[(String, String)], where the first String is the file name and the second is the contents of that file. Like textFile() in SparkContext, smallTextFiles() can be used for local disk I/O or HDFS I/O.

In the LDA scenario, there are two common uses:

1. Use smallTextFiles() to preprocess files on local disk, i.e. combine them into one large file, then transfer that file to HDFS for further processing such as LDA clustering.

2. Transfer the raw directory of small files to HDFS (not recommended, because it consumes too many namenode entries), then cluster it directly with LDA.

I have also seen on the Spark mailing list that some users need this function. I have already finished the implementation and all test suites pass; I am now trying to remove an unnecessary shuffle to improve performance. Here is my code:

https://github.com/yinxusen/incubator-spark/commit/ef418ea73e3cdaea9e45f60ce28fef3474872ade

What do you think? I would appreciate your advice, thanks!

--
Best Regards

-----------------------------------
Xusen Yin 尹绪森
Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia
Beijing University of Posts & Telecommunications
Intel Labs China
Homepage: http://yinxusen.github.io/