Hi, I have not looked into why this would be needed, but given it is needed, I added a couple of comments to the PR. Overall, it looks promising.
Regards,
Mridul

On Tue, Feb 25, 2014 at 8:05 AM, 尹绪森 <yinxu...@gmail.com> wrote:
> Hi community,
>
> As I move forward to writing an LDA (Latent Dirichlet Allocation) implementation
> for Spark mllib, I find that a small-files input API is useful, so I am writing a
> smallTextFiles() method to support it.
>
> smallTextFiles() will digest a directory of text files and return an
> RDD[(String, String)], where the first String is the file name and the second
> is the contents of the text file.
>
> smallTextFiles() can be used for local disk IO or HDFS IO, just like
> textFile() in SparkContext. In the LDA scenario, there are two common uses:
>
> 1. We use smallTextFiles() to preprocess local disk files, i.e. combine
> those files into one large file, then transfer it onto HDFS for further
> processing, such as LDA clustering.
>
> 2. We can also transfer the raw directory of small files onto HDFS (though
> this is not recommended, because it will cost too many namenode entries),
> then cluster it directly with LDA.
>
> I also see on the Spark mailing list that some users need this function.
>
> I have already finished it, but I am now trying to remove an unnecessary
> shuffle to improve performance. Here is my code; all test suites have passed:
> https://github.com/yinxusen/incubator-spark/commit/ef418ea73e3cdaea9e45f60ce28fef3474872ade
>
> What do you think? I look forward to your advice, thanks!
>
> --
> Best Regards
> -----------------------------------
> Xusen Yin 尹绪森
> Beijing Key Laboratory of Intelligent Telecommunications Software and
> Multimedia
> Beijing University of Posts & Telecommunications
> Intel Labs China
> Homepage: http://yinxusen.github.io/
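
For reference, here is a minimal usage sketch of the proposed API for the LDA preprocessing described above. The method name smallTextFiles() and its (fileName, fileContents) return type are taken from the email; the directory path, tokenization step, and surrounding setup are illustrative assumptions only, not part of the linked commit.

// Sketch assuming smallTextFiles() is added to SparkContext as proposed above.
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object SmallTextFilesExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "SmallTextFilesExample")

    // Proposed API: one (fileName, fileContents) pair per small file in the directory.
    // The path below is a hypothetical example.
    val docs: RDD[(String, String)] = sc.smallTextFiles("hdfs:///data/lda-corpus")

    // Example LDA preprocessing: tokenize each document into lower-cased words.
    val tokenized: RDD[(String, Array[String])] = docs.map { case (fileName, contents) =>
      (fileName, contents.toLowerCase.split("\\s+").filter(_.nonEmpty))
    }

    println("Number of documents: " + tokenized.count())
    sc.stop()
  }
}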