Hi community,

As I move forward with writing an LDA (Latent Dirichlet Allocation) implementation
for Spark MLlib, I find that an API for reading many small input files would be
useful, so I am writing a smallTextFiles() method to support it.

smallTextFiles() reads a directory of text files and returns an
RDD[(String, String)], where the first String is the file name and the second
is the contents of that text file.

smallTextFiles() works with both local disk I/O and HDFS I/O, just like
textFile() in SparkContext. In the LDA scenario, there are two common
uses:

1. Use smallTextFiles() to preprocess local disk files, i.e. combine
them into one large file, then transfer that file to HDFS for further
processing such as LDA clustering (a rough sketch follows this list).

2. Alternatively, transfer the raw directory of small files to HDFS (though
this is not recommended, because it consumes too many NameNode entries),
then cluster it directly with LDA.

I also see on the Spark mailing list that some users need this
functionality.

I have already finished it, but I am now trying to remove an unnecessary
shuffle to improve performance. Here is my code; all test suites have passed.
https://github.com/yinxusen/incubator-spark/commit/ef418ea73e3cdaea9e45f60ce28fef3474872ade

What do you think? I would appreciate your advice. Thanks!

-- 
Best Regards
-----------------------------------
Xusen Yin    尹绪森
Beijing Key Laboratory of Intelligent Telecommunications Software and
Multimedia
Beijing University of Posts & Telecommunications
Intel Labs China
Homepage: http://yinxusen.github.io/
