Hi,

  I have not looked into why this would be needed, but given it is
needed, I added a couple of comments to the PR.
Overall, it looks promising.

Regards,
Mridul


On Tue, Feb 25, 2014 at 8:05 AM, 尹绪森 <yinxu...@gmail.com> wrote:
> Hi community,
>
> As I move forward with writing an LDA (Latent Dirichlet Allocation) implementation for Spark
> MLlib, I find that an API for reading many small files would be useful, so I am writing a
> smallTextFiles() method to support it.
>
> smallTextFiles() digests a directory of text files and returns an
> RDD[(String, String)], where the first String is the file name and the second
> is the contents of that file.
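>
> A minimal usage sketch, assuming the proposed smallTextFiles() is exposed on
> SparkContext (the paths and the tokenization step are only illustrative):
>
>   import org.apache.spark.SparkContext
>   import org.apache.spark.SparkContext._
>   import org.apache.spark.rdd.RDD
>
>   object SmallTextFilesExample {
>     def main(args: Array[String]): Unit = {
>       val sc = new SparkContext("local", "smallTextFiles example")
>       // Proposed API: smallTextFiles(path) returns (fileName, fileContents) pairs
>       val docs: RDD[(String, String)] = sc.smallTextFiles("hdfs:///corpus/docs")
>       // Illustrative per-document tokenization, e.g. as LDA preprocessing
>       val tokenized = docs.mapValues(_.toLowerCase.split("\\s+").toSeq)
>       tokenized.take(5).foreach(println)
>       sc.stop()
>     }
>   }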
>
> smallTextFiles() can be used for local disk IO or HDFS IO, just like
> textFile() in SparkContext. In the LDA scenario, there are two common
> uses:
>
> 1. Use smallTextFiles() to preprocess local disk files, i.e. combine those
> small files into one large file, then transfer it to HDFS for further
> processing such as LDA clustering (see the sketch after this list).
>
> 2. Alternatively, transfer the raw directory of small files to HDFS (though
> this is not recommended, because it consumes too many namenode entries),
> then cluster it directly with LDA.
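>
> A rough sketch of use 1 under the same assumption; the combined
> one-document-per-line output format and the paths are only illustrative:
>
>   import org.apache.spark.SparkContext
>   import org.apache.spark.SparkContext._
>
>   object CombineSmallFiles {
>     def main(args: Array[String]): Unit = {
>       val sc = new SparkContext("local", "combine small files")
>       // Read a local directory of small files as (fileName, contents) pairs
>       val docs = sc.smallTextFiles("file:///data/corpus")
>       // Put each document on a single line (tab-separated file name and text)
>       // and write one combined dataset to HDFS for further processing
>       docs.map { case (name, text) => name + "\t" + text.replace("\n", " ") }
>           .saveAsTextFile("hdfs:///corpus/combined")
>       sc.stop()
>     }
>   }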
>
> I have also found on the Spark mailing list that some users need this
> functionality.
>
> I have already finished it, but I am now trying to remove an unnecessary
> shuffle to improve performance. Here is my code; all test suites pass:
> https://github.com/yinxusen/incubator-spark/commit/ef418ea73e3cdaea9e45f60ce28fef3474872ade
>
> What do you think? I would appreciate your advice, thanks!
>
> --
> Best Regards
> -----------------------------------
> Xusen Yin    尹绪森
> Beijing Key Laboratory of Intelligent Telecommunications Software and
> Multimedia
> Beijing University of Posts & Telecommunications
> Intel Labs China
> Homepage: http://yinxusen.github.io/
