Yes.  CSV conversion and vector encoding are very often the limiting factor
with code like this.  Multi-threaded encoding is sometimes used to mitigate
this cost.  Another option is to use map-reduce to do the vector conversion,
storing the results in persistent files in a (hopefully) compact form.
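
As a rough illustration of the multi-threaded encoding idea, here is a
minimal sketch that farms CSV parsing out to a thread pool so that the
reading thread is not also doing the conversion.  The class and method names
(ParallelCsvEncoder, encode, encodeAll) are made up for this example, and a
plain double[] stands in for a real Vector; this is not Mahout's encoder API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelCsvEncoder {
    // Hypothetical encoder: parses one comma-separated line of numbers
    // into a dense array. A real encoder would also handle text fields.
    static double[] encode(String line) {
        String[] fields = line.split(",");
        double[] v = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            v[i] = Double.parseDouble(fields[i].trim());
        }
        return v;
    }

    // Encodes all lines on a fixed-size pool; results keep the input order
    // because the futures are collected in submission order.
    static List<double[]> encodeAll(List<String> lines, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<double[]>> futures = new ArrayList<>();
            for (String line : lines) {
                futures.add(pool.submit(() -> encode(line)));
            }
            List<double[]> out = new ArrayList<>();
            for (Future<double[]> f : futures) {
                out.add(f.get());
            }
            return out;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> lines = List.of("1,2,3", "4.5, 6");
        for (double[] v : encodeAll(lines, 2)) {
            System.out.println(java.util.Arrays.toString(v));
        }
    }
}
```

One task per line keeps the sketch short; in practice, submitting batches of
lines per task would probably amortize the scheduling overhead better.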

Another, independent option at classification time is to factor out aspects
of the model that can be precomputed.  This can reduce the number of features
that require encoding, which directly cuts the encoding cost.
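
One way to read the precomputation idea, sketched under the assumption of a
linear (logistic) model: if some features are the same for every example in
a batch, their contribution to the score can be computed once and reused,
so only the remaining features need encoding per example.  The names here
(PrecomputedScore, staticWeights, dynamicWeights) are illustrative only and
do not come from any library.

```java
public class PrecomputedScore {
    // Plain dot product of two equal-length arrays.
    static double dot(double[] w, double[] x) {
        double sum = 0.0;
        for (int i = 0; i < w.length; i++) {
            sum += w[i] * x[i];
        }
        return sum;
    }

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    private final double staticScore;      // computed once per batch
    private final double[] dynamicWeights; // weights for per-example features

    // Assumption: the model's weights split cleanly into a part applied to
    // features shared by the whole batch and a part applied per example.
    PrecomputedScore(double[] staticWeights, double[] staticFeatures,
                     double[] dynamicWeights) {
        this.staticScore = dot(staticWeights, staticFeatures);
        this.dynamicWeights = dynamicWeights;
    }

    // Only the per-example features have to be encoded and multiplied here.
    double classify(double[] dynamicFeatures) {
        return sigmoid(staticScore + dot(dynamicWeights, dynamicFeatures));
    }

    public static void main(String[] args) {
        PrecomputedScore model = new PrecomputedScore(
            new double[]{0.5, -1.0}, new double[]{2.0, 1.0}, new double[]{2.0});
        System.out.println(model.classify(new double[]{0.0}));
        System.out.println(model.classify(new double[]{3.0}));
    }
}
```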

On Fri, May 20, 2011 at 6:46 AM, XiaoboGu (JIRA) <j...@apache.org> wrote:

> There is another question, because there are concurrent threads training
> the examples, will the scores option cause concurrent performance problems,
> because the main thread will read and convert csv records into Vectors, will
> it become a bottleneck ?
