Cool, thanks! Jira: https://issues.apache.org/jira/browse/SPARK-19247 PR: https://github.com/apache/spark/pull/16607
I think the LDA model has the exact same issue - currently the `topicsMatrix` (which is on the order of numWords*k, 4GB for numWords=3m and k=1000) is saved as a single element in a case class. We should probably address this in a separate issue.

On Fri, Jan 13, 2017 at 3:55 PM, Sean Owen <so...@cloudera.com> wrote:

> Yes, certainly debatable for word2vec. You have a good point that this
> could overrun the 2GB limit if the model is one big datum, for large but
> not crazy models. This model could probably easily be serialized as
> individual vectors in this case. It would introduce a
> backwards-compatibility issue, but it's possible to read old and new
> formats, I believe.
>
> On Fri, Jan 13, 2017 at 8:16 PM Asher Krim <ak...@hubspot.com> wrote:
>
>> I guess it depends on the definition of "small". A Word2Vec model with
>> vectorSize=300 and vocabulary=3m takes nearly 4GB. While it does fit on a
>> single machine (so isn't really "big" data), I don't see the benefit in
>> having the model stored in one file. On the contrary, it seems that we
>> would want the model to be distributed:
>> * it avoids shuffling the data to one executor
>> * it allows the whole cluster to participate in saving the model
>> * it avoids RPC issues (http://stackoverflow.com/questions/40842736/spark-word2vecmodel-exceeds-max-rpc-size-for-saving)
>> * "feature parity" with mllib (issues with one large model file were
>> already solved for mllib in SPARK-11994
>> <https://issues.apache.org/jira/browse/SPARK-11994>)
>>
>> On Fri, Jan 13, 2017 at 1:02 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
>>
>> Yup - it's because almost all model data in Spark ML (model coefficients)
>> is "small", i.e. non-distributed.
>>
>> If you look at ALS you'll see there is no repartitioning, since the factor
>> dataframes can be large.
>>
>> On Fri, 13 Jan 2017 at 19:42, Sean Owen <so...@cloudera.com> wrote:
>>
>> You're referring to code that serializes models, which are quite small.
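[Editor's note: the 2GB concern above is easy to sanity-check with back-of-the-envelope arithmetic. A minimal sketch in plain Python, not Spark code; it assumes 4-byte float vector entries, and reads "the 2GB limit" as the JVM's maximum byte-array length of 2^31 - 1, which is an interpretation rather than something the thread states.]

```python
# Back-of-the-envelope size check for a Word2Vec model saved as one datum.
# Assumes each vector entry is a 4-byte float (an assumption, not measured).
vector_size = 300        # dimensions per word vector
vocab_size = 3_000_000   # "vocabulary=3m" from the thread
bytes_per_float = 4

model_bytes = vector_size * vocab_size * bytes_per_float
jvm_array_limit = 2**31 - 1  # max JVM byte-array length, one reading of "the 2GB limit"

print(f"model size: {model_bytes / 1e9:.1f} GB")              # ~3.6 GB, i.e. "nearly 4GB"
print(f"exceeds 2GB limit: {model_bytes > jvm_array_limit}")  # True
```

This lines up with the figure quoted in the thread: 3.6 GB comfortably overruns the 2GB ceiling if serialized as one big datum, which is exactly the "large but not crazy" case Sean describes.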
>> For example, a PCA model consists of a few principal component vectors. It's
>> a Dataset of just one element being saved here. It's re-using the code path
>> normally used to save big datasets to output 1 file with 1 thing as
>> Parquet.
>>
>> On Fri, Jan 13, 2017 at 5:29 PM Asher Krim <ak...@hubspot.com> wrote:
>>
>> But why is that beneficial? The data is supposedly quite large;
>> distributing it across many partitions/files would seem to make sense.
>>
>> On Fri, Jan 13, 2017 at 12:25 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> That is usually so the result comes out in one file, not partitioned over
>> n files.
>>
>> On Fri, Jan 13, 2017 at 5:23 PM Asher Krim <ak...@hubspot.com> wrote:
>>
>> Hi,
>>
>> I'm curious why it's common for data to be repartitioned to 1 partition
>> when saving ML models:
>>
>> sqlContext.createDataFrame(Seq(data)).repartition(1).write.parquet(dataPath)
>>
>> This shows up in most ML models I've seen (Word2Vec
>> <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala#L314>,
>> PCA
>> <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala#L189>,
>> LDA
>> <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala#L605>).
>> Am I missing some benefit of repartitioning like this?
>>
>> Thanks,
>> --
>> Asher Krim
>> Senior Software Engineer
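[Editor's note: the one-file-per-partition behavior the whole thread hinges on can be shown with a toy model. This is plain Python, not Spark internals; the round-robin split and the part-file naming merely mimic Spark's output convention for illustration.]

```python
def partition_rows(rows, num_partitions):
    """Round-robin rows into partitions, loosely mimicking repartition(n)."""
    return [rows[i::num_partitions] for i in range(num_partitions)]

def output_files(partitions):
    """Spark writes one part-file per partition, so repartition(1) -> one file."""
    return [f"part-{i:05d}.parquet" for i in range(len(partitions))]

rows = list(range(10))
print(output_files(partition_rows(rows, 1)))       # ['part-00000.parquet']
print(len(output_files(partition_rows(rows, 4))))  # 4
```

The trade-off debated above follows directly: one partition gives a single tidy output file but funnels all model data through one task, while many partitions parallelize the write across the cluster at the cost of many part-files.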