The Spark ML Pipelines API (not just Transformers and Estimators, but custom Pipeline classes as well) is definitely not machine-learning specific.
We use them heavily in our development. We're building machine learning
pipelines *BUT* many steps involve joining, schema manipulation, and
pre/post-processing data for the actual statistical algorithm, with a
monoidal architecture (I have a slide deck if you're interested). The
Pipelines API is a powerful abstraction that makes things very easy for
us. It is not always perfect (imho transformSchema is a little bit of a
mess; maybe the future Dataset API will help), but it makes our
pipelines very customisable and pluggable (you can add/swap/remove any
PipelineStage at any point).

On 26 March 2016 at 09:26, Jacek Laskowski <ja...@japila.pl> wrote:
> Hi Joseph,
>
> Thanks for the response. I'm one who doesn't understand all the
> hype/need for Machine Learning...yet, and through Spark ML(lib)
> glasses I'm looking at the ML space. In the meantime I've got a few
> assignments (in a project with Spark and Scala) that have required
> quite extensive dataset manipulation.
>
> It was when I sank into using DataFrame/Dataset for data manipulation
> rather than RDD (I remember talking to Brian about how RDD is an
> "assembly" language compared to the higher-level concept of
> DataFrames with Catalyst and other optimizations). After a few days
> with DataFrame I learnt he was so right! (sorry Brian, it took me
> longer to understand your point).
>
> I started using DataFrames in far more places than one could ever
> accept :-) I was so...carried away with DataFrames (esp. show vs
> foreach(println) and UDFs via the udf() function).
>
> And then I moved to the Pipeline API and discovered Transformers, and
> PipelineStage that can create pipelines of DataFrame manipulation.
> They read so well that I'm pretty sure people would love using them
> more often, but...they belong to MLlib so they are part of the ML
> space (which not many devs have tackled yet). I applied the approach
> to using withColumn to have a better debugging experience (if I ever
> need it).
> I learnt it after having watched your presentation about the Pipeline
> API. It was so helpful in my RDD/DataFrame space.
>
> So, to promote a more extensive use of Pipelines, PipelineStages, and
> Transformers, I was thinking about moving that part to the
> SQL/DataFrame API where they really belong. If not, I think people
> might miss the beauty of the very fine and so helpful Transformers.
>
> Transformers are *not* an ML thing -- they are a DataFrame thing and
> should be where they really belong (for their greater adoption).
>
> What do you think?
>
> Regards,
> Jacek Laskowski
> ----
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
> On Sat, Mar 26, 2016 at 3:23 AM, Joseph Bradley <jos...@databricks.com> wrote:
> > There have been some comments about using Pipelines outside of ML,
> > but I have not yet seen a real need for it. If a user does want to
> > use Pipelines for non-ML tasks, they can still use Transformers +
> > PipelineModels. Will that work?
> >
> > On Fri, Mar 25, 2016 at 8:05 AM, Jacek Laskowski <ja...@japila.pl> wrote:
> >> Hi,
> >>
> >> After a few weeks with spark.ml now, I have come to the conclusion
> >> that the Transformer concept from the Pipeline API (spark.ml/MLlib)
> >> should be part of DataFrame (SQL), where it fits better. Are there
> >> any plans to migrate the Transformer API (ML) to DataFrame (SQL)?
> >> Regards,
> >> Jacek Laskowski
> >> ----
> >> https://medium.com/@jaceklaskowski/
> >> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> >> Follow me at https://twitter.com/jaceklaskowski
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: dev-h...@spark.apache.org
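For anyone skimming the thread: the pattern being discussed is simply that each pipeline stage maps a dataset to a dataset, so stages compose and can be added, swapped, or removed freely. Below is a minimal, hypothetical sketch of that shape in plain Python. It is *not* the Spark ML API (no SparkSession, no Params, no transformSchema); rows are plain dicts, and the class and method names are illustrative only.

```python
class Transformer:
    """A stage that maps a dataset (a list of row dicts) to a new dataset."""
    def transform(self, rows):
        raise NotImplementedError

class WithColumn(Transformer):
    """Append a derived column, loosely analogous to DataFrame.withColumn."""
    def __init__(self, name, fn):
        self.name, self.fn = name, fn
    def transform(self, rows):
        return [{**r, self.name: self.fn(r)} for r in rows]

class Filter(Transformer):
    """Keep only the rows matching a predicate."""
    def __init__(self, predicate):
        self.predicate = predicate
    def transform(self, rows):
        return [r for r in rows if self.predicate(r)]

class Pipeline(Transformer):
    """Compose stages; any stage can be inserted, swapped, or removed."""
    def __init__(self, stages):
        self.stages = list(stages)
    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

if __name__ == "__main__":
    data = [{"name": "a", "n": 1}, {"name": "b", "n": 5}]
    pipe = Pipeline([
        WithColumn("n2", lambda r: r["n"] * 2),
        Filter(lambda r: r["n2"] > 4),
    ])
    print(pipe.transform(data))  # [{'name': 'b', 'n': 5, 'n2': 10}]
```

Because a Pipeline is itself a Transformer, pipelines nest, which is exactly why the abstraction works just as well for plain ETL as for ML.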