This is great feedback to hear. I think there was discussion about moving Pipelines outside of ML at some point, but I'll have to spend more time to dig it up.
In the meantime, I thought I'd mention this JIRA here in case people have
feedback: https://issues.apache.org/jira/browse/SPARK-14033 --> It's about
merging the concepts of Estimator and Model. It would be a breaking change
in 2.0, but it would help to simplify the API and reduce code duplication.

Regarding making shared params public:
https://issues.apache.org/jira/browse/SPARK-7146 --> I'd like to do this
for 2.0, though maybe not for all shared params.

Joseph

On Mon, Mar 28, 2016 at 12:49 AM, Michał Zieliński
<zielinski.mich...@gmail.com> wrote:

> Hi Maciej,
>
> Absolutely. We had to copy HasInputCol/s and HasOutputCol/s (along with a
> couple of others, like HasProbabilityCol) to our repo. For most use cases
> that is good enough, but for some (e.g. operating on any Transformer that
> accepts either our HasInputCol or Spark's) it makes the code clunky.
> Opening those traits to the public would be a big gain.
>
> Thanks,
> Michal
>
> On 28 March 2016 at 07:44, Jacek Laskowski <ja...@japila.pl> wrote:
>
>> Hi,
>>
>> I've never developed any custom Transformer (or UnaryTransformer in
>> particular), but I'd be for it if that's the case.
>>
>> Jacek
>>
>> On 28.03.2016 at 6:54 AM, "Maciej Szymkiewicz" <mszymkiew...@gmail.com>
>> wrote:
>>
>>> Hi Jacek,
>>>
>>> In this context, don't you think it would be useful if at least some
>>> traits from org.apache.spark.ml.param.shared.sharedParams were public?
>>> HasInputCol(s) and HasOutputCol, for example. These are useful pretty
>>> much every time you create a custom Transformer.
>>>
>>> --
>>> Regards,
>>> Maciej Szymkiewicz
>>>
>>> On 03/26/2016 10:26 AM, Jacek Laskowski wrote:
>>> > Hi Joseph,
>>> >
>>> > Thanks for the response. I'm one who doesn't understand all the
>>> > hype/need for Machine Learning... yet, and I'm looking at the ML
>>> > space through Spark ML(lib) glasses. In the meantime I've got a few
>>> > assignments (in a project with Spark and Scala) that have required
>>> > quite extensive dataset manipulation.
>>> >
>>> > It was when I sank into using DataFrame/Dataset for data
>>> > manipulation, not RDD (I remember talking to Brian about how RDD is
>>> > an "assembly" language compared to the higher-level concept of
>>> > DataFrames with Catalyst and other optimizations). After a few days
>>> > with DataFrame I learnt he was so right! (Sorry Brian, it took me
>>> > longer to understand your point.)
>>> >
>>> > I started using DataFrames in far more places than one could ever
>>> > accept :-) I was so... carried away with DataFrames (esp. show vs
>>> > foreach(println), and UDFs via the udf() function).
>>> >
>>> > And then I moved to the Pipeline API and discovered Transformers,
>>> > and PipelineStage, which can create pipelines of DataFrame
>>> > manipulation. They read so well that I'm pretty sure people would
>>> > love using them more often, but... they belong to MLlib, so they are
>>> > part of the ML space (which not many devs have tackled yet). I
>>> > applied the approach of using withColumn to have a better debugging
>>> > experience (if I ever need it). I learnt it after having watched
>>> > your presentation about the Pipeline API. It was so helpful in my
>>> > RDD/DataFrame space.
>>> >
>>> > So, to promote a more extensive use of Pipelines, PipelineStages,
>>> > and Transformers, I was thinking about moving that part to the
>>> > SQL/DataFrame API where they really belong. If not, I think people
>>> > might miss the beauty of the very fine and so helpful Transformers.
>>> >
>>> > Transformers are *not* an ML thing -- they are a DataFrame thing and
>>> > should be where they really belong (for their greater adoption).
>>> >
>>> > What do you think?
>>> >
>>> > Regards,
>>> > Jacek Laskowski
>>> > ----
>>> > https://medium.com/@jaceklaskowski/
>>> > Mastering Apache Spark http://bit.ly/mastering-apache-spark
>>> > Follow me at https://twitter.com/jaceklaskowski
>>> >
>>> > On Sat, Mar 26, 2016 at 3:23 AM, Joseph Bradley
>>> > <jos...@databricks.com> wrote:
>>> >> There have been some comments about using Pipelines outside of ML,
>>> >> but I have not yet seen a real need for it. If a user does want to
>>> >> use Pipelines for non-ML tasks, they can still use Transformers +
>>> >> PipelineModels. Will that work?
>>> >>
>>> >> On Fri, Mar 25, 2016 at 8:05 AM, Jacek Laskowski <ja...@japila.pl>
>>> >> wrote:
>>> >>> Hi,
>>> >>>
>>> >>> After a few weeks with spark.ml now, I have come to the conclusion
>>> >>> that the Transformer concept from the Pipeline API
>>> >>> (spark.ml/MLlib) should be part of DataFrame (SQL), where it fits
>>> >>> better. Are there any plans to migrate the Transformer API (ML) to
>>> >>> DataFrame (SQL)?
>>> >>>
>>> >>> Regards,
>>> >>> Jacek Laskowski
>>> >>> ----
>>> >>> https://medium.com/@jaceklaskowski/
>>> >>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>>> >>> Follow me at https://twitter.com/jaceklaskowski
>>> >>>
>>> >>> ---------------------------------------------------------------------
>>> >>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> >>> For additional commands, e-mail: dev-h...@spark.apache.org
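
The trait-copying that Michał and Maciej describe above might look like the
following minimal Scala sketch. The trait bodies mirror what Spark's
package-private sharedParams traits provide; UpperCaseTransformer and its
column names are invented for illustration, and the transform signature
assumes the Spark 1.6-era DataFrame-based API discussed in this thread:

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.{Param, ParamMap, Params}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.upper
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Local copies of the (package-private) shared-param traits discussed above.
trait HasInputCol extends Params {
  final val inputCol: Param[String] =
    new Param[String](this, "inputCol", "input column name")
  final def getInputCol: String = $(inputCol)
}

trait HasOutputCol extends Params {
  final val outputCol: Param[String] =
    new Param[String](this, "outputCol", "output column name")
  final def getOutputCol: String = $(outputCol)
}

// A hypothetical custom Transformer built on those local copies.
class UpperCaseTransformer(override val uid: String)
    extends Transformer with HasInputCol with HasOutputCol {

  def this() = this(Identifiable.randomUID("upper"))

  def setInputCol(value: String): this.type = set(inputCol, value)
  def setOutputCol(value: String): this.type = set(outputCol, value)

  // Upper-cases the input column into a new output column.
  override def transform(dataset: DataFrame): DataFrame =
    dataset.withColumn($(outputCol), upper(dataset($(inputCol))))

  override def transformSchema(schema: StructType): StructType =
    StructType(schema.fields :+ StructField($(outputCol), StringType))

  override def copy(extra: ParamMap): UpperCaseTransformer = defaultCopy(extra)
}
```

Were HasInputCol/HasOutputCol public, the two local traits would simply be
imports instead of copies, which is the gain Michał points at.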
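
For readers less familiar with the withColumn/udf() style Jacek mentions,
this is roughly the kind of DataFrame manipulation being described (a
sketch; the `people` DataFrame and its "name"/"age" columns are invented):

```scala
import org.apache.spark.sql.functions.{col, udf}

// A UDF built via udf(), as referenced above.
val initials = udf { name: String =>
  name.split("\\s+").map(_.head).mkString
}

val enriched = people
  .withColumn("initials", initials(col("name"))) // UDF applied per row
  .withColumn("is_adult", col("age") >= 18)      // plain Column expression

enriched.show() // show() rather than foreach(println), as noted above
```

Each withColumn call is a small, inspectable step, which is the debugging
benefit Jacek describes.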
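
Joseph's suggestion of Transformers + PipelineModels for non-ML tasks can be
sketched with SQLTransformer (available since Spark 1.6), which wraps an
arbitrary SQL statement as a pipeline stage. The `people` DataFrame and the
statements are invented for illustration:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.SQLTransformer

// Two generic, non-ML DataFrame transformations expressed as Transformers;
// __THIS__ is SQLTransformer's placeholder for the input DataFrame.
val filterAdults = new SQLTransformer()
  .setStatement("SELECT * FROM __THIS__ WHERE age >= 18")
val addGreeting = new SQLTransformer()
  .setStatement("SELECT *, concat('Hello, ', name) AS greeting FROM __THIS__")

// Chained as an ordinary Pipeline; fit() is a pass-through for pure
// Transformers, and the resulting PipelineModel applies them in order.
val model = new Pipeline()
  .setStages(Array(filterAdults, addGreeting))
  .fit(people)
val result = model.transform(people)
```

No estimator is involved anywhere, so this is Pipelines used purely as
composable DataFrame manipulation, as discussed in the thread.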