This is great feedback to hear. I think there was discussion about moving Pipelines outside of ML at some point, but I'll have to spend more time to dig it up.
In the meantime, I thought I'd mention this JIRA here in case people have
feedback: https://issues.apache.org/jira/browse/SPARK-14033 --> It's about
merging the concepts of Estimator and Model. It would be a breaking change
in 2.0, but it would help to simplify the API and reduce code duplication.

Regarding making shared params public:
https://issues.apache.org/jira/browse/SPARK-7146 --> I'd like to do this
for 2.0, though maybe not for all shared params.

Joseph

On Mon, Mar 28, 2016 at 12:49 AM, Michał Zieliński
<zielinski.mich...@gmail.com> wrote:

> Hi Maciej,
>
> Absolutely. We had to copy HasInputCol/s and HasOutputCol/s (along with a
> couple of others, like HasProbabilityCol) to our repo. For most use cases
> that is good enough, but for some (e.g. operating on any Transformer that
> accepts either our HasInputCol or Spark's) it makes the code clunky.
> Opening those traits to the public would be a big gain.
>
> Thanks,
> Michal
>
> On 28 March 2016 at 07:44, Jacek Laskowski <ja...@japila.pl> wrote:
>
>> Hi,
>>
>> I've never developed any custom Transformer (or UnaryTransformer in
>> particular), but I'd be for it if that's the case.
>>
>> Jacek
>>
>> On 28.03.2016 at 6:54 AM, "Maciej Szymkiewicz" <mszymkiew...@gmail.com>
>> wrote:
>>
>>> Hi Jacek,
>>>
>>> In this context, don't you think it would be useful if at least some
>>> traits from org.apache.spark.ml.param.shared.sharedParams were public?
>>> HasInputCol(s) and HasOutputCol, for example. These are useful pretty
>>> much every time you create a custom Transformer.
>>>
>>> --
>>> Regards,
>>> Maciej Szymkiewicz
>>>
>>> On 03/26/2016 10:26 AM, Jacek Laskowski wrote:
>>> > Hi Joseph,
>>> >
>>> > Thanks for the response. I'm one who doesn't understand all the
>>> > hype/need for Machine Learning... yet, and I'm looking at the ML
>>> > space through Spark ML(lib) glasses. In the meantime I've got a few
>>> > assignments (in a project with Spark and Scala) that have required
>>> > quite extensive dataset manipulation.
>>> >
>>> > It was when I sank into using DataFrame/Dataset for data
>>> > manipulation, not RDD (I remember talking to Brian about how RDD is
>>> > an "assembly" language compared to the higher-level concept of
>>> > DataFrames with Catalyst and other optimizations). After a few days
>>> > with DataFrame I learnt he was so right! (Sorry Brian, it took me
>>> > longer to understand your point.)
>>> >
>>> > I started using DataFrames in far more places than one could ever
>>> > accept :-) I was so... carried away with DataFrames (esp. show vs
>>> > foreach(println), and UDFs via the udf() function).
>>> >
>>> > And then I moved to the Pipeline API and discovered Transformers,
>>> > and PipelineStage, which can create pipelines of DataFrame
>>> > manipulation. They read so well that I'm pretty sure people would
>>> > love using them more often, but... they belong to MLlib, so they are
>>> > part of the ML space (which not many devs have tackled yet). I
>>> > applied the approach of using withColumn to have a better debugging
>>> > experience (if I ever need it). I learnt it after having watched
>>> > your presentation about the Pipeline API. It was so helpful in my
>>> > RDD/DataFrame space.
>>> >
>>> > So, to promote a more extensive use of Pipelines, PipelineStages,
>>> > and Transformers, I was thinking about moving that part to the
>>> > SQL/DataFrame API where they really belong. If not, I think people
>>> > might miss the beauty of the very fine and so helpful Transformers.
>>> >
>>> > Transformers are *not* an ML thing -- they are a DataFrame thing and
>>> > should be where they really belong (for their greater adoption).
>>> >
>>> > What do you think?
>>> >
>>> > Regards,
>>> > Jacek Laskowski
>>> > ----
>>> > https://medium.com/@jaceklaskowski/
>>> > Mastering Apache Spark http://bit.ly/mastering-apache-spark
>>> > Follow me at https://twitter.com/jaceklaskowski
>>> >
>>> > On Sat, Mar 26, 2016 at 3:23 AM, Joseph Bradley
>>> > <jos...@databricks.com> wrote:
>>> >> There have been some comments about using Pipelines outside of ML,
>>> >> but I have not yet seen a real need for it. If a user does want to
>>> >> use Pipelines for non-ML tasks, they can still use Transformers +
>>> >> PipelineModels. Will that work?
>>> >>
>>> >> On Fri, Mar 25, 2016 at 8:05 AM, Jacek Laskowski <ja...@japila.pl>
>>> >> wrote:
>>> >>> Hi,
>>> >>>
>>> >>> After a few weeks with spark.ml now, I have come to the conclusion
>>> >>> that the Transformer concept from the Pipeline API
>>> >>> (spark.ml/MLlib) should be part of DataFrame (SQL), where it fits
>>> >>> better. Are there any plans to migrate the Transformer API (ML) to
>>> >>> DataFrame (SQL)?
>>> >>>
>>> >>> Regards,
>>> >>> Jacek Laskowski
>>> >>> ----
>>> >>> https://medium.com/@jaceklaskowski/
>>> >>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>>> >>> Follow me at https://twitter.com/jaceklaskowski
>>> >>>
>>> >>> ---------------------------------------------------------------------
>>> >>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> >>> For additional commands, e-mail: dev-h...@spark.apache.org
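
The trait-copying that Michał and Maciej describe above might look like the
following minimal Scala sketch. The trait bodies mirror what Spark's
package-private sharedParams traits provide; UpperCaseTransformer and its
column names are invented for illustration, and the transform signature
assumes the Spark 1.6-era DataFrame-based API discussed in this thread:

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.{Param, ParamMap, Params}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.upper
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Local copies of the (package-private) shared-param traits discussed above.
trait HasInputCol extends Params {
  final val inputCol: Param[String] =
    new Param[String](this, "inputCol", "input column name")
  final def getInputCol: String = $(inputCol)
}

trait HasOutputCol extends Params {
  final val outputCol: Param[String] =
    new Param[String](this, "outputCol", "output column name")
  final def getOutputCol: String = $(outputCol)
}

// A hypothetical custom Transformer built on those local copies.
class UpperCaseTransformer(override val uid: String)
    extends Transformer with HasInputCol with HasOutputCol {

  def this() = this(Identifiable.randomUID("upper"))

  def setInputCol(value: String): this.type = set(inputCol, value)
  def setOutputCol(value: String): this.type = set(outputCol, value)

  // Upper-cases the input column into a new output column.
  override def transform(dataset: DataFrame): DataFrame =
    dataset.withColumn($(outputCol), upper(dataset($(inputCol))))

  override def transformSchema(schema: StructType): StructType =
    StructType(schema.fields :+ StructField($(outputCol), StringType))

  override def copy(extra: ParamMap): UpperCaseTransformer = defaultCopy(extra)
}
```

Were HasInputCol/HasOutputCol public, the two local traits would simply be
imports instead of copies, which is the gain Michał points at.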
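
For readers less familiar with the withColumn/udf() style Jacek mentions,
this is roughly the kind of DataFrame manipulation being described (a
sketch; the `people` DataFrame and its "name"/"age" columns are invented):

```scala
import org.apache.spark.sql.functions.{col, udf}

// A UDF built via udf(), as referenced above.
val initials = udf { name: String =>
  name.split("\\s+").map(_.head).mkString
}

val enriched = people
  .withColumn("initials", initials(col("name"))) // UDF applied per row
  .withColumn("is_adult", col("age") >= 18)      // plain Column expression

enriched.show() // show() rather than foreach(println), as noted above
```

Each withColumn call is a small, inspectable step, which is the debugging
benefit Jacek describes.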
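
Joseph's suggestion of Transformers + PipelineModels for non-ML tasks can be
sketched with SQLTransformer (available since Spark 1.6), which wraps an
arbitrary SQL statement as a pipeline stage. The `people` DataFrame and the
statements are invented for illustration:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.SQLTransformer

// Two generic, non-ML DataFrame transformations expressed as Transformers;
// __THIS__ is SQLTransformer's placeholder for the input DataFrame.
val filterAdults = new SQLTransformer()
  .setStatement("SELECT * FROM __THIS__ WHERE age >= 18")
val addGreeting = new SQLTransformer()
  .setStatement("SELECT *, concat('Hello, ', name) AS greeting FROM __THIS__")

// Chained as an ordinary Pipeline; fit() is a pass-through for pure
// Transformers, and the resulting PipelineModel applies them in order.
val model = new Pipeline()
  .setStages(Array(filterAdults, addGreeting))
  .fit(people)
val result = model.transform(people)
```

No estimator is involved anywhere, so this is Pipelines used purely as
composable DataFrame manipulation, as discussed in the thread.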