[ https://issues.apache.org/jira/browse/SPARK-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188919#comment-14188919 ]

Evan Sparks commented on SPARK-3573:
------------------------------------

This comment originally appeared on the PR associated with this feature. 
(https://github.com/apache/spark/pull/2919):

I've looked at the code here, and it basically seems reasonable. One high-level 
concern I have is around the programming pattern that this encourages: complex 
nesting of otherwise simple structures, which may make it difficult to program 
against Datasets in sufficiently complicated applications.

A 'dataset' is now a collection of Row, where we have the guarantee that all 
rows in a Dataset conform to the same schema. A schema is a list of (name, 
type) pairs which describe the attributes available in the dataset. This seems 
like a good thing to me, and is pretty much what we described in MLI (and how 
conventional databases have been structured forever). So far, so good.
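
To make that concrete, here is a purely illustrative sketch of the model (the 
names are mine, not the actual Spark API): a schema is an ordered list of 
(name, type) pairs, and every row in a dataset supplies one value per attribute.
{code}
// Illustrative only: a schema as (name, type) pairs, rows conforming to it.
val schema: Seq[(String, Class[_])] = Seq("a" -> classOf[String], "b" -> classOf[Double])
val rows: Seq[Seq[Any]] = Seq(Seq("foo", 1.0), Seq("bar", 2.0))  // both rows match the schema
{code}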

The concern that I have is that we are now encouraging these attributes to be 
complex types. For example, where I might have had
val x = Schema(("a", classOf[String]), ("b", classOf[Double]), ..., ("z", classOf[Double]))
this would become
val x = Schema(("a", classOf[String]), ("bGroup", classOf[Vector]), ..., ("zGroup", classOf[Vector]))

So, great, my schema now has these vector things in them, which I can create 
separately, pass around, etc.

This clearly has its merits:
1) Features are grouped together logically based on the process that creates 
them.
2) Managing one short schema where each record is comprised of a few large 
objects (say, 4 vectors, each of length 1000) is probably easier than managing 
a really big schema comprised of lots of small objects (say, 4000 doubles).

But there are some major drawbacks:
1) Why stop at only one level of nesting? Why not have Vector[Vector]?
2) How do learning algorithms, like SVM or PCA, deal with these Datasets? Is 
there an implicit conversion that flattens these things to RDD[LabeledPoint] 
(see the sketch further below)? Do we want to guarantee these semantics?
3) Manipulating and subsetting nested schemas like this might be tricky. Where 
before I might be able to write:

val x: Dataset = input.select(Seq(0,1,2,4,180,181,1000,1001,1002))

now I might have to write

val groupSelections = Seq(Seq(0,1,2,4), Seq(0,1), Seq(0,1,2))
val x: Dataset = groupSelections.zip(input.columns).map { case (gs, col) => col(gs) }

Ignoring the raw syntax and semantics of how you might actually map an operation 
over the columns of a Dataset and get back a well-structured dataset, I think 
this illustrates two conflicting points:
1) In the first example - presumably all the work goes into figuring out which 
subset of features you want in this really wide feature space.
2) In the second example - there’s a lot of gymnastics involved in subsetting 
feature groups. I think it’s clear that working with lots of feature groups 
could get unwieldy pretty quickly.
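
To make point 2 above concrete, here is a minimal sketch of what an explicit 
flattening to RDD[LabeledPoint] might look like, assuming (hypothetically) a 
dataset whose rows hold a label column followed by two vector-typed feature 
group columns; the column positions and names are illustrative only:
{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical layout: column 0 = label, columns 1 and 2 = vector-valued feature groups.
val labeledPoints = dataset.map { row =>
  val label  = row(0).asInstanceOf[Double]
  val bGroup = row(1).asInstanceOf[Vector]
  val zGroup = row(2).asInstanceOf[Vector]
  // Flatten the nested feature groups into one wide feature vector.
  LabeledPoint(label, Vectors.dense(bGroup.toArray ++ zGroup.toArray))
}
{code}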

If we look at R or pandas/scikit-learn as examples of projects that have 
(arguably quite successfully) dealt with these interface issues, there is one 
basic pattern: learning algorithms expect big tables of numbers as input. Even 
here, there are some important differences:

For example, in scikit-learn, categorical features aren’t supported directly by 
most learning algorithms. Instead, users are responsible for getting data from 
“table with heterogeneously typed columns” to “table of numbers” with something 
like OneHotEncoder and other feature transformers. In R, on the other hand, 
categorical features are (sometimes frustratingly) first-class citizens by 
virtue of the “factor” data type - which is essentially an enum. Most 
out-of-the-box learning algorithms (like glm()) accept data frames with 
categorical inputs and handle them sensibly - implicitly one-hot encoding (or 
creating dummy variables, if you prefer) the categorical features.
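
As a rough illustration of the kind of step those transformers perform (this is 
a standalone sketch, not the scikit-learn or R implementation):
{code}
// Hand-rolled one-hot encoding of a categorical column: each distinct level
// becomes a 0/1 indicator column (a.k.a. dummy variable).
val countries = Seq("US", "FR", "US", "DE")           // a categorical column
val levels = countries.distinct.zipWithIndex.toMap    // level -> indicator column index
val encoded: Seq[Array[Double]] = countries.map { c =>
  val row = Array.fill(levels.size)(0.0)
  row(levels(c)) = 1.0
  row
}
// encoded: one row of indicators per original value, ready for a numeric learner
{code}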

While I have a slight preference for representing things as big flat tables, I 
would be fine coding either way - but I wanted to raise the issue for 
discussion here before the interfaces are set in stone.

> Dataset
> -------
>
>                 Key: SPARK-3573
>                 URL: https://issues.apache.org/jira/browse/SPARK-3573
>             Project: Spark
>          Issue Type: Sub-task
>          Components: MLlib
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>            Priority: Critical
>
> This JIRA is for discussion of ML dataset, essentially a SchemaRDD with extra 
> ML-specific metadata embedded in its schema.
> .Sample code
> Suppose we have training events stored on HDFS and user/ad features in Hive, 
> and we want to assemble features for training and then apply a decision tree.
> The proposed pipeline with dataset looks like the following (it still needs 
> refinement):
> {code}
> sqlContext.jsonFile("/path/to/training/events", 0.01).registerTempTable("event")
> val training = sqlContext.sql("""
>   SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId, event.action AS label,
>          user.gender AS userGender, user.country AS userCountry, user.features AS userFeatures,
>          ad.targetGender AS targetGender
>     FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = ad.id;""").cache()
> val indexer = new Indexer()
> val interactor = new Interactor()
> val fvAssembler = new FeatureVectorAssembler()
> val treeClassifier = new DecisionTreeClassifier()
> val paramMap = new ParamMap()
>   .put(indexer.features, Map("userCountryIndex" -> "userCountry"))
>   .put(indexer.sortByFrequency, true)
>   .put(interactor.features, Map("genderMatch" -> Array("userGender", "targetGender")))
>   .put(fvAssembler.features, Map("features" -> Array("genderMatch", "userCountryIndex", "userFeatures")))
>   .put(fvAssembler.dense, true)
>   .put(treeClassifier.maxDepth, 4) // By default, the classifier recognizes the "features" and "label" columns.
> val pipeline = Pipeline.create(indexer, interactor, fvAssembler, treeClassifier)
> val model = pipeline.fit(training, paramMap)
> sqlContext.jsonFile("/path/to/events", 0.01).registerTempTable("event")
> val test = sqlContext.sql("""
>   SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId,
>          user.gender AS userGender, user.country AS userCountry, user.features AS userFeatures,
>          ad.targetGender AS targetGender
>     FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = ad.id;""")
> val prediction = model.transform(test).select('eventId, 'prediction)
> {code}


