[ https://issues.apache.org/jira/browse/SPARK-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188919#comment-14188919 ]
Evan Sparks commented on SPARK-3573:
------------------------------------

This comment originally appeared on the PR associated with this feature (https://github.com/apache/spark/pull/2919):

I've looked at the code here, and it basically seems reasonable. One high-level concern I have is the programming pattern this encourages: complex nesting of otherwise simple structure, which may make it difficult to program against Datasets for sufficiently complicated applications.

A 'dataset' is now a collection of Row, with the guarantee that all rows in a Dataset conform to the same schema. A schema is a list of (name, type) pairs which describe the attributes available in the dataset. This seems like a good thing to me, and is pretty much what we described in MLI (and how conventional databases have been structured forever). So far, so good.

The concern I have is that we are now encouraging these attributes to be complex types. For example, where I might have had

{code}
val x = Schema(("a", classOf[String]), ("b", classOf[Double]), ..., ("z", classOf[Double]))
{code}

this would become

{code}
val x = Schema(("a", classOf[String]), ("bGroup", classOf[Vector]), ..., ("zGroup", classOf[Vector]))
{code}

So, great, my schema now has these vector things in it, which I can create separately, pass around, etc. This clearly has its merits:
1) Features are grouped together logically based on the process that creates them.
2) Managing one short schema where each record is comprised of a few large objects (say, 4 vectors, each of length 1000) is probably easier than managing a really big schema comprised of lots of small objects (say, 4000 doubles).

But there are some major drawbacks:
1) Why stop at one level of nesting? Why not have Vector[Vector]?
2) How do learning algorithms, like SVM or PCA, deal with these Datasets? Is there an implicit conversion that flattens them to RDD[LabeledPoint]? Do we want to guarantee those semantics?
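To make the flattening question concrete, here is a minimal Python sketch (not Spark API; all names are hypothetical) of what an implicit "nested feature groups to flat label/features pair" conversion could look like, with a record modeled as a plain dict:

```python
# Hypothetical sketch: flatten a record whose schema nests feature groups
# (vector-valued columns) into one flat numeric feature list, the way an
# implicit RDD[LabeledPoint]-style conversion might. Illustrative only.

def flatten_record(record, feature_groups):
    """Concatenate the listed columns into one flat list of numbers."""
    flat = []
    for col in feature_groups:
        value = record[col]
        # A grouped column holds a vector; a plain column holds a scalar.
        if isinstance(value, (list, tuple)):
            flat.extend(value)
        else:
            flat.append(value)
    return flat

record = {"label": 1.0, "bGroup": [0.1, 0.2, 0.3], "cGroup": [4.0, 5.0]}
features = flatten_record(record, ["bGroup", "cGroup"])
label_and_features = (record["label"], features)
```

Even in this toy form, the semantics questions from above show up: the result depends on the order in which groups are listed, and a second level of nesting would force a recursive flatten.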
3) Manipulating and subsetting nested schemas like this might be tricky. Where before I might be able to write

{code}
val x: Dataset = input.select(Seq(0, 1, 2, 4, 180, 181, 1000, 1001, 1002))
{code}

now I might have to write

{code}
val groupSelections = Seq(Seq(0, 1, 2, 4), Seq(0, 1), Seq(0, 1, 2))
val x: Dataset = groupSelections.zip(input.columns).map { case (gs, col) => col(gs) }
{code}

Ignoring the raw syntax and semantics of how you might actually map an operation over the columns of a Dataset and get back a well-structured dataset, I think this makes two conflicting points:
1) In the first example, presumably all the work goes into figuring out which subset of features you want in this really wide feature space.
2) In the second example, there's a lot of gymnastics that goes into subsetting feature groups.

I think it's clear that working with lots of feature groups might get unreasonable pretty quickly.

If we look at R or pandas/scikit-learn as examples of projects that have (arguably quite successfully) dealt with these interface issues, there is one basic pattern: learning algorithms expect big tables of numbers as input. Even here, there are some important differences. For example, in scikit-learn, categorical features aren't supported directly by most learning algorithms. Instead, users are responsible for getting data from "table with heterogeneously typed columns" to "table of numbers" with something like OneHotEncoder and other feature transformers. In R, on the other hand, categorical features are (sometimes frustratingly) first-class citizens by virtue of the "factor" data type, which is essentially an enum. Most out-of-the-box learning algorithms (like glm()) accept data frames with categorical inputs and handle them sensibly, implicitly one-hot encoding (or creating dummy variables, if you prefer) the categorical features.
While I have a slight preference for representing things as big flat tables, I would be fine coding it either way - but I wanted to raise the issue for discussion here before the interfaces are set in stone.

> Dataset
> -------
>
>                 Key: SPARK-3573
>                 URL: https://issues.apache.org/jira/browse/SPARK-3573
>             Project: Spark
>          Issue Type: Sub-task
>          Components: MLlib
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>            Priority: Critical
>
> This JIRA is for discussion of ML dataset, essentially a SchemaRDD with extra
> ML-specific metadata embedded in its schema.
> .Sample code
> Suppose we have training events stored on HDFS and user/ad features in Hive,
> we want to assemble features for training and then apply decision tree.
> The proposed pipeline with dataset looks like the following (need more
> refinements):
> {code}
> sqlContext.jsonFile("/path/to/training/events", 0.01).registerTempTable("event")
> val training = sqlContext.sql("""
>   SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId,
>          event.action AS label,
>          user.gender AS userGender, user.country AS userCountry,
>          user.features AS userFeatures,
>          ad.targetGender AS targetGender
>   FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = ad.id;""").cache()
> val indexer = new Indexer()
> val interactor = new Interactor()
> val fvAssembler = new FeatureVectorAssembler()
> val treeClassifier = new DecisionTreeClassifier()
> val paramMap = new ParamMap()
>   .put(indexer.features, Map("userCountryIndex" -> "userCountry"))
>   .put(indexer.sortByFrequency, true)
>   .put(interactor.features, Map("genderMatch" -> Array("userGender", "targetGender")))
>   .put(fvAssembler.features, Map("features" -> Array("genderMatch", "userCountryIndex", "userFeatures")))
>   .put(fvAssembler.dense, true)
>   .put(treeClassifier.maxDepth, 4) // By default, classifier recognizes "features" and "label" columns.
> val pipeline = Pipeline.create(indexer, interactor, fvAssembler, treeClassifier)
> val model = pipeline.fit(training, paramMap)
> sqlContext.jsonFile("/path/to/events", 0.01).registerTempTable("event")
> val test = sqlContext.sql("""
>   SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId,
>          user.gender AS userGender, user.country AS userCountry,
>          user.features AS userFeatures,
>          ad.targetGender AS targetGender
>   FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = ad.id;""")
> val prediction = model.transform(test).select('eventId, 'prediction)
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)