[GitHub] spark pull request: [SPARK-3572] [sql] [mllib] User-Defined Types ...

etrain Wed, 29 Oct 2014 10:39:55 -0700

Github user etrain commented on the pull request:

    https://github.com/apache/spark/pull/2919#issuecomment-60969356
  
    I've looked at the code here, and basically seems reasonable. One 
high-level concern I have is around the programming pattern that this 
encourages: complex nesting of otherwise simple structure that may make it 
difficult to program against Datasets for sufficiently complicated applications.
    
    A 'dataset' is now a collection of Row, where we have the guarantee that 
all rows in a Dataset conform to the same schema. A schema is a list of (name, 
type) pairs which describe the attributes available in the dataset. This seems 
like a good thing to me, and is pretty much what we described in MLI (and how 
conventional databases have been structured forever). So far, so good. 
    
    The concern that I have is that we are now encouraging these attributes to 
be complex types. For example, where I might have had 
    val x = Schema(('a', classOf[String]), ('b', classOf[Double]), ..., ("z", 
classOf[Double]))
    This would become
    val x = Schema(('a', classOf[String]), ('bGroup', classOf[Vector]), .., 
("zGroup", classOf[Vector]))
    
    So, great, my schema now has these vector things in them, which I can 
create separately, pass around, etc.
    
    This clearly has its merits:
    1) Features are groups together logically based on the process that creates 
them.
    2) Managing one short schema where each record is comprised of a few large 
objects (say, 4 vectors, each of length 1000) is probably easier than managing 
a really big schema comprised of lots small objects (say, 4000 doubles).
    
    But, there are some major drawbacks
    1) Why only stop at one level of nesting? Why not have Vector[Vector]? 
    2) How do learning algorithms, like SVM or PCA deal with these Datasets? Is 
there an implicit conversion that flattens these things to RDD[LabeledPoint]? 
Do we want to guarantee these semantics?
    3) Manipulating and subsetting nested schemas like this might be tricky. 
Where before I might be able to write:
    
    val x: Dataset = input.select(Seq(0,1,2,4,180,181,1000,1001,1002))
    now I might have to write
    val groupSelections = Seq(Seq(0,1,2,4),Seq(0,1),Seq(0,1,2))
    val x: Dataset = groupSelections.zip(input.columns).map {case (gs, col) => 
col(gs) }
    
    Ignoring raw syntax and semantics of how you might actually map an 
operation over the columns of a Dataset and get back a well-structured dataset, 
I think this makes two conflicting points:
    1) In the first example - presumably all the work goes into figuring out 
what the subset of features you want is in this really wide feature space.
    2) In the second example - thereâs a lot of gymnastics that goes into 
subsetting feature groups. I think itâs clear that working with lots of 
feature groups might get unreasonable pretty quickly.
    
    If we look at R or pandas/scikit-learn as examples of projects that have 
(arguably quite successfully) dealt with these interface issues, there is one 
basic pattern: learning algorithms expect big tables of numbers as input. Even 
here, there are some important differences:
    
    For example, in scikit-learn, categorical features arenât supported 
directly by most learning algorithms. Instead, users are responsible for 
getting data from âtable with heterogenously typed columnsâ to âtable of 
numbers.â with something like OneHotEncoder and other feature transformers. 
In R, on the other hand, categorical features are (sometimes frustratingly) 
first class citizens by virtue of the âfactorâ data type - which is 
essentially and enum. Most out-of-the-box learning algorithms (like glm()) 
accept data frames with categorical inputs and handle them sensibly - 
implicitly one hot encoding (or creating dummy variables, if you prefer) the 
categorical features.
    
    While I have a slight preference for representing things as big flat 
tables, I would be fine coding either way - but I wanted to raise the issue for 
discussion here before the interfaces are set in stone.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request: [SPARK-3572] [sql] [mllib] User-Defined Types ...

Reply via email to