Github user etrain commented on the pull request:
https://github.com/apache/spark/pull/2919#issuecomment-60969356
I've looked at the code here, and basically seems reasonable. One
high-level concern I have is around the programming pattern that this
encourages: complex nesting of otherwise simple structure that may make it
difficult to program against Datasets for sufficiently complicated applications.
A 'dataset' is now a collection of Row, where we have the guarantee that
all rows in a Dataset conform to the same schema. A schema is a list of (name,
type) pairs which describe the attributes available in the dataset. This seems
like a good thing to me, and is pretty much what we described in MLI (and how
conventional databases have been structured forever). So far, so good.
The concern that I have is that we are now encouraging these attributes to
be complex types. For example, where I might have had
val x = Schema(('a', classOf[String]), ('b', classOf[Double]), ..., ("z",
classOf[Double]))
This would become
val x = Schema(('a', classOf[String]), ('bGroup', classOf[Vector]), ..,
("zGroup", classOf[Vector]))
So, great, my schema now has these vector things in them, which I can
create separately, pass around, etc.
This clearly has its merits:
1) Features are groups together logically based on the process that creates
them.
2) Managing one short schema where each record is comprised of a few large
objects (say, 4 vectors, each of length 1000) is probably easier than managing
a really big schema comprised of lots small objects (say, 4000 doubles).
But, there are some major drawbacks
1) Why only stop at one level of nesting? Why not have Vector[Vector]?
2) How do learning algorithms, like SVM or PCA deal with these Datasets? Is
there an implicit conversion that flattens these things to RDD[LabeledPoint]?
Do we want to guarantee these semantics?
3) Manipulating and subsetting nested schemas like this might be tricky.
Where before I might be able to write:
val x: Dataset = input.select(Seq(0,1,2,4,180,181,1000,1001,1002))
now I might have to write
val groupSelections = Seq(Seq(0,1,2,4),Seq(0,1),Seq(0,1,2))
val x: Dataset = groupSelections.zip(input.columns).map {case (gs, col) =>
col(gs) }
Ignoring raw syntax and semantics of how you might actually map an
operation over the columns of a Dataset and get back a well-structured dataset,
I think this makes two conflicting points:
1) In the first example - presumably all the work goes into figuring out
what the subset of features you want is in this really wide feature space.
2) In the second example - thereâs a lot of gymnastics that goes into
subsetting feature groups. I think itâs clear that working with lots of
feature groups might get unreasonable pretty quickly.
If we look at R or pandas/scikit-learn as examples of projects that have
(arguably quite successfully) dealt with these interface issues, there is one
basic pattern: learning algorithms expect big tables of numbers as input. Even
here, there are some important differences:
For example, in scikit-learn, categorical features arenât supported
directly by most learning algorithms. Instead, users are responsible for
getting data from âtable with heterogenously typed columnsâ to âtable of
numbers.â with something like OneHotEncoder and other feature transformers.
In R, on the other hand, categorical features are (sometimes frustratingly)
first class citizens by virtue of the âfactorâ data type - which is
essentially and enum. Most out-of-the-box learning algorithms (like glm())
accept data frames with categorical inputs and handle them sensibly -
implicitly one hot encoding (or creating dummy variables, if you prefer) the
categorical features.
While I have a slight preference for representing things as big flat
tables, I would be fine coding either way - but I wanted to raise the issue for
discussion here before the interfaces are set in stone.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]