Github user marmbrus commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-40855861
Thanks for doing this!
I think we are actually okay for `intersect` and `subtract`, as anything in
the result must be a row that was in the original RDD and thus must have a
correct schema. If you intersect with a different schema you will get back an
empty RDD. If you subtract with a different schema the subtraction will be a
no-op and you'll get back the original RDD.
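A toy model of that argument, using plain Scala collections in place of RDDs (the names and row representation here are purely illustrative, not Spark APIs):

```scala
// Rows modeled as Seq[Any]; `people` and `orders` are hypothetical datasets.
val people = Seq(Seq("alice", 30), Seq("bob", 25))   // schema: (name, age)
val orders = Seq(Seq(1, "widget"), Seq(2, "gadget")) // a different schema

// Intersect across different schemas: no row can match, so the result is
// empty -- and any row that did survive would have come from `people`,
// so it would still carry the correct schema.
val intersected = people.intersect(orders)

// Subtract across different schemas: nothing matches, so nothing is
// removed and the original data (and schema) comes back unchanged.
val subtracted = people.diff(orders)
```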
Union is a little more troublesome. We could check the schema and throw an
error if they don't match, but that is kinda changing the semantics relative
to the standard `union` call on RDD. Also, when we do a SQL union we do type
widening, so just calling RDD union and returning a `SchemaRDD` is a little
weird.
So, I'd propose we leave union out, as users who want SQL semantics here
can already call `unionAll`. @mateiz might have thoughts here too.
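A small sketch of the contrast, again with plain collections standing in for RDDs (the column values are made up for illustration):

```scala
// Imagine an IntegerType column and a DoubleType column being unioned.
val ints = Seq(1, 2)
val doubles = Seq(3.5)

// Raw RDD-style union: just concatenation. The static element type is
// only the least upper bound (AnyVal here) and no values are converted.
val raw: Seq[AnyVal] = ints ++ doubles

// SQL-style union: type widening promotes the Int column to Double
// before combining, so the result has a single uniform type.
val widened: Seq[Double] = ints.map(_.toDouble) ++ doubles
```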
A few other methods we can add that also don't change the schema:
- `distinct()` with no `numPartitions`
- `repartition`
- `setName(...)` ?
- `randomSplit` (not sure if this is okay since `Array` is invariant)
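The `randomSplit` concern is that `RDD.randomSplit` returns `Array[RDD[T]]`, and because `Array` is invariant an override can't simply narrow the element type. A minimal illustration, with throwaway class names:

```scala
// Invariance: Array[Derived] is NOT a subtype of Array[Base].
class Base
class Derived extends Base

val derivedArr: Array[Derived] = Array(new Derived)
// val baseArr: Array[Base] = derivedArr // does not compile: Array is invariant

// A covariant container like Seq has no such restriction:
val baseSeq: Seq[Base] = Seq(new Derived)
```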