Github user marmbrus commented on the pull request:
https://github.com/apache/spark/pull/448#issuecomment-40855861
Thanks for doing this!
I think we are actually okay for `intersect` and `subtract`, as anything in
the result must be a row that was in the original RDD and thus must have a
correct schema. If you intersect with a different schema you will get back an
empty RDD. If you subtract with a different schema the subtraction will be a
no-op and you'll get back the original RDD.
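A toy model of that argument, using plain Scala collections in place of RDDs (the names and row representation here are purely illustrative, not Spark APIs):

```scala
// Rows modeled as Seq[Any]; `people` and `orders` are hypothetical datasets.
val people = Seq(Seq("alice", 30), Seq("bob", 25))   // schema: (name, age)
val orders = Seq(Seq(1, "widget"), Seq(2, "gadget")) // a different schema

// Intersect across different schemas: no row can match, so the result is
// empty -- and any row that did survive would have come from `people`,
// so it would still carry the correct schema.
val intersected = people.intersect(orders)

// Subtract across different schemas: nothing matches, so nothing is
// removed and the original data (and schema) comes back unchanged.
val subtracted = people.diff(orders)
```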
Union is a little more troublesome. We could check the schema and throw an
error if they don't match, but that is kinda changing the semantics relative
to the standard `union` call on RDD. Also, when we do a SQL union we do type
widening, so just calling RDD union and returning a `SchemaRDD` is a little
weird.
So, I'd propose we leave union out, as users who want SQL semantics here
can already call `unionAll`. @mateiz might have thoughts here too.
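A small sketch of the contrast, again with plain collections standing in for RDDs (the column values are made up for illustration):

```scala
// Imagine an IntegerType column and a DoubleType column being unioned.
val ints = Seq(1, 2)
val doubles = Seq(3.5)

// Raw RDD-style union: just concatenation. The static element type is
// only the least upper bound (AnyVal here) and no values are converted.
val raw: Seq[AnyVal] = ints ++ doubles

// SQL-style union: type widening promotes the Int column to Double
// before combining, so the result has a single uniform type.
val widened: Seq[Double] = ints.map(_.toDouble) ++ doubles
```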
A few other methods we can add that also don't change the schema:
- `distinct()` with no `numPartitions`
- `repartition`
- `setName(...)` ?
- `randomSplit` (not sure if this is okay since `Array` is invariant)
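The `randomSplit` concern is that `RDD.randomSplit` returns `Array[RDD[T]]`, and because `Array` is invariant an override can't simply narrow the element type. A minimal illustration, with throwaway class names:

```scala
// Invariance: Array[Derived] is NOT a subtype of Array[Base].
class Base
class Derived extends Base

val derivedArr: Array[Derived] = Array(new Derived)
// val baseArr: Array[Base] = derivedArr // does not compile: Array is invariant

// A covariant container like Seq has no such restriction:
val baseSeq: Seq[Base] = Seq(new Derived)
```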