[GitHub] spark pull request: [SPARK-1061] assumePartitioned

squito Sun, 08 Feb 2015 18:35:23 -0800

Github user squito commented on the pull request:

    https://github.com/apache/spark/pull/4449#issuecomment-73449430
  
    @pwendell it's a good question. I was wondering the same thing as I was
writing those unit tests, and was going to comment on the JIRA about it.  It is
definitely annoying to have to write a custom input format -- but I only need
to do that to turn off splits.  This comes up on the user list every once in a
while too -- should we just add versions of `sc.hadoopFile`, `sc.textFile`,
and `sc.sequenceFile` that turn off splits?  Unfortunately, I don't think it
makes sense to pass an `assumedPartitioner` directly as an argument to those
functions, since you really need a map step in the middle to extract the key.
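
For illustration, here is a minimal sketch of the workaround described above: a custom input format whose only job is to turn off splits, followed by a map step that extracts the key. It is not code from the PR. `UnsplittableTextInputFormat` and `NoSplitExample` are hypothetical names, the input path is taken from the first program argument, and the tab-separated `key<TAB>value` line layout is an assumption.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical name: a TextInputFormat that never splits a file,
// so each input file is read as a single Spark partition.
class UnsplittableTextInputFormat extends TextInputFormat {
  override protected def isSplitable(fs: FileSystem, file: Path): Boolean = false
}

object NoSplitExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("no-split-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // One partition per file, because the input format refuses to split.
      val lines = sc.hadoopFile[LongWritable, Text, UnsplittableTextInputFormat](args(0))

      // The map step "in the middle": pull the key out of each record.
      // Assumes lines of the form key<TAB>value; a line without a tab
      // becomes a key with an empty value.
      val keyed = lines.map { case (_, text) =>
        text.toString.split("\t", 2) match {
          case Array(k, v) => (k, v)
          case Array(k)    => (k, "")
        }
      }

      println(s"partitions: ${keyed.partitions.length}")
    } finally {
      sc.stop()
    }
  }
}
```

Because the key only exists after the `map`, a partitioner could not be attached at the `sc.hadoopFile` call itself, which is the point made above.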
    
    Really this gets to a more general question: when do we add these
"convenience" methods to RDD?  Given that this requires application logic to
track which partitioner to use, I doubt it will ever be used by other code
within Spark itself.  But I would still make the case for its inclusion, since
(a) it enables a big optimization that is not obvious to most users; by
promoting it to a function within Spark itself, users are more likely to be
aware of it.  (b) It's a little tricky to get right -- I think the `verify` step
is really important to make sure this doesn't lead to completely wrong results
in the user app.  And (c) I think it's a common use case.  Not so common that it
would make it into Spark tutorials, or even into the daily use of an experienced
Spark user -- but I imagine it has a place in every "batch" use of Spark, where
some big dataset lives on HDFS between Spark contexts.
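
The idea behind a `verify` step can be sketched without Spark: every key found in partition `i` should be a key the partitioner would itself assign to partition `i`. The snippet below is only an illustration under that reading, not the PR's implementation. `HashPart` is a hypothetical stand-in that mimics hash-modulo partitioning, and partitions are modelled as plain in-memory sequences.

```scala
// Hypothetical stand-in for a hash partitioner: non-negative hashCode modulo n.
final case class HashPart(numPartitions: Int) {
  def getPartition(key: Any): Int = {
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod
  }
}

object VerifySketch {
  // Returns the first (partitionIndex, key) whose key does not belong in the
  // partition it was found in, or None if the assumed partitioning holds.
  def firstMisplaced[K, V](partitions: Seq[Seq[(K, V)]], p: HashPart): Option[(Int, K)] = {
    val misplaced = for {
      (part, idx) <- partitions.zipWithIndex
      (k, _)      <- part
      if p.getPartition(k) != idx
    } yield (idx, k)
    misplaced.headOption
  }

  def main(args: Array[String]): Unit = {
    val p = HashPart(2)
    // Even keys in partition 0, odd keys in partition 1: consistent with p.
    val good = Seq(Seq((0, "a"), (2, "b")), Seq((1, "c")))
    // Key 1 sits in partition 0, so the assumption is wrong for this data.
    val bad = Seq(Seq((0, "a"), (1, "c")), Seq((2, "b")))

    assert(firstMisplaced(good, p).isEmpty)
    assert(firstMisplaced(bad, p) == Some((0, 1)))
  }
}
```

Without such a check, a wrongly assumed partitioner would let later key-based operations skip a shuffle they actually need, which is how the "completely wrong results" mentioned above could arise.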
    
    OTOH, we could just put this in some general location with spark-examples
and leave it out of Spark itself.  In that case, I guess the only change we
need to make is the one to `HadoopRDD` to sort the partitions.


