[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

piaozhexiu Tue, 15 Sep 2015 16:31:31 -0700

Github user piaozhexiu commented on the pull request:

    https://github.com/apache/spark/pull/8512#issuecomment-140580241
  
    @yhuai @liancheng thank you very much for reviewing! I'll update my PR to 
incorporate your comments soon.
    
    To answer your questions:
    > I am trying to catch up with the context at here. So, the current version 
of PR reflects the curve e of this plot, right?
    
    Yes.
    
    > If we remove the change in UnionRDD but use AWS's lib, the curve will be 
c?
    
    Curve "c" actually represents a different implementation that I abandoned 
after discussion with @davies. Instead of making a `listObjects` call per 
partition, it bulk-lists objects under the longest common prefix of all 
partitions and then filters out the unneeded results afterwards. It turned out 
not to scale well, so I switched to the parallel bulk-listing approach ("e").
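
A minimal sketch of the two listing strategies, to make the difference concrete. It is not the PR's code: `KEYS`, `list_prefix`, `list_common_prefix`, and `list_parallel` are hypothetical names, and an in-memory list stands in for real S3 listing requests. One plausible reason the common-prefix approach scales poorly (an assumption, not something stated above) is that a short shared prefix forces the listing to return many objects that are then thrown away.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for a bucket; not real S3, not the PR's code.
KEYS = [
    "table/dt=2015-09-01/part-0",
    "table/dt=2015-09-01/part-1",
    "table/dt=2015-09-02/part-0",
    "table/dt=2015-09-03/part-0",
    "table/dt=2015-10-01/part-0",
]


def list_prefix(prefix):
    """Stand-in for one bulk-listing request: every key under `prefix`."""
    return [key for key in KEYS if key.startswith(prefix)]


def list_common_prefix(partitions):
    """Approach "c": list once under the longest common prefix, then filter.

    Returns the wanted keys and the number of keys the listing returned.
    """
    common = os.path.commonprefix(partitions)
    listed = list_prefix(common)
    wanted = [k for k in listed if any(k.startswith(p) for p in partitions)]
    return wanted, len(listed)


def list_parallel(partitions, workers=4):
    """Approach "e": one listing per partition, issued concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_partition = list(pool.map(list_prefix, partitions))
    # Flatten while keeping the partition order.
    return [key for keys in per_partition for key in keys]


if __name__ == "__main__":
    partitions = ["table/dt=2015-09-01/", "table/dt=2015-10-01/"]
    wanted, listed_count = list_common_prefix(partitions)
    print(len(wanted), "wanted out of", listed_count, "listed")
    print(list_parallel(partitions) == wanted)
```

With these two partitions the shared prefix is only `table/dt=2015-`, so the single listing returns every key in the sketch, while the per-partition listing touches only the keys that are actually needed.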
    
    > If we keep the change in UnionRDD but do not use AWS's lib, the curve 
will be b?
    
    Probably true. Curve "b" parallelizes file listing via 
`FileInputFormat.getSplits` rather than `UnionRDD`, but I expect a similar 
result. I also abandoned "b" after discussion with @davies, in favor of simpler 
code.
    
    > Do we need to have two versions of it? One for mapred API (HadoopRDD) and 
one for mapreduce API (NewHadoopRDD)? Or, we can have a single version that 
works with both APIs?
    
    That's a good question. Since I import the old `mapred` packages in the 
source code, I don't think we can have a single version that works with both 
the old and new APIs.
    
    In fact, I have always been curious about your plan for migrating from 
`HadoopRDD` to `NewHadoopRDD`. I imagine you will do it when dropping Hadoop 1 
support? If so, can we keep the current version (old `mapred` API) for now, 
until the `NewHadoopRDD` migration happens in Spark? Then we don't have to 
maintain two copies of the same code.




