Sounds great!

On Tue, Oct 11, 2016 at 11:39 PM Cyrille Chépélov <[email protected]> wrote:
> Oscar, Piyush,
>
> thanks for the feedback!
>
> At the moment, I'm not sure it's realistic to break the "hadoop"
> dependency completely out of scalding-core. As an intermediate goal, I'd
> shoot for at least soft-removing the assumption that the *processing* is
> done on Hadoop, but the storage interface will pretty much remain HDFS for
> the time being (IOW, I'll leave Source essentially unchanged in
> scalding-core).
>
> Meanwhile, I'm taking the messages here and on the gitter channel as
> positive towards the principle of scalding-$FABRIC sub-modules, and will
> start working on that in the background.
>
> -- Cyrille
>
> On 12/10/2016 at 03:29, 'Oscar Boykin' via Scalding Development wrote:
>
> Generally, I think this is a good idea also (separate modules for
> fabrics).
>
> I agree that Mode and Job are a bit hairy in spots. I think we can remove
> some deprecated code if it makes life significantly easier, but source and
> binary compatibility should be kept as much as we can reasonably manage.
>
> I would actually really rather `buildFlow` be private[scalding] but maybe
> that is too much. Making it return a subclass of Flow seems like a fine
> idea to me at the moment.
>
> Breaking hadoop out of scalding-core seems pretty hard since `Source` has
> it baked in at a few spots. That said, the Source abstractions in scalding
> are not very great. If we could improve that (without removing support for
> the old stuff) it might be worth it. Many have complained about Source's
> design over the years, but we have not really had a full proposal that
> seems to address all the concerns.
>
> The desire for jobs to all look the same across all fabrics makes
> modularization a bit ugly.
>
> On Tue, Oct 11, 2016 at 2:23 PM 'Piyush Narang' via Scalding Development
> <[email protected]> wrote:
>
> We ran into similar problems while trying to set the number of reducers
> while testing out Cascading3 on Tez.
> We hacked around it temporarily
> <https://github.com/twitter/scalding/commit/57983601c7db4ef1e0df3350140d473f371e6bb3>
> but haven't yet cleaned up that code and put it out for review (we'll need
> to fork MR / Tez there as nodeConfigDef works for Tez but not Hadoop). Based
> on my understanding, so far we've tried to delegate as much of this to
> Cascading as we can but there seem to be a few places where we're doing
> some platform-specific stuff in Scalding. Breaking up to create
> fabric-specific sub-modules seems like a nice idea to me. We might need to
> think through the right way to do this to ensure we don't break stuff.
> Would it make sense to spin up an issue so we can discuss it there?
>
> On Tue, Oct 11, 2016 at 10:42 AM, Cyrille Chépélov <[email protected]> wrote:
>
> Hi,
>
> I'm trying to tie up a few loose ends in the way step descriptions (text
> typically passed via *.withDescriptions(...)*) and the desired level of
> parallelism (typically passed via *.withReducers(N)*) are pushed to the
> various fabrics.
>
> Right now:
>
> - Most of the scalding code base either ignores the back-end (good) or
>   assumes
>   <https://github.com/twitter/scalding/blob/7ed0f92a946ad8407645695d3def62324f78ac41/scalding-core/src/main/scala/com/twitter/scalding/ExecutionContext.scala#L81>
>   the world is either Local or HadoopFlow (which covers Hadoop 1.x and MR1).
>   As a consequence, a couple of things don't yet work smoothly on Tez and I
>   assume on Flink.
> - the descriptions are entirely dropped if not running on Hadoop1 or MR1
> - .withReducers sets a hadoop-specific property (*mapred.reduce.tasks*) at
>   RichPipe#L41
>   <https://github.com/twitter/scalding/blob/7ed0f92a946ad8407645695d3def62324f78ac41/scalding-core/src/main/scala/com/twitter/scalding/RichPipe.scala#L41>
> - the Tez fabric ignores .withReducers; and there is no other conduit
>   (for now) to set the number of desired parts on the sinks.
>   As a consequence, you can't run a tez DAG with a large level of
>   parallelism and a small (single) number of output files (e.g. stats
>   leading to a result file of a couple dozen lines); you must pick one and
>   set *cascading.flow.runtime.gather.partitions.num*. There are
>   workarounds, but they're quite ugly.
> - there are a few instances of "flow match { case HadoopFlow =>
>   doSomething ; case _ => () }" scattered around the code
> - there's some heavily reflection-based code in Mode.scala
>   <https://github.com/twitter/scalding/blob/7ed0f92a946ad8407645695d3def62324f78ac41/scalding-core/src/main/scala/com/twitter/scalding/Mode.scala#L75>
>   which depends on jars not part of the scalding build process (and it's
>   good that these jars stay out of the scalding-core build, e.g. Tez client
>   libraries)
> - While it may be desirable to experiment with scalding-specific
>   transform registries for cascading (e.g. to deal with the Merge-GroupBy
>   structure, or to perform tests/assertions on the resulting flow graph), it
>   would be impractical to perform the necessary fabric-specific adjustments
>   in Mode.scala as it is.
>
> I'm trying to find a way to extract the MR-isms and push them into
> fabric-specific code which can be called when appropriate.
>
> Questions:
>
> 1. Would it be appropriate to start having fabric-specific jars
>    (scalding-fabric-hadoop, scalding-fabric-hadoop2-mr1, scalding-fabric-tez
>    etc.) and push the fabric-specific code from Mode.scala there?
>
>    (we'd keep only the single scalding fabric-related factory using
>    reflection, with appropriate interfaces defined in scalding-core)
>
> 2. Pushing the fabric-specific code into dedicated jars would probably
>    have user-visible consequences, as we can't make scalding-core depend on
>    scalding-fabric-hadoop (for back-compatibility) unless the fabric-factory
>    interface goes into another jar.
>
>    From my point of view, intentionally breaking the build slightly, once,
>    upon upgrade, to let the world know that there are fabrics other than
>    MR1 might be acceptable; on the other hand, I haven't used MR1 for over
>    a year.
>
>    Is this "slight" dependency breakage acceptable, or is it better to
>    have scalding-core still imply the hadoop fabrics?
>
> 3. Right now, scalding's internals sometimes use Hadoop (MR) specifics
>    to carry various configuration values. Is it acceptable to (at least in
>    the beginning) continue doing so, kindly asking the respective non-hadoop
>    fabrics to pick these values up and convert them to the relevant APIs?
>
> 4. Is it okay to drop the @deprecated(..., "0.12.0") functions from
>    Mode.scala if they are inconvenient to carry over in the process?
>
> 5. Currently, Job.buildFlow
>    <https://github.com/twitter/scalding/blob/7ed0f92a946ad8407645695d3def62324f78ac41/scalding-core/src/main/scala/com/twitter/scalding/Job.scala#L223>
>    returns Flow[_]. Is it okay to have it return Flow[_] with
>    ScaldingFlowGoodies instead, ScaldingFlowGoodies being the provisional
>    name of the interface into which the old "flow match { case HadoopFlow =>
>    ... }" code would move?
>
> Thanks in advance
>
> -- Cyrille
>
> --
> You received this message because you are subscribed to the Google Groups
> "Scalding Development" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
>
> --
> - Piyush
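
For illustration only, the idea discussed in this thread (moving the "flow match { case HadoopFlow => ... }" branches behind a per-fabric interface) might be sketched roughly as follows. Every name here (FabricGoodies, HadoopGoodies, TezGoodies, FabricRegistry) is hypothetical and does not exist in scalding-core. The mapping of a reducer count onto the Tez property is an assumption drawn from the thread, and a static registry stands in for the reflection-based factory that the thread mentions.

```scala
// Hypothetical sketch; none of these types are part of scalding or cascading.

// What scalding-core would see: a fabric-neutral interface.
trait FabricGoodies {
  def name: String
  // Translate a desired level of parallelism into fabric-specific properties.
  def reducerConfig(n: Int): Map[String, String]
}

// Would live in a fabric-specific module such as scalding-fabric-hadoop.
object HadoopGoodies extends FabricGoodies {
  val name = "hadoop"
  def reducerConfig(n: Int): Map[String, String] =
    Map("mapred.reduce.tasks" -> n.toString)
}

// Would live in scalding-fabric-tez; the property choice is an assumption.
object TezGoodies extends FabricGoodies {
  val name = "tez"
  def reducerConfig(n: Int): Map[String, String] =
    Map("cascading.flow.runtime.gather.partitions.num" -> n.toString)
}

// Static stand-in for the reflection-based factory mentioned in the thread.
object FabricRegistry {
  private val fabrics: Map[String, FabricGoodies] =
    Seq(HadoopGoodies, TezGoodies).map(f => f.name -> f).toMap

  def lookup(name: String): Option[FabricGoodies] = fabrics.get(name)
}
```

With something of this shape, a call such as FabricRegistry.lookup("tez").map(_.reducerConfig(4)) would replace a pattern match on the concrete flow class, and an unknown fabric would simply yield None rather than silently dropping the setting.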
