This is pretty cool! Shall take a look at the WIP PR as well.

On Fri, Nov 4, 2016 at 12:57 PM, Cyrille Chépélov <[email protected]> wrote:
Thanks!

That's PR #1617 <https://github.com/twitter/scalding/pull/1617>

Of note is scalding-core-fabric-tests/src/fabric/scala/com/twitter/scalding/CoreTestFabric.scala <https://github.com/twitter/scalding/pull/1617/files#diff-8cbac1ca00d5b5882eb44637408c00ad>, which is really anything in scalding-core/src/test/scala/com/twitter/scalding/CoreTest.scala <https://github.com/twitter/scalding/pull/1617/files#diff-3e61f71d6907584e5a1857f7f925d0d8> that needed to .runHadoop.

The real fun starts in Mode.scala, Source.scala and FileSource.scala.

-- Cyrille

On 04/11/2016 at 19:58, Oscar Boykin wrote:

Thanks for taking on this project. I'm excited about it.

Can you go ahead and make a WIP PR so we can see what the diff looks like and start giving feedback?

I'll be reviewing the WIP PR carefully.

On Fri, Nov 4, 2016 at 8:43 AM Cyrille Chépélov <[email protected]> wrote:

Hi there,

Some progress on the "separation of fabrics" project.

TL;DR: I have a branch here, https://github.com/cchepelov/scalding/tree/split-fabrics, that is mostly working on *Hadoop*, *Hadoop2-MR1*, *and Tez*… and taking baby steps on *Flink*.

*The Good*

- scalding-core still depends on Hadoop for HDFS, but no longer has explicit dependencies on MAPREDUCE.
- One can switch between MAPREDUCE using the legacy hadoop1 API and MAPREDUCE using Cascading's hadoop2-mr1 fabric.
- Most tests run on all four available fabrics in addition to Local, that is: legacy Hadoop, Hadoop2-MR1, Tez, *and Flink*.
- Switching from one fabric to another is a matter of supplying the appropriate fabric jar (scalding-fabric-hadoop, scalding-fabric-tez, etc.) in your assembly (a build sketch follows after this list).
- Even the REPL seems to accept using a different fabric (!)
- Having an explicit per-fabric bit of code within Scalding enables experimentation with more advanced things, such as implementing scalding-specific Cascading planner graph transforms, as Chris advises.
- I *think* I didn't break any widely-used API at the source level.
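For illustration, picking a fabric from a user build could look like the following. The module names are the ones above; the coordinates and version are placeholders, nothing is published under them yet.

    // build.sbt sketch -- only the module names come from the split-fabrics
    // branch; organization and version strings are placeholders.
    val scaldingVersion = "0.17.0-SNAPSHOT" // hypothetical

    libraryDependencies ++= Seq(
      "com.twitter" %% "scalding-core" % scaldingVersion,
      // pick exactly one fabric module for your assembly:
      "com.twitter" %% "scalding-fabric-tez" % scaldingVersion
      // or "com.twitter" %% "scalding-fabric-hadoop"      % scaldingVersion
      // or "com.twitter" %% "scalding-fabric-hadoop2-mr1" % scaldingVersion
    )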
*The Bad*

- I *think* I didn't break any widely-used API at the source level, but I haven't (yet) checked whether any damage control should or can be done.
- A few tests still break on Tez. These are things I've lived with for a long time, but fixing them should be easier and a higher priority now. For now there seem to be two outstanding items left: 1. mapping .withReducers all the way down to the level of parallelism of the TezChild node in charge of performing that processing, and 2. perhaps a planner bug, or perhaps a missing scalding-specific planner transform, to handle jobs involving sketched joins *(that's on cascading 3.2-wip-6)*.
- Flink is not yet ready for prime time. At the moment, I'm building it against a local snapshot reflecting https://github.com/dataArtisans/cascading-flink/pull/70 — this is required as some of Cascading's internal interfaces have changed a bit since 3.1. Some of the tests are bound to fail for now, as cascading-flink cannot yet map some variants of hash joins (right outer hash joins, for instance).
- Mode.scala is a mess and will need a little bit of clean-up.
- There are still a couple of tests that are bound to fail: any test that was doing pattern matching on the exact type of Mode (Test vs. Hadoop vs. Local vs. HadoopTest) *will* fail, and there is no solution for that.
- Tez and Flink *tests* seem quite slow. I'm not yet sure what's happening; it seems some of the code is simply waiting, and only waking up long after a given test job is complete.

*The Ugly*

- Mode.scala is a mess and will *really* need a little bit of clean-up.
- We still need to compile scalding-core with a *provided* dependency on either cascading-hadoop or cascading-hadoop2-mr1. This is due to HadoopTap and friends (HDFS support). Ideally we could have a (perhaps hard?) dependency on cascading-hadoop2-io since everyone's using it (hadoop2-mr1, tez, flink), but we'd have to manage the case of cascading-hadoop (which brings almost identical copies but cannot, by trade, depend on cascading-hadoop2-io). I'm still slightly confused about the best course of action; I'd like things in scalding-core to actually not compile if they still accidentally depend on MAPREDUCE, but I'm unsure that's achievable as it stands.
- I've tried to move the fabric-sensitive tests from scalding-core into a pool of tests that is shared and verified against all fabrics: this is scalding-core-fabric-tests. Although ScalaTest's internal discovery is happy to run anything that looks like a test, the discovery module used by "sbt test" is different: it only looks at tests that are implemented within the current project, specifically ignoring tests inherited from dependencies. I failed to find a way to convince sbt to adopt ScalaTest's discovery pattern. As a result, I've moved the "shared" tests from scalding-core-fabric-tests into another subdirectory of src/, which is referenced by all four fabrics as their own (a sketch of that wiring follows after this list). As a consequence, this code is compiled four times, and IntelliJ can get confused and refuse to step into it. If there is an sbt guru around willing to give me a hand on this, I'd be really grateful.
- Making the counter implementation dependent on the fabric required passing a class *name* into fabric-specific properties, then using reflection to instantiate it.
- The smart tricks needed to make JobTest work and mock out taps, which can be LocalTaps or HadoopTaps pretty much at will.
- I couldn't really wrap my head around enough of this without actually digging in, rather than planning/designing first. Some documentation, and possibly a restart from scratch, might be needed after all.
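For the curious, the current test-sharing wiring is roughly the following; this is a sketch with illustrative project names (and an assumed scaldingCore project), only the src/fabric/scala layout matches the branch.

    // Sketch: each fabric project adds the shared pool as an extra Test source
    // directory, so the same suites are compiled and run once per fabric.
    lazy val scaldingFabricTez = Project("scalding-fabric-tez", file("scalding-fabric-tez"))
      .dependsOn(scaldingCore % "compile->compile;test->test") // scaldingCore assumed defined elsewhere
      .settings(
        unmanagedSourceDirectories in Test +=
          (baseDirectory in ThisBuild).value / "scalding-core-fabric-tests" / "src" / "fabric" / "scala"
      )

If sbt could be taught to use ScalaTest's own discovery, or to run suites inherited from a dependency, the extra source directory and the fourfold compilation would go away.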
Things I'm inclined to kick to "later": can we also abstract storage away from "necessarily HDFS"? Is that something useful?

On the other hand, as the (Storage Mode) x (Execution Mode) x (Data Scheme) support matrix can be daunting, it may be useful to keep assuming that everything is on HDFS unless it's on LocalTaps, which can sometimes be a HadoopTap-and-a-wrapper or the other way around.

Next steps: incorporate feedback, clean up, fix the outstanding issues in scalding-fabric-tez (fix Flink in due time), keep current with the develop/cascading3 branches, then figure out how to mainstream this (probably, indeed, breaking up what can be broken up into individual PRs, though I'm afraid there will still be one big atomic change at some point).

For now it's just a branch; would it make sense to open an "RFC only" PR to enable the review tools?

-- Cyrille

On 14/10/2016 at 22:47, 'Alex Levenson' via Scalding Development wrote:

This is a large enough change, which probably won't fit into a single PR, that it might merit some sort of design doc / written plan. That way we can come up with a plan and then start implementing it piece by piece across a few PRs.

On Wed, Oct 12, 2016 at 2:11 PM, 'Oscar Boykin' via Scalding Development <[email protected]> wrote:

sounds great!

On Tue, Oct 11, 2016 at 11:39 PM Cyrille Chépélov <[email protected]> wrote:

Oscar, Piyush,

thanks for the feedback!

At the moment, I'm not sure it's realistic to fully break the dependency on "hadoop" out of scalding-core. As an intermediate goal, I'd shoot for at least soft-removing the assumption that the *processing* is done on Hadoop, while the storage interface will pretty much remain HDFS for the time being (IOW, I'll leave Source essentially unchanged in scalding-core).

Meanwhile, I'm taking the messages here and on the gitter channel as positive towards the principle of scalding-$FABRIC sub-modules, and will start working on that in the background.

-- Cyrille

On 12/10/2016 at 03:29, 'Oscar Boykin' via Scalding Development wrote:

Generally, I think this is a good idea also (separate modules for fabrics).

I agree that Mode and Job are a bit hairy in spots. I think we can remove some deprecated code if it makes life significantly easier, but source and binary compatibility should be kept as much as we can reasonably manage.

I would actually really rather `buildFlow` be private[scalding], but maybe that is too much. Making it return a subclass of Flow seems like a fine idea to me at the moment.

Breaking hadoop out of scalding-core seems pretty hard since `Source` has it baked in at a few spots. That said, the Source abstractions in scalding are not very great. If we could improve that (without removing support for the old stuff) it might be worth it. Many have complained about Source's design over the years, but we have not really had a full proposal that seems to address all the concerns.

The desire for jobs to all look the same across all fabrics makes modularization a bit ugly.

On Tue, Oct 11, 2016 at 2:23 PM 'Piyush Narang' via Scalding Development <[email protected]> wrote:

We ran into similar problems while trying to set the number of reducers while testing out Cascading3 on Tez. We hacked around it temporarily <https://github.com/twitter/scalding/commit/57983601c7db4ef1e0df3350140d473f371e6bb3> but haven't yet cleaned up that code and put it out for review (we'll need to fork MR / Tez there, as nodeConfigDef works for Tez but not Hadoop). Based on my understanding, so far we've tried to delegate as much of this to Cascading as we can, but there seem to be a few places where we're doing some platform-specific stuff in Scalding. Breaking things up to create fabric-specific sub-modules seems like a nice idea to me. We might need to think through the right way to do this to ensure we don't break stuff. Would it make sense to spin up an issue so we can discuss it there?
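For reference, the kind of MR/Tez fork being described might look roughly like this. It is a sketch rather than the linked commit: it assumes Cascading 3's step- and node-level ConfigDef APIs as I understand them, and the property keys are the ones discussed elsewhere in this thread.

    import cascading.pipe.Pipe
    import cascading.property.ConfigDef

    // Sketch only: route the reducer hint through whichever conduit the fabric
    // honours. Classic MR reads the step-level mapred.reduce.tasks property,
    // while the Tez planner takes its gather parallelism from node-level config.
    def setReducerHint(pipe: Pipe, reducers: Int, onTez: Boolean): Pipe = {
      if (onTez)
        pipe.getNodeConfigDef.setProperty(ConfigDef.Mode.REPLACE,
          "cascading.flow.runtime.gather.partitions.num", reducers.toString)
      else
        pipe.getStepConfigDef.setProperty(ConfigDef.Mode.REPLACE,
          "mapred.reduce.tasks", reducers.toString)
      pipe
    }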
On Tue, Oct 11, 2016 at 10:42 AM, Cyrille Chépélov <[email protected]> wrote:

Hi,

I'm trying to tie up a few loose ends in the way step descriptions (text typically passed via *.withDescriptions(...)*) and the desired level of parallelism (typically passed via *.withReducers(N)*) are pushed onto the various fabrics.

Right now:

- Most of the scalding code base either ignores the back-end (good) or assumes <https://github.com/twitter/scalding/blob/7ed0f92a946ad8407645695d3def62324f78ac41/scalding-core/src/main/scala/com/twitter/scalding/ExecutionContext.scala#L81> the world is either Local or HadoopFlow (which covers Hadoop 1.x and MR1). As a consequence, a couple of things don't yet work smoothly on Tez, and I assume on Flink as well.
- The descriptions are entirely dropped if not running on Hadoop1 or MR1.
- .withReducers sets a hadoop-specific property (*mapred.reduce.tasks*) at RichPipe#L41 <https://github.com/twitter/scalding/blob/7ed0f92a946ad8407645695d3def62324f78ac41/scalding-core/src/main/scala/com/twitter/scalding/RichPipe.scala#L41>.
- The Tez fabric ignores .withReducers, and there is no other conduit (for now) to set the number of desired parts on the sinks. As a consequence, you can't run a Tez DAG with a large level of parallelism and a small (single) number of output files (e.g. stats leading to a result file of a couple dozen lines); you must pick one and set *cascading.flow.runtime.gather.partitions.num*. There are workarounds, but they're quite ugly.
- There are a few instances of "flow match { case HadoopFlow => doSomething ; case _ => () }" scattered around the code (see the sketch just after this list).
- There's some heavily reflection-based code in Mode.scala <https://github.com/twitter/scalding/blob/7ed0f92a946ad8407645695d3def62324f78ac41/scalding-core/src/main/scala/com/twitter/scalding/Mode.scala#L75> which depends on jars that are not part of the scalding build process (and it's good that these jars stay out of the scalding-core build, e.g. Tez client libraries).
- While it may be desirable to experiment with scalding-specific transform registries for Cascading (e.g. to deal with the Merge-GroupBy structure, or to perform tests/assertions on the resulting flow graph), it would be impractical to perform the necessary fabric-specific adjustments in Mode.scala as it stands.
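To make the shape of the problem concrete, the scattered pattern referred to above is essentially the following; this is a paraphrase, not the actual scalding source, and the method name is made up for illustration.

    import cascading.flow.Flow
    import cascading.flow.hadoop.HadoopFlow

    // Paraphrase of the MR-centric gate: behaviour is keyed on the concrete
    // Flow subclass, so anything that is not a HadoopFlow (Tez, Flink, ...)
    // silently falls through to the no-op branch -- which is exactly how the
    // .withDescriptions(...) text gets dropped on the other fabrics.
    def applyDescriptions(flow: Flow[_], descriptions: Seq[String]): Unit =
      flow match {
        case hadoopFlow: HadoopFlow =>
          // MR-only path: push the descriptions into each step's JobConf here
          ()
        case _ =>
          () // any other fabric: the descriptions are simply lost
      }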
I'm trying to find a way to abstract away the MR-isms and push them into fabric-specific code which can be called when appropriate.

Questions:

1. Would it be appropriate to start having fabric-specific jars (scalding-fabric-hadoop, scalding-fabric-hadoop2-mr1, scalding-fabric-tez, etc.) and push the fabric-specific code from Mode.scala there? (We'd keep only the single scalding fabric-related factory using reflection, with the appropriate interfaces defined in scalding-core.)

2. Pushing the fabric-specific code into dedicated jars would probably have user-visible consequences, as we can't make scalding-core depend on scalding-fabric-hadoop (for back-compatibility) unless the fabric-factory interface goes into another jar. From my point of view, intentionally (and slightly) breaking the build once upon upgrade, for the purpose of letting the world know that there are other fabrics than MR1, might be acceptable; on the other hand, I haven't used MR1 for over a year. Is this "slight" dependency breakage acceptable, or is it better to have scalding-core still imply the hadoop fabrics?

3. Right now, scalding's internals sometimes use Hadoop (MR) specifics to carry various configuration values. Is it acceptable to (at least in the beginning) continue doing so, kindly asking the respective non-hadoop fabrics to pick these values up and convert them to the relevant APIs?

4. Is it okay to drop the @deprecated(..., "0.12.0") functions from Mode.scala if they are inconvenient to carry over in the process?

5. Currently, Job.buildFlow <https://github.com/twitter/scalding/blob/7ed0f92a946ad8407645695d3def62324f78ac41/scalding-core/src/main/scala/com/twitter/scalding/Job.scala#L223> returns Flow[_]. Is it okay to have it return Flow[_] with ScaldingFlowGoodies instead, ScaldingFlowGoodies being the provisional name of the interface into which the old "flow match { case HadoopFlow => ... }" code would move?
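For illustration, such an interface could be as small as the following; only the name ScaldingFlowGoodies comes from the question above, the member names are made up.

    // Purely illustrative sketch: the fabric-specific code currently hidden
    // behind "flow match { case HadoopFlow => ... }" would move behind a small
    // trait that each scalding-fabric-* module mixes into the Flow it builds,
    // so Job.buildFlow could return Flow[_] with ScaldingFlowGoodies.
    trait ScaldingFlowGoodies {
      /** Attach .withDescriptions(...) text in whatever way this fabric supports. */
      def setFlowDescriptions(descriptions: Seq[String]): Unit

      /** Map .withReducers(n) onto this fabric's notion of sink parallelism. */
      def setDesiredParallelism(reducers: Int): Unit
    }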
Thanks in advance

-- Cyrille
