This is pretty cool! Shall take a look at the WIP PR as well.

On Fri, Nov 4, 2016 at 12:57 PM, Cyrille Chépélov <
[email protected]> wrote:

> Thanks!
>
> That's PR #1617: https://github.com/twitter/scalding/pull/1617
>
> Of note is
> scalding-core-fabric-tests/src/fabric/scala/com/twitter/scalding/CoreTestFabric.scala
> <https://github.com/twitter/scalding/pull/1617/files#diff-8cbac1ca00d5b5882eb44637408c00ad>,
> which is essentially everything in
> scalding-core/src/test/scala/com/twitter/scalding/CoreTest.scala
> <https://github.com/twitter/scalding/pull/1617/files#diff-3e61f71d6907584e5a1857f7f925d0d8>
> that needed to .runHadoop.
>
> The real fun starts in Mode.scala, Source.scala and FileSource.scala
>
>     -- Cyrille
>
>
> Le 04/11/2016 à 19:58, Oscar Boykin a écrit :
>
> Thanks for taking on this project. I'm excited about it.
>
> Can you go ahead and make a WIP PR so we can see what the diff looks like
> and start giving feedback?
>
> I'll be reviewing the WIP PR carefully.
>
> On Fri, Nov 4, 2016 at 8:43 AM Cyrille Chépélov <
> [email protected]> wrote:
>
>> Hi there,
>>
>> Some progress on the "separation of fabrics" project:
>>
>>     TL;DR: I have a branch here,
>> https://github.com/cchepelov/scalding/tree/split-fabrics, that is mostly
>> working on *Hadoop*, *Hadoop2-MR1* and *Tez*, … with baby steps on *Flink*.
>>
>>
>> *The Good *
>>
>>    - scalding-core has dependencies on Hadoop for HDFS, but no longer
>>    has explicit dependencies on MAPREDUCE
>>    - One can switch between MAPREDUCE via the legacy hadoop1 API and
>>    MAPREDUCE via Cascading's hadoop2-mr1 fabric
>>    - Most tests run on all four available fabrics in addition to Local.
>>    That is: Legacy Hadoop, Hadoop2-MR1, Tez, *and Flink*.
>>    - Switching from one fabric to another is a matter of supplying the
>>    appropriate fabric jar (scalding-fabric-hadoop, scalding-fabric-tez,
>>    etc.) in your assembly (see the build sketch after this list)
>>    - Even the REPL seems to accept using a different fabric (!)
>>    - Having an explicit per-fabric bit of code within Scalding enables
>>    experimentation with more advanced things, such as implementing
>>    scalding-specific Cascading planner graph transforms, as Chris advises.
>>    - I *think* I didn't break any widely-used API at the source level.
>>
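>> To make the fabric-switching point concrete, here is roughly what a user's
>> build could look like (a sketch only; the exact artifact names and the
>> version value are assumptions, not final):
>>
>>     // build.sbt of a user project (hypothetical artifact names/version)
>>     val scaldingVersion = "0.17.0-SNAPSHOT" // whatever this branch ships as
>>     libraryDependencies ++= Seq(
>>       "com.twitter" %% "scalding-core"       % scaldingVersion,
>>       // pick exactly one fabric jar for the target cluster:
>>       "com.twitter" %% "scalding-fabric-tez" % scaldingVersion
>>       // or "scalding-fabric-hadoop", or "scalding-fabric-hadoop2-mr1"
>>     )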
>>
>> *The Bad *
>>
>>    - I *think* I didn't break any widely-used API at the source level,
>>    but I haven't (yet) checked if any damage control should/can be done
>>    - A few tests still break on Tez. These are things I've lived with for
>>    a long time, but fixing them should be easier and a higher priority now.
>>    For now there seem to be two outstanding items left: 1. mapping
>>    .withReducers all the way down to the level of parallelism of the
>>    TezChild node in charge of performing that processing, and 2. perhaps a
>>    planner bug, or perhaps a missing scalding-specific planner transform,
>>    to handle jobs involving sketched joins *(that's on cascading 3.2-wip-6)*
>>    - Flink is not yet ready for prime time. At the moment, I'm building it
>>    using a local snapshot reflecting
>>    https://github.com/dataArtisans/cascading-flink/pull/70; this is
>>    required because some of Cascading's internal interfaces have changed a
>>    bit since 3.1.
>>    Some of the tests are bound to fail for now, as cascading-flink cannot
>>    yet map some variants of hash joins (right outer hash joins, for
>>    instance).
>>    - Mode.scala is a mess and will need a little bit of clean-up
>>    - There are still a couple of tests that are bound to fail
>>    - Any test that was doing pattern matching on the exact type of Mode
>>    (Test vs. Hadoop vs. Local vs. HadoopTest) *will* fail, and there is
>>    no solution
>>    - Tez and Flink *tests* seem quite slow. I'm not yet sure what's
>>    happening; it seems some of the code is simply waiting and only waking
>>    up long after a given test job is complete.
>>
>> *The Ugly*
>>
>>    - Mode.scala is a mess and will *really* need a little bit of clean-up
>>
>>    - we still need to compile scalding-core with a *provided* dependency
>>    on either cascading-hadoop or cascading-hadoop2-mr1. This is due to
>>    HadoopTap and friends (HDFS support). Ideally we could have a (perhaps
>>    hard?) dependency on cascading-hadoop2-io since everyone's using it
>>    (hadoop2-mr1, tez, flink), but we'd have to manage the case of
>>    cascading-hadoop (which ships almost identical copies of those classes
>>    but cannot, by design, depend on cascading-hadoop2-io). I'm still
>>    slightly unsure of the best course of action; I'd like things in
>>    scalding-core to actually fail to compile if they still accidentally
>>    depend on MAPREDUCE, but I'm unsure that's achievable as it is.
>>
>>    - I've tried to share the fabric-sensitive tests from scalding-core
>>    into a pool of tests that is shared and verified with all fabrics: this is
>>    scalding-core-fabric-tests
>>
>>    Although Scalatest's internal discovery seems to be happy with
>>    running anything that looks like a test, the discovery module used by "sbt
>>    test" is different. It only looks at tests that are implemented within the
>>    current project, specifically ignoring tests inherited from dependencies.
>>
>>    I failed to find a way to convince sbt to adopt Scalatest's discovery
>>    pattern. As a result, I've moved the "shared" tests from
>>    scalding-core-fabric-tests into another subdirectory of src/, which is
>>    referenced by all four fabrics as their own. Consequently, this code is
>>    compiled four times, and IntelliJ can get confused and refuse to step
>>    into it (see the sbt sketch after this list).
>>
>>    If there is an sbt guru around willing to give me a hand on this, I'd
>>    be really grateful.
>>
>>    - Making the counter implementation dependent on the fabric required
>>    passing a class *name* into fabric-specific properties, then using
>>    reflection to instantiate it (a rough sketch follows after this list).
>>    - The smart tricks needed to make JobTest work and mock out taps,
>>    which can be LocalTap or HadoopTap pretty much at will
>>    - I couldn't really wrap my head around enough of this without
>>    actually digging in, rather than planning/designing first. Some
>>    documentation and possibly a restart from scratch might be needed after 
>> all.
>>
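>> For the sbt point above, this is roughly the kind of wiring I ended up
>> with (illustrative only; the project and directory names below are
>> assumptions, not the exact ones in the branch):
>>
>>     // build.sbt sketch: each fabric module compiles the shared test
>>     // sources itself, since "sbt test" only discovers tests defined in
>>     // the current project.
>>     lazy val sharedFabricTests =
>>       file("scalding-core-fabric-tests/src/fabric/scala")
>>
>>     lazy val scaldingFabricTez = (project in file("scalding-fabric-tez"))
>>       .settings(
>>         unmanagedSourceDirectories in Test += sharedFabricTests
>>       )
>>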
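>> And for the counter point, the shape of the reflection trick is roughly
>> this (CounterImpl and the property key are illustrative names, not the
>> actual ones in the branch):
>>
>>     // Hypothetical sketch: the fabric jar puts its implementation's class
>>     // name into the configuration; scalding-core instantiates it by
>>     // reflection, so it never needs a compile-time dependency on the
>>     // fabric-specific class.
>>     trait CounterImpl {
>>       def increment(group: String, counter: String, by: Long): Unit
>>     }
>>
>>     def counterFor(conf: Map[String, String]): CounterImpl = {
>>       val className = conf.getOrElse("scalding.counter.impl.class",
>>         sys.error("no counter implementation configured"))
>>       Class.forName(className).newInstance().asInstanceOf[CounterImpl]
>>     }
>>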
>> Things I'm inclined to kick to "later": can we also abstract out storage
>> from "necessarily HDFS"? Is that something useful?
>>
>> On the other hand, since the (Storage Mode) x (Execution Mode) x (Data
>> Scheme) support matrix can be daunting, it can be useful to keep assuming
>> that everything is HDFS unless it's on LocalTaps, which can sometimes be a
>> HadoopTap-and-a-wrapper or the other way around.
>>
>> Next steps: incorporate feedback, clean up, fix outstanding issues in
>> scalding-fabric-tez (fix Flink in due time), keep current with the
>> develop/cascading3 branches, then figure out how to mainstream that
>> (probably, indeed, breaking up what can be broken up into individual PRs,
>> but I'm afraid there will still be a big atomic change of something at one
>> point).
>>
>> For now that's just a branch; would it make sense to open an "RFC only"
>> PR to enable the review tools?
>>
>>     -- Cyrille
>>
>>
>> Le 14/10/2016 à 22:47, 'Alex Levenson' via Scalding Development a écrit :
>>
>> This is a large enough change (one that probably won't fit into a single
>> PR) that it might merit some sort of design doc / written plan. That way we
>> can come up with a plan and then start implementing it piece by piece
>> across a few PRs.
>>
>> On Wed, Oct 12, 2016 at 2:11 PM, 'Oscar Boykin' via Scalding Development
>> <[email protected]> wrote:
>>
>> sounds great!
>>
>> On Tue, Oct 11, 2016 at 11:39 PM Cyrille Chépélov <
>> [email protected]> wrote:
>>
>> Oscar, Piyush,
>>
>> thanks for the feedback!
>>
>> At the moment, I'm not sure it's realistic to break the dependency on
>> "hadoop" completely out of scalding-core. As an intermediate goal, I'd
>> shoot for at least soft-removing the assumption that the *processing* is
>> done on Hadoop, but the storage interface will pretty much remain HDFS for
>> the time being (IOW, I'll leave Source essentially unchanged in
>> scalding-core).
>>
>> Meanwhile, I'm taking the messages here and on the gitter channel as
>> positive towards the principle of scalding-$FABRIC sub-modules, and will
>> start working on that in the background.
>>
>>
>>     -- Cyrille
>>
>>
>> Le 12/10/2016 à 03:29, 'Oscar Boykin' via Scalding Development a écrit :
>>
>> Generally, I think this is a good idea also (separate modules for
>> fabrics).
>>
>> I agree that Mode and Job are a bit hairy in spots. I think we can remove
>> some deprecated code if it makes life significantly easier, but source and
>> binary compatibility should be kept as much as we can reasonably manage.
>>
>> I would actually really rather `buildFlow` be private[scalding] but maybe
>> that is too much. Making it return a subclass of Flow seems like a fine
>> idea to me at the moment.
>>
>> Breaking hadoop out of scalding-core seems pretty hard since `Source` has
>> it baked in at a few spots. That said, the Source abstractions in scalding
>> are not very great. If we could improve that (without removing support for
>> the old stuff) it might be worth it. Many have complained about Source's
>> design over the years, but we have not really had a full proposal that
>> seems to address all the concerns.
>>
>> The desire for jobs to all look the same across all fabrics makes
>> modularization a bit ugly.
>>
>> On Tue, Oct 11, 2016 at 2:23 PM 'Piyush Narang' via Scalding Development <
>> [email protected]> wrote:
>>
>> We ran into similar problems while trying to set the number of reducers
>> when testing out Cascading3 on Tez. We hacked around it temporarily
>> <https://github.com/twitter/scalding/commit/57983601c7db4ef1e0df3350140d473f371e6bb3>
>> but haven't yet cleaned up that code and put it out for review (we'll need
>> to fork MR / Tez there, as nodeConfigDef works for Tez but not Hadoop).
>> Based on my understanding, so far we've tried to delegate as much of this
>> to Cascading as we can, but there seem to be a few places where we're doing
>> some platform-specific stuff in Scalding. Breaking things up into
>> fabric-specific sub-modules seems like a nice idea to me. We might need to
>> think through the right way to do this to ensure we don't break stuff.
>> Would it make sense to spin up an issue so we can discuss it there?
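>>
>> For reference, the shape of that hack is roughly the following (a sketch
>> only; the exact Cascading calls and the property keys are my assumptions
>> about what the linked commit does, not a quote of it):
>>
>>     import cascading.pipe.Pipe
>>
>>     // Illustrative only: classic MR reads reducer counts from the
>>     // step-level ConfigDef, whereas the Cascading 3 Tez planner looks at
>>     // node-level settings, so the two fabrics need different treatment.
>>     def setReducers(p: Pipe, n: Int, onTez: Boolean): Pipe = {
>>       if (onTez)
>>         p.getNodeConfigDef.setProperty(
>>           "cascading.flow.runtime.gather.partitions.num", n.toString) // assumed key
>>       else
>>         p.getStepConfigDef.setProperty("mapred.reduce.tasks", n.toString)
>>       p
>>     }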
>>
>> On Tue, Oct 11, 2016 at 10:42 AM, Cyrille Chépélov <
>> [email protected]> wrote:
>>
>> Hi,
>>
>> I'm trying to tie up a few loose ends in the way step descriptions (text
>> typically passed via *.withDescriptions(...)*) and the desired level of
>> parallelism (typically passed via *.withReducers(N)*) are pushed to the
>> various fabrics.
>>
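>> As a concrete example, this is the kind of job I have in mind (a minimal,
>> hypothetical sketch using the typed API; the parallelism hint, description
>> text, and exact method names such as withDescription may differ):
>>
>>     import com.twitter.scalding._
>>
>>     class WordCount(args: Args) extends Job(args) {
>>       TypedPipe.from(TextLine(args("input")))
>>         .flatMap(_.split("\\s+"))
>>         .map(w => (w, 1L))
>>         .sumByKey
>>         .withReducers(32)              // desired parallelism for this reduce step
>>         .toTypedPipe
>>         .withDescription("word count") // free-text description of the step
>>         .write(TypedTsv[(String, Long)](args("output")))
>>     }
>>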
>> Right now:
>>
>>    - Most of the scalding code base either ignores the back-end (good)
>>    or assumes
>>    
>> <https://github.com/twitter/scalding/blob/7ed0f92a946ad8407645695d3def62324f78ac41/scalding-core/src/main/scala/com/twitter/scalding/ExecutionContext.scala#L81>
>>    the world is either Local or HadoopFlow (which covers Hadoop 1.x and MR1).
>>    As a consequence, a couple of things don't yet work smoothly on Tez
>>    and, I assume, on Flink.
>>    - the descriptions are entirely dropped if not running on Hadoop1 or
>>    MR1
>>    - .withReducers sets a hadoop-specific property (*mapred*.*reduce*.
>>    *tasks*) at RichPipe#L41
>>    
>> <https://github.com/twitter/scalding/blob/7ed0f92a946ad8407645695d3def62324f78ac41/scalding-core/src/main/scala/com/twitter/scalding/RichPipe.scala#L41>
>>    - the Tez fabric ignores .withReducers, and there is no other conduit
>>    (for now) to set the desired number of parts on the sinks. As a
>>    consequence, you can't run a Tez DAG with a large level of parallelism
>>    and a small (single) number of output files (e.g. stats leading to a
>>    result file of a couple dozen lines); you must pick one and set
>>    *cascading.flow.runtime.gather.partitions.num*. There are
>>    workarounds, but they're quite ugly.
>>    - there are a few instances of "flow match { case HadoopFlow =>
>>    doSomething ; case _ => () }" scattered around the code
>>    - there's some heavily reflection-based code in Mode.scala
>>    
>> <https://github.com/twitter/scalding/blob/7ed0f92a946ad8407645695d3def62324f78ac41/scalding-core/src/main/scala/com/twitter/scalding/Mode.scala#L75>
>>    which depends on jars not part of the scalding build process (and it's 
>> good
>>    that these jars stay out of the scalding-core build, e.g. Tez client
>>    libraries)
>>    - While it may be desirable to experiment with scalding-specific
>>    transform registries for cascading (e.g. to deal with the Merge-GroupBy
>>    structure, or to perform tests/assertions on the resulting flow graph), it
>>    would be impractical to perform the necessary fabric-specific adjustments
>>    in Mode.scala as it is.
>>
>> I'm trying to find a way to extract away the MR-isms and push them into
>> fabric-specific code that can be called when appropriate.
>>
>> Questions:
>>
>>    1. Would it be appropriate to start having fabric-specific jars
>>    (scalding-fabric-hadoop, scalding-fabric-hadoop2-mr1, scalding-fabric-tez,
>>    etc.) and push the fabric-specific code from Mode.scala there?
>>
>>    (we'd keep only a single fabric factory in Scalding, using reflection,
>>    with the appropriate interfaces defined in scalding-core)
>>
>>    2. Pushing the fabric-specific code into dedicated jars would
>>    probably have user-visible consequences, as we can't make scalding-core
>>    depend on scalding-fabric-hadoop (for back-compatibility) unless the
>>    fabric-factory interface goes into another jar.
>>
>>    From my point of view, intentionally breaking the build slightly once
>>    upon upgrade, in order to let the world know that there are fabrics
>>    other than MR1, might be acceptable; on the other hand, I haven't used
>>    MR1 for over a year.
>>
>>    Is this "slight" dependency breakage acceptable, or is it better to
>>    have scalding-core still imply the hadoop fabrics?
>>
>>    3. Right now, scalding's internals sometimes use Hadoop (MR)
>>    specifics to carry various configuration values. Is it acceptable to (at
>>    least in the beginning) continue doing so, kindly asking the respective
>>    non-hadoop fabrics to pick these values up and convert them to the
>>    relevant APIs?
>>
>>    4. Is it okay to drop the @deprecated(..., "0.12.0") functions from
>>    Mode.scala if they are inconvenient to carry over in the process?
>>
>>    5. Currently, Job.buildFlow
>>    <https://github.com/twitter/scalding/blob/7ed0f92a946ad8407645695d3def62324f78ac41/scalding-core/src/main/scala/com/twitter/scalding/Job.scala#L223>
>>    returns Flow[_]. Is it okay to have it return Flow[_] with
>>    ScaldingFlowGoodies instead, ScaldingFlowGoodies being the provisional
>>    name of the interface into which to move the old "flow match { case
>>    HadoopFlow => ... }" code? (A rough sketch follows below.)
>>
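>> To make questions 1 and 5 more concrete, here is the rough shape I have in
>> mind (everything below, including the method names, is a provisional
>> assumption rather than a proposed final API):
>>
>>     // Would live in scalding-core; each scalding-fabric-* jar provides a
>>     // Flow wrapper mixing this in, so the old
>>     // "flow match { case HadoopFlow => ... }" code becomes a method call.
>>     trait ScaldingFlowGoodies {
>>       /** Attach free-text step descriptions in a fabric-appropriate way. */
>>       def setDescriptions(descriptions: Seq[String]): Unit
>>       /** Translate .withReducers(n) into this fabric's notion of parallelism. */
>>       def setReducerHint(reducers: Int): Unit
>>     }
>>
>>     // Job.buildFlow would then return Flow[_] with ScaldingFlowGoodies,
>>     // and the single reflection-based factory kept in scalding-core would
>>     // decide which fabric's implementation to build.
>>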
>> Thanks in advance
>>
>>     -- Cyrille
>>
>>
>> --
>> - Piyush
>>
>>
>>
>>
>> --
>> Alex Levenson
>> @THISWILLWORK
>>
>>
>>
>
>



-- 
- Piyush
