Oscar, Piyush,

thanks for the feedback!

At the moment, I'm not sure it's realistic to break the dependency on "hadoop" completely out of scalding-core. As an intermediate goal, I'd shoot for at least soft-removing the assumption that the /processing/ happens on Hadoop; the storage interface will pretty much remain HDFS for the time being (IOW, I'll leave Source essentially unchanged in scalding-core).

Meanwhile, I'm taking the messages here and on the gitter channel as supportive of the principle of scalding-$FABRIC sub-modules, and will start working on that in the background.

    -- Cyrille

On 12/10/2016 at 03:29, 'Oscar Boykin' via Scalding Development wrote:
Generally, I think this is a good idea also (separate modules for fabrics).

I agree that Mode and Job are a bit hairy in spots. I think we can remove some deprecated code if it makes life significantly easier, but source and binary compatibility should be kept as much as we can reasonably manage.

I would actually rather `buildFlow` were private[scalding], but maybe that is too much. Making it return a subclass of Flow seems like a fine idea to me at the moment.

Breaking hadoop out of scalding-core seems pretty hard, since `Source` has it baked in at a few spots. That said, the Source abstractions in scalding are not great. If we could improve them (without removing support for the old stuff), it might be worth it. Many have complained about Source's design over the years, but we have never had a full proposal that addresses all the concerns.
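
(For reference, the baked-in spots look roughly like this -- a paraphrase of scalding-core with signatures abbreviated, and the class name here is mine, not scalding's:)

    import cascading.scheme.Scheme
    import cascading.tap.Tap
    import com.twitter.scalding.{ AccessMode, Mode }
    import org.apache.hadoop.mapred.{ JobConf, OutputCollector, RecordReader }

    // The scheme type parameters name MR classes directly, so any
    // alternative storage or execution fabric has to pretend to be
    // HDFS/MR at the type level.
    abstract class HadoopBoundSource {
      def hdfsScheme: Scheme[JobConf, RecordReader[_, _], OutputCollector[_, _], _, _]
      def createTap(readOrWrite: AccessMode)(implicit mode: Mode): Tap[_, _, _]
    }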

The desire for jobs to look the same across all fabrics makes modularization a bit ugly.

On Tue, Oct 11, 2016 at 2:23 PM 'Piyush Narang' via Scalding Development <scalding-dev@googlegroups.com> wrote:

    We ran into similar problems while trying to set the number of
    reducers when testing Cascading 3 on Tez. We hacked around it
    temporarily
    (https://github.com/twitter/scalding/commit/57983601c7db4ef1e0df3350140d473f371e6bb3)
    but haven't yet cleaned up that code and put it out for review
    (we'll need to fork MR / Tez there, as nodeConfigDef works for Tez
    but not Hadoop). Based on my understanding, so far we've tried to
    delegate as much of this to Cascading as we can, but there still
    seem to be a few places where we do platform-specific stuff in
    Scalding. Breaking things up into fabric-specific sub-modules seems
    like a nice idea to me. We might need to think through the right
    way to do it so we don't break stuff. Would it make sense to spin
    up an issue so we can discuss there?
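
    (The hack is roughly of the following shape -- a hedged sketch from
    memory, not the linked commit, and the property names are
    illustrative:)

        import cascading.pipe.Pipe

        // Set the reducer hint at both levels: the MR planner honours
        // the step-level ConfigDef, while the Tez planner reads the
        // node-level one.
        def setParallelism(pipe: Pipe, reducers: Int): Pipe = {
          pipe.getStepConfigDef.setProperty("mapred.reduce.tasks", reducers.toString)
          pipe.getNodeConfigDef.setProperty(
            "cascading.flow.runtime.gather.partitions.num", reducers.toString)
          pipe
        }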

    On Tue, Oct 11, 2016 at 10:42 AM, Cyrille Chépélov
    <c...@transparencyrights.com> wrote:

        Hi,

        I'm trying to tie up a few loose ends in the way step
        descriptions (text typically passed via
        *.withDescriptions(...)*) and the desired level of parallelism
        (typically passed via *.withReducers(N)*) are pushed down to
        the various fabrics.
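
        (For context, these two knobs appear in user code roughly as
        below -- a sketch against the typed API; exact method
        availability varies a little between scalding versions.)

            import com.twitter.scalding._

            // Word count with an explicit reducer hint and a step
            // description. Today the hint lands in the MR property
            // "mapred.reduce.tasks", and the description only surfaces
            // on the Hadoop1/MR1 fabrics.
            class WordCount(args: Args) extends Job(args) {
              TypedPipe.from(TextLine(args("input")))
                .flatMap(_.split("\\s+"))
                .map(word => (word, 1L))
                .sumByKey
                .withReducers(50)
                .withDescription("word count: final reduce")
                .write(TypedTsv[(String, Long)](args("output")))
            }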

        Right now:

          * Most of the scalding code base either ignores the back-end
            (good) or assumes
            (https://github.com/twitter/scalding/blob/7ed0f92a946ad8407645695d3def62324f78ac41/scalding-core/src/main/scala/com/twitter/scalding/ExecutionContext.scala#L81)
            that the world is either Local or HadoopFlow (which covers
            Hadoop 1.x and MR1). As a consequence, a couple of things
            don't yet work smoothly on Tez, and I assume the same holds
            on Flink.
          * the descriptions are entirely dropped if not running on
            Hadoop1 or MR1
          * .withReducers sets a Hadoop-specific property
            (*mapred.reduce.tasks*) at RichPipe#L41
            (https://github.com/twitter/scalding/blob/7ed0f92a946ad8407645695d3def62324f78ac41/scalding-core/src/main/scala/com/twitter/scalding/RichPipe.scala#L41)
          * the Tez fabric ignores .withReducers, and there is no other
            conduit (for now) to set the desired number of parts on the
            sinks. As a consequence, you can't run a Tez DAG with a
            large level of parallelism and a small (single) number of
            output files (e.g. stats leading to a result file of a
            couple dozen lines); you must pick one and set
            *cascading.flow.runtime.gather.partitions.num*. There are
            workarounds, but they're quite ugly.
          * there are a few instances of "flow match { case HadoopFlow
            => doSomething ; case _ => () }" scattered around the code
            (see the sketch after this list)
          * there's some heavily reflection-based code in Mode.scala
            (https://github.com/twitter/scalding/blob/7ed0f92a946ad8407645695d3def62324f78ac41/scalding-core/src/main/scala/com/twitter/scalding/Mode.scala#L75)
            which depends on jars that are not part of the scalding
            build process (and it's good that these jars stay out of
            the scalding-core build, e.g. the Tez client libraries)
          * While it may be desirable to experiment with
            scalding-specific transform registries for cascading (e.g.
            to deal with the Merge-GroupBy structure, or to perform
            tests/assertions on the resulting flow graph), it would be
            impractical to perform the necessary fabric-specific
            adjustments in Mode.scala as it stands.
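
        (The "flow match" pattern referenced above is roughly the
        following -- a paraphrase, not the exact scalding source:)

            import cascading.flow.Flow
            import cascading.flow.hadoop.HadoopFlow

            // Behaviour keyed on the concrete Flow subclass: every
            // fabric other than Hadoop1/MR1 silently falls into the
            // no-op case, which is how the descriptions get dropped.
            def updateStepConfigWithDescriptions(flow: Flow[_]): Unit =
              flow match {
                case hadoopFlow: HadoopFlow =>
                  () // push the .withDescriptions(...) text into each step's JobConf
                case _ =>
                  () // Tez, Flink, Local: descriptions are lost
              }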

        I'm trying to find a way to extract the MR-isms and push them
        into fabric-specific code that can be called when appropriate.

        Questions:

         1. Would it be appropriate to start having fabric-specific
            jars (scalding-fabric-hadoop, scalding-fabric-hadoop2-mr1,
            scalding-fabric-tez, etc.) and to push the fabric-specific
            code from Mode.scala there?

            (We'd keep only a single fabric-related factory using
            reflection in scalding, with the appropriate interfaces
            defined in scalding-core; see the sketch after these
            questions.)

         2. Pushing the fabric-specific code into dedicated jars would
            probably have user-visible consequences, as we can't make
            scalding-core depend on scalding-fabric-hadoop (for
            back-compatibility) unless the fabric-factory interface
            goes into another jar.

            From my point of view, intentionally breaking the build
            slightly once, on upgrade, for the purpose of letting the
            world know that there are fabrics other than MR1, would be
            acceptable; then again, I haven't used MR1 for over a year.

            Is this "slight" dependency breakage acceptable, or is it
            better to have scalding-core keep implying the hadoop
            fabrics?

         3. Right now, scalding's internals sometimes use Hadoop (MR)
            specifics to carry various configuration values. Is it
            acceptable to continue doing so (at least in the
            beginning), kindly asking the respective non-hadoop fabrics
            to pick these values up and convert them to the relevant
            APIs?

         4. Is it okay to drop the @deprecated(..., "0.12.0")
            functions from Mode.scala if they are inconvenient to
            carry over in the process?

         5. Currently, Job.buildFlow
            (https://github.com/twitter/scalding/blob/7ed0f92a946ad8407645695d3def62324f78ac41/scalding-core/src/main/scala/com/twitter/scalding/Job.scala#L223)
            returns Flow[_]. Is it okay to have it return Flow[_] with
            ScaldingFlowGoodies instead, ScaldingFlowGoodies being the
            provisional name of the interface where the old "flow match
            { case HadoopFlow => ... }" code would move?
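
        (To make questions 1 and 5 concrete, here is the rough shape I
        have in mind -- a sketch only: every name below is provisional,
        and none of it exists in scalding today.)

            import cascading.flow.Flow

            // Provisional extension interface, to live in
            // scalding-core. It absorbs the old `flow match { case
            // HadoopFlow => ... }` branches: each fabric maps these
            // calls onto its own configuration keys
            // (mapred.reduce.tasks on MR, gather partitions on Tez...).
            trait ScaldingFlowGoodies {
              def setNumReducers(reducers: Int): Unit
              def setStepDescriptions(descriptions: Seq[String]): Unit
            }

            // Provisional per-fabric factory; each scalding-fabric-*
            // jar would ship one implementation, and scalding-core
            // would keep a single reflection-based lookup over them.
            trait FabricFactory {
              /** True if this jar serves the requested mode name. */
              def accepts(modeName: String): Boolean
              /** The fabric-specific Flow, carrying the extension so
                * Job.buildFlow can return Flow[_] with
                * ScaldingFlowGoodies. */
              def newFlow(): Flow[_] with ScaldingFlowGoodies
            }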

        Thanks in advance

            -- Cyrille





    -- Piyush
