Re: [cascading3 branch] descriptions & parallelism

'Oscar Boykin' via Scalding Development Wed, 12 Oct 2016 14:12:07 -0700

sounds great!

On Tue, Oct 11, 2016 at 11:39 PM Cyrille Chépélov <
[email protected]> wrote:


> Oscar, Piyush,
>
> thanks for the feedback!
>
> At the moment, I'm not sure it's realistic to fully break the dependency
> to "hadoop" completely out of scalding-core. As an intermediate goal, I'd
> shoot for at least soft-removing the assumption that the *processing* is
> made on Hadoop, but the storage interface will pretty much remain HDFS for
> the time being (IOW, I'll leave Source essentially unchanged in
> scalding-core).
>
> Meanwhile, I'm taking the messages here and on the gitter channel as
> positive towards the principle of scalding-$FABRIC sub-modules, and will
> start working on that in the background.
>
>
>     -- Cyrille
>
>
> Le 12/10/2016 à 03:29, 'Oscar Boykin' via Scalding Development a écrit :
>
> Generally, I think this is a good idea also (separate modules for
> fabrics).
>
> I agree that Mode and Job are a bit hairy in spots. I think we can remove
> some deprecated code if it makes life significantly easier, but source and
> binary compatibility should be kept as much as we can reasonably manage.
>
> I would actually really rather `buildFlow` be private[scalding] but maybe
> that is too much. Making it return a subclass of Flow seems like a fine
> idea to me at the moment.
>
> Breaking hadoop out of scalding-core seems pretty hard since `Source` has
> it baked in at a few spots. That said, the Source abstractions in scalding
> are not very great. If we could improve that (without removing support for
> the old stuff) it might be worth it. Many have complained about Source's
> design over the years, but we have not really had a full proposal that
> seems to address all the concerns.
>
> The desire for jobs to all look the same across all fabrics make
> modularization a bit ugly.
>
> On Tue, Oct 11, 2016 at 2:23 PM 'Piyush Narang' via Scalding Development <
> [email protected]> wrote:
>
> We ran into similar problems while trying to set the number of reducers
> while testing out Cascading3 on Tez. We hacked around it temporarily
> <https://github.com/twitter/scalding/commit/57983601c7db4ef1e0df3350140d473f371e6bb3>
>  but
> haven't yet cleaned up that code and put it out for review (we'll need to
> fork MR / Tez there as nodeConfigDef works for Tez but not Hadoop). Based
> on my understanding, so far we've tried to delegate as much of this to
> Cascading as we can but there seem to be a few places where we're doing
> some platform specific stuff in Scalding. Breaking up to create
> fabric-specific sub-modules seems like a nice idea to me. We might need to
> think through the right way to do this to ensure we don't break stuff.
> Would it make sense to spin up an issue and we can discuss on it?
>
> On Tue, Oct 11, 2016 at 10:42 AM, Cyrille Chépélov <
> [email protected]> wrote:
>
> Hi,
>
> I'm trying to tie a few loose ends in the way step descriptions (text
> typically passed via *.withDescriptions(...)*) and the desired level of
> parallelism (typically passed via *.withReducers(N)*) is pushed on the
> various fabrics.
>
> Right now:
>
>    - Most of the scalding code base either ignores the back-end (good) or
>    assumes
>    
> <https://github.com/twitter/scalding/blob/7ed0f92a946ad8407645695d3def62324f78ac41/scalding-core/src/main/scala/com/twitter/scalding/ExecutionContext.scala#L81>
>    the world is either Local or HadoopFlow (which covers Hadoop 1.x and MR1).
>    As a consequence, a couple things don't yet work smoothly on Tez and I
>    assume on Flink.
>    - the descriptions are entirely dropped if not running on Hadoop1 or
>    MR1
>    - .withReducers sets a hadoop-specific property (*mapred*.*reduce*.
>    *tasks*) at RichPipe#L41
>    
> <https://github.com/twitter/scalding/blob/7ed0f92a946ad8407645695d3def62324f78ac41/scalding-core/src/main/scala/com/twitter/scalding/RichPipe.scala#L41>
>    - the Tez fabric ignores .withReducers; and there is no other conduit
>    (for now) to set the number of desired parts on the sinks. As a
>    consequence, you can't run a tez DAG with a large level of parallelism and
>    a small (single) number of output files (e.g. stats leading to a result
>    file of a couple dozen lines); you must pick one and set
>    *cascading.flow.runtime.gather.partitions.num*. There are workarounds,
>    but they're quite ugly.
>    - there are a few instance of "flow match { case HadoopFlow =>
>    doSomething ; case _ => () }" scattered around the code
>    - there's some heavily reflection-based code in Mode.scala
>    
> <https://github.com/twitter/scalding/blob/7ed0f92a946ad8407645695d3def62324f78ac41/scalding-core/src/main/scala/com/twitter/scalding/Mode.scala#L75>
>    which depends on jars not part of the scalding build process (and it's good
>    that these jars stay out of the scalding-core build, e.g. Tez client
>    libraries)
>    - While it may be desirable to experiment with scalding-specific
>    transform registries for cascading (e.g. to deal with the Merge-GroupBy
>    structure, or to perform tests/assertions on the resulting flow graph), it
>    would be impractical to perform the necessary fabric-specific adjustments
>    in Mode.scala as it is.
>
> I'm trying to find a way to extract away the MR-isms, and push it into
> fabric-specific code which can be called when appropriate.
>
> Questions:
>
>    1. Would it be appropriate to start having fabric-specific jars
>    (scalding-fabric-hadoop, scalding-fabric-hadoop2-mr1, scalding-fabric-tez
>    etc.), push the fabric-specific code from Mode.scala there ?
>
>    (we'd keep only the single scalding fabric-related factory using
>    reflection, with appropriate interfaces defined in scalding-core)
>
>    2. Pushing the fabric-specific code into dedicated jars would probably
>    have user-visible consequences, as we can't make scalding-core depend on
>    scalding-fabric-hadoop (for back-compatibility) unless the fabric-factory
>    interface go into another jar.
>
>    From my point of view, I would find that intentionally slightly
>    breaking the build once upon upgrade for the purpose of letting the world
>    know that there are other fabrics than MR1 might be acceptable, and on the
>    other hand I haven't used MR1 for over a year.
>
>    Is this "slight" dependency breakage acceptable, or is it better to
>    have scalding-core still imply the hadoop fabrics?
>
>    3. Right now, scalding's internals sometimes use Hadoop (MR) specifics
>    to carry various configuration values. Is it acceptable to (at least in the
>    beginning) continue doing so, kindling asking the respective non-hadoop
>    fabrics to pick these values up and convert to the relevant APIs?
>
>    4. Is it okay to drop the @deprecated(..., "0.12.0") functions from
>    Mode.scala if they are inconvenient to carry over in the process?
>
>    5. Currently, Job.buildFlow
>    
> <https://github.com/twitter/scalding/blob/7ed0f92a946ad8407645695d3def62324f78ac41/scalding-core/src/main/scala/com/twitter/scalding/Job.scala#L223>
>    returns Flow[_]. Is it okay to have it return Flow[_] with
>    ScaldingFlowGoodies instead, ScaldingFlowGoodies being the provisional
>    interface name where to move the old "flow match { case HadoopFlow =>
>    ... }" code?
>
> Thanks in advance
>
>     -- Cyrille
>
> --
> You received this message because you are subscribed to the Google Groups
> "Scalding Development" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
>
>
>
>
> --
> - Piyush
> --
> You received this message because you are subscribed to the Google Groups
> "Scalding Development" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
>
> --
> You received this message because you are subscribed to the Google Groups
> "Scalding Development" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "Scalding Development" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
>

-- 
You received this message because you are subscribed to the Google Groups 
"Scalding Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.

Re: [cascading3 branch] descriptions & parallelism

Reply via email to