Sounds great!

On Tue, Oct 11, 2016 at 11:39 PM Cyrille Chépélov <
c...@transparencyrights.com> wrote:

> Oscar, Piyush,
>
> thanks for the feedback!
>
> At the moment, I'm not sure it's realistic to break the dependency on
> "hadoop" completely out of scalding-core. As an intermediate goal, I'd
> shoot for at least soft-removing the assumption that the *processing* is
> done on Hadoop; the storage interface will pretty much remain HDFS for
> the time being (IOW, I'll leave Source essentially unchanged in
> scalding-core).
>
> Meanwhile, I'm taking the messages here and on the gitter channel as
> positive towards the principle of scalding-$FABRIC sub-modules, and will
> start working on that in the background.
>
>
>     -- Cyrille
>
>
> On 12/10/2016 at 03:29, 'Oscar Boykin' via Scalding Development wrote:
>
> Generally, I think this is a good idea also (separate modules for
> fabrics).
>
> I agree that Mode and Job are a bit hairy in spots. I think we can remove
> some deprecated code if it makes life significantly easier, but source and
> binary compatibility should be kept as much as we can reasonably manage.
>
> I would really rather `buildFlow` were private[scalding], but maybe that
> is too much. Making it return a subclass of Flow seems like a fine idea
> to me at the moment.
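>
> Roughly, the shape could be (just a sketch, nothing here is final;
> `ScaldingFlowGoodies` is the provisional name from Cyrille's question 5
> below, and `JobSketch` is a hypothetical stand-in for Job):
>
>     package com.twitter.scalding
>
>     import cascading.flow.Flow
>
>     trait ScaldingFlowGoodies // provisional marker for fabric-aware flows
>
>     abstract class JobSketch {
>       // today: def buildFlow: Flow[_]
>       // proposed: narrowed return type, restricted visibility
>       private[scalding] def buildFlow: Flow[_] with ScaldingFlowGoodies
>     }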
>
> Breaking hadoop out of scalding-core seems pretty hard, since `Source` has
> it baked in at a few spots. That said, the Source abstractions in scalding
> are not great. If we could improve them (without removing support for the
> old stuff) it might be worth it. Many have complained about Source's
> design over the years, but we have not really had a full proposal that
> addresses all the concerns.
>
> The desire for jobs to all look the same across all fabrics makes
> modularization a bit ugly.
>
> On Tue, Oct 11, 2016 at 2:23 PM 'Piyush Narang' via Scalding Development <
> scalding-dev@googlegroups.com> wrote:
>
> We ran into similar problems while trying to set the number of reducers
> while testing out Cascading3 on Tez. We hacked around it temporarily
> <https://github.com/twitter/scalding/commit/57983601c7db4ef1e0df3350140d473f371e6bb3>
> but haven't yet cleaned up that code and put it out for review (we'll need
> to fork MR / Tez there, as nodeConfigDef works for Tez but not Hadoop).
> Based on my understanding, so far we've tried to delegate as much of this
> to Cascading as we can, but there seem to be a few places where we're
> doing some platform-specific stuff in Scalding. Breaking things up into
> fabric-specific sub-modules seems like a nice idea to me. We might need to
> think through the right way to do this to ensure we don't break stuff.
> Would it make sense to spin up an issue so we can discuss it there?
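>
> For reference, the hack is essentially this (rough sketch from memory,
> using Cascading 3's per-node ConfigDef; the helper object is made up for
> illustration):
>
>     import cascading.pipe.Pipe
>     import cascading.property.ConfigDef
>
>     object GatherPartitions { // hypothetical helper, for illustration
>       // ask the Cascading 3 runtime to gather into `parts` partitions on
>       // the node containing this pipe (honoured by Tez, not by plain MR)
>       def set(pipe: Pipe, parts: Int): Pipe = {
>         pipe.getNodeConfigDef.setProperty(
>           ConfigDef.Mode.REPLACE,
>           "cascading.flow.runtime.gather.partitions.num",
>           parts.toString)
>         pipe
>       }
>     }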
>
> On Tue, Oct 11, 2016 at 10:42 AM, Cyrille Chépélov <
> c...@transparencyrights.com> wrote:
>
> Hi,
>
> I'm trying to tie up a few loose ends in the way step descriptions (text
> typically passed via *.withDescriptions(...)*) and the desired level of
> parallelism (typically passed via *.withReducers(N)*) are pushed down to
> the various fabrics.
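>
> (For concreteness, the two knobs look something like this in a typed job;
> a minimal sketch, the job itself is made up:
>
>     import com.twitter.scalding._
>
>     class WordCountJob(args: Args) extends Job(args) {
>       TypedPipe.from(TextLine(args("input")))
>         .flatMap(_.split("\\s+"))
>         .map(w => (w, 1L))
>         .sumByKey
>         .withReducers(200)                   // desired parallelism
>         .toTypedPipe
>         .withDescription("per-word counts")  // step description
>         .write(TypedTsv[(String, Long)](args("output")))
>     }
>
> Both should eventually reach the fabric, whatever it is.)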
>
> Right now:
>
>    - Most of the scalding code base either ignores the back-end (good) or
>    assumes
>    <https://github.com/twitter/scalding/blob/7ed0f92a946ad8407645695d3def62324f78ac41/scalding-core/src/main/scala/com/twitter/scalding/ExecutionContext.scala#L81>
>    the world is either Local or HadoopFlow (which covers Hadoop 1.x and MR1).
>    As a consequence, a couple of things don't yet work smoothly on Tez and,
>    I assume, on Flink.
>    - the descriptions are entirely dropped when not running on Hadoop1 or
>    MR1
>    - .withReducers sets a hadoop-specific property (*mapred.reduce.tasks*)
>    at RichPipe#L41
>    <https://github.com/twitter/scalding/blob/7ed0f92a946ad8407645695d3def62324f78ac41/scalding-core/src/main/scala/com/twitter/scalding/RichPipe.scala#L41>
>    - the Tez fabric ignores .withReducers; and there is no other conduit
>    (for now) to set the desired number of parts on the sinks. As a
>    consequence, you can't run a Tez DAG with a large level of parallelism and
>    a small (single) number of output files (e.g. stats leading to a result
>    file of a couple dozen lines); you must pick one and set
>    *cascading.flow.runtime.gather.partitions.num*. There are workarounds,
>    but they're quite ugly.
>    - there are a few instances of "flow match { case HadoopFlow =>
>    doSomething ; case _ => () }" scattered around the code (illustrated
>    just after this list)
>    - there's some heavily reflection-based code in Mode.scala
>    
> <https://github.com/twitter/scalding/blob/7ed0f92a946ad8407645695d3def62324f78ac41/scalding-core/src/main/scala/com/twitter/scalding/Mode.scala#L75>
>    which depends on jars not part of the scalding build process (and it's good
>    that these jars stay out of the scalding-core build, e.g. Tez client
>    libraries)
>    - While it may be desirable to experiment with scalding-specific
>    transform registries for cascading (e.g. to deal with the Merge-GroupBy
>    structure, or to perform tests/assertions on the resulting flow graph), it
>    would be impractical to perform the necessary fabric-specific adjustments
>    in Mode.scala as it is.
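>
> Concretely, that dispatch looks like this (illustrative sketch, not a
> verbatim quote of the code; the helper object is made up):
>
>     import cascading.flow.Flow
>     import cascading.flow.hadoop.HadoopFlow
>
>     object FabricDispatch { // hypothetical helper, for illustration
>       def applyFabricSpecifics(flow: Flow[_]): Unit =
>         flow match {
>           case _: HadoopFlow => () // set mapred.* properties on the steps
>           case _             => () // Tez, Flink, ...: silently dropped
>         }
>     }
>
> This is exactly the kind of code I'd like to push behind a per-fabric
> interface.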
>
> I'm trying to find a way to extract away the MR-isms and push them into
> fabric-specific code that can be called when appropriate.
>
> Questions:
>
>    1. Would it be appropriate to start having fabric-specific jars
>    (scalding-fabric-hadoop, scalding-fabric-hadoop2-mr1, scalding-fabric-tez,
>    etc.) and push the fabric-specific code from Mode.scala there?
>
>    (we'd keep only a single reflection-based fabric factory, with the
>    appropriate interfaces defined in scalding-core; see the sketch after
>    question 5)
>
>    2. Pushing the fabric-specific code into dedicated jars would probably
>    have user-visible consequences, as we can't make scalding-core depend on
>    scalding-fabric-hadoop (for back-compatibility) unless the fabric-factory
>    interfaces go into yet another jar.
>
>    From my point of view, intentionally breaking the build slightly upon
>    upgrade, in order to let the world know that there are fabrics other
>    than MR1, might be acceptable; then again, I haven't used MR1 for over
>    a year.
>
>    Is this "slight" dependency breakage acceptable, or is it better to
>    have scalding-core still imply the hadoop fabrics?
>
>    3. Right now, scalding's internals sometimes use Hadoop (MR) specifics
>    to carry various configuration values. Is it acceptable to (at least in
>    the beginning) continue doing so, kindly asking the respective non-hadoop
>    fabrics to pick these values up and convert them to the relevant APIs?
>
>    4. Is it okay to drop the @deprecated(..., "0.12.0") functions from
>    Mode.scala if they are inconvenient to carry over in the process?
>
>    5. Currently, Job.buildFlow
>    <https://github.com/twitter/scalding/blob/7ed0f92a946ad8407645695d3def62324f78ac41/scalding-core/src/main/scala/com/twitter/scalding/Job.scala#L223>
>    returns Flow[_]. Is it okay to have it return Flow[_] with
>    ScaldingFlowGoodies instead, ScaldingFlowGoodies being the provisional
>    name of the interface into which the old "flow match { case HadoopFlow =>
>    ... }" code would move? (see the sketch below)
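>
> To make questions 1 and 5 concrete, here is the rough shape I have in
> mind (every name is provisional; a sketch, not a design):
>
>     package com.twitter.scalding
>
>     import cascading.flow.Flow
>
>     // would live in scalding-core, or in a tiny scalding-fabric-api jar
>     // if scalding-core must not imply any fabric (cf. question 2)
>     trait ScaldingFlowGoodies { self: Flow[_] =>
>       /** receives the text passed via .withDescriptions(...) */
>       def setStepDescriptions(descriptions: Seq[String]): Unit
>       /** receives the parallelism passed via .withReducers(N) */
>       def setDesiredParallelism(numParts: Int): Unit
>     }
>
>     // one implementation per scalding-fabric-* jar
>     trait FabricFactory {
>       def newFlow(config: Map[String, String]): Flow[_] with ScaldingFlowGoodies
>     }
>
>     object FabricFactory {
>       // the single remaining reflection point, kept in scalding-core
>       def forName(className: String): FabricFactory =
>         Class.forName(className).newInstance().asInstanceOf[FabricFactory]
>     }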
>
> Thanks in advance
>
>     -- Cyrille
>
>
>
>
>
> --
> - Piyush
>
>
>
>
