I did a pull request that contains my suggestions. This also fixes the problem that you have multiple repartition operators if you do several map operators on a repartition. Each one of them would get their own private repartition.
What do you guys think? On Thu, Sep 25, 2014 at 10:44 AM, Fabian Hueske <[email protected]> wrote: > I wanted the change to be as less invasive as possible and there were a few > tricky details. > For example the DOP of a partition operator must be the same as the > following operator (otherwise data is reshuffled again). > Also multiple consumers of a repartition are a nasty detail (esp. if they > have different DOPs). > > These things were a bit tricky to integrate into the current translation > mechanism without major changes. > Right now, each Map op injects its own partitioning operator before itself > and sets the DOP to its own DOP. > That's why the logic for that was put into the Map ops. A drawback of this > approach is, that there is no reuse of repartitionings even if it would be > possible... > > There might be smarter ways to implement this, but I think my (limited) > solution works correct in all cornercase that came to my mind. > > 2014-09-25 10:27 GMT+02:00 Aljoscha Krettek <[email protected]>: > >> Yes, I see your points on why you limited it to the map-style >> operators. My gripe with it is that it moves knowledge about the >> Partition Operator into some operators that now need special-case >> handling and that these operations are duplicated on >> PartitionedDataSet. If we have Partition as just another operator >> there would be no special-case handling and the integration would be >> scalable (in a sense that we don't have to touch other operators, or >> have to add partition support if we add another map-style operator). >> >> My view is that users should know what they are doing if the add a >> repartition before a reduce. In the future the optimizer could catch >> those cases and merge a custom repartition with a reduce if there are >> no other consumers of the repartitioning. >> >> Cheers, >> Aljoscha >> >> On Thu, Sep 25, 2014 at 10:13 AM, Fabian Hueske <[email protected]> >> wrote: >> > Hi, >> > >> > I limited explicit repartitioning to map-like operatators because Map >> > operations should be by far the most common use cases (rebalancing for >> > expensive mappers, repartition for partitionMap, ...) and I doubt the >> > benefits of repartitioning before other operators. >> > Right now, partitioning is implemented as an own runtime operator with >> > noop-driver. Hence, there would be a serialization overhead if subsequent >> > operators cannot be chained (i.e., join). >> > Also having a partitioning in front of a combinable reduce would make >> > things complicated, because the partitioning should be moved between >> > combiner and reducer. >> > Overall, I think, there are not many benefits of offering repartitioning >> > for all operators and if you still want it, you can use an identity >> mapper. >> > >> > Does that make sense? >> > If not, we can also change the implementation without breaking the API >> and >> > offer repartitioning for all operators. ;-) >> > >> > Anyway, I'm with you that we should think about how to integrate physical >> > operators (partition, sort, combine) into the API. >> > >> > Cheers, Fabian >> > >> > >> > 2014-09-24 17:00 GMT+02:00 Aljoscha Krettek <[email protected]>: >> > >> >> Hi, >> >> (mostly @fabian) why is the re-partition operator special-cased like >> >> it is now? You can only do map/filter on partitioned data. Wouldn't it >> >> be nice if we had a general re-partition operator. Operators that >> >> normally do their own repartitioning would notice that the data is >> >> already partitioned and use that. The way it is now, the >> >> implementation relies heavily on special-case handling. >> >> >> >> In the long run we could even introduce combine and sort as special >> >> operators that users could insert themselves. The optimiser would then >> >> also insert these before operations when required. This would >> >> simplify/generalise things a bit. >> >> >> >> What do you think? >> >> >> >> Cheers, >> >> Aljoscha >> >> >>
