[ 
https://issues.apache.org/jira/browse/BEAM-7864?focusedWorklogId=301854&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-301854
 ]

ASF GitHub Bot logged work on BEAM-7864:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 27/Aug/19 10:05
            Start Date: 27/Aug/19 10:05
    Worklog Time Spent: 10m 
      Work Description: RyanSkraba commented on issue #9410: [BEAM-7864] 
Simplify/generalize Spark reshuffle translation
URL: https://github.com/apache/beam/pull/9410#issuecomment-525234062
 
 
   Still LGTM -- as I mentioned, I'm not entirely sure *why* the original 
implementation is written the way it is!
   
   It seems that there could be some useful discussion around `Reshuffle` and 
its contract -- if it weren't deprecated!
   
   As far as I can tell, the *intention* of `Reshuffle.of` is to ensure that 
the "upstream" is materialized, and if upstream is ever rematerialized for any 
reason, the resulting partitions are deterministic.  We don't care whether 
partition sizes are skewed.
   
   The reference implementation implies that records with the same key are in 
the same partition, but the Spark replacement has never done this.  It also 
rebalances the partitions (deterministically) whether we care or not.
   
   The *intention* of `Reshuffle.viaRandomKey` is to additionally balance the 
partitions.  We don't care whether the results are deterministic.
   
   The Spark implementation rebalances the partitions, but it would have done 
that anyway if the entire `.viaRandomKey` were replaced with an `.of`.  The 
entire `viaRandomKey()` translation is unnecessary cruft in Spark _unless_ 
random repartitioning is a requirement.  Is it?
   
   `I want materialization of partitions of approximately the same size`  does 
not mean `I need the data to be randomly assigned to partitions.`
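   
   A minimal sketch of that distinction, using only the Java standard library. 
It is not taken from the Beam or Spark code; `PartitionSketch`, `roundRobin` 
and `randomKey` are hypothetical names used for illustration only. Both 
strategies aim for partitions of roughly equal size, but only the first 
yields the same assignment every time the input is rematerialized.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustration only: hypothetical helper, not part of Beam or Spark.
final class PartitionSketch {

    // Deterministic balancing: element i always lands in partition i % n,
    // so recomputing the same input yields the same partitions.
    static <T> List<List<T>> roundRobin(List<T> input, int numPartitions) {
        List<List<T>> parts = emptyPartitions(numPartitions);
        for (int i = 0; i < input.size(); i++) {
            parts.get(i % numPartitions).add(input.get(i));
        }
        return parts;
    }

    // Random balancing: each element gets a random partition index.
    // Sizes are only balanced on average, and the assignment changes
    // between runs unless the caller fixes the seed of rng.
    static <T> List<List<T>> randomKey(List<T> input, int numPartitions, Random rng) {
        List<List<T>> parts = emptyPartitions(numPartitions);
        for (T element : input) {
            parts.get(rng.nextInt(numPartitions)).add(element);
        }
        return parts;
    }

    private static <T> List<List<T>> emptyPartitions(int n) {
        List<List<T>> parts = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            parts.add(new ArrayList<>());
        }
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> data = List.of(1, 2, 3, 4, 5, 6, 7);
        System.out.println(roundRobin(data, 3));
        System.out.println(randomKey(data, 3, new Random()));
    }
}
```

   In this sketch `roundRobin` meets both goals (similar sizes, repeatable 
assignment), while `randomKey` gives up repeatability without gaining any 
better balance.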
   
   *Anyway* sorry for the sidetrack!  I just feel like I might be missing a 
piece here and would welcome clarity  :D  In my opinion, all of the current 
usages of `Reshuffle.of()` and `viaRandomKey()` are valid with this 
re-implementation, except for the Deduplicate in GDF, which isn't relevant.
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 301854)
    Time Spent: 1h 50m  (was: 1h 40m)

> Portable Spark Reshuffle coder cast exception
> ---------------------------------------------
>
>                 Key: BEAM-7864
>                 URL: https://issues.apache.org/jira/browse/BEAM-7864
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-spark
>            Reporter: Kyle Weaver
>            Assignee: Kyle Weaver
>            Priority: Major
>              Labels: portability-spark
>          Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Running :sdks:python:test-suites:portable:py35:portableWordCountBatch in 
> either loopback or docker mode on master fails with exception:
>  
> java.lang.ClassCastException: org.apache.beam.sdk.coders.LengthPrefixCoder 
> cannot be cast to org.apache.beam.sdk.coders.KvCoder
>  at 
> org.apache.beam.runners.spark.translation.SparkBatchPortablePipelineTranslator.translateReshuffle(SparkBatchPortablePipelineTranslator.java:400)
>  at 
> org.apache.beam.runners.spark.translation.SparkBatchPortablePipelineTranslator.translate(SparkBatchPortablePipelineTranslator.java:147)
>  at 
> org.apache.beam.runners.spark.SparkPipelineRunner.lambda$run$1(SparkPipelineRunner.java:96)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
