mosche commented on code in PR #24009:
URL: https://github.com/apache/beam/pull/24009#discussion_r1026211619
##########
runners/spark/3/src/main/java/org/apache/beam/runners/spark/structuredstreaming/translation/batch/PipelineTranslatorBatch.java:
##########
@@ -81,27 +80,13 @@ public class PipelineTranslatorBatch extends
PipelineTranslator {
TRANSFORM_TRANSLATORS.put(
SplittableParDo.PrimitiveBoundedRead.class, new
ReadSourceTranslatorBatch<>());
-
Review Comment:
This is unrelated to #24035; see the comment below:
> The PCollectionView translation just stored the same Spark dataset
> (a reference!) again under a different PTransform. That's obviously
> problematic for caching, as we're not gathering metadata on that dataset
> in a single place. Also, the Beam runner guidelines discourage translating
> PCollectionView; it only exists for legacy reasons.
In terms of preparation for #24035, that's mostly the introduction of
`TranslationResult` to capture all kinds of metadata / context about a
specific Spark dataset.
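
For illustration only, here is a hypothetical sketch of what such a `TranslationResult` holder could look like. The class name matches the comment above, but every field and method here is an assumption, not the actual Beam implementation; a generic type parameter stands in for the Spark `Dataset`:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch (not the actual Beam code): gather the translated
// dataset and all metadata about it in a single place, instead of storing
// the same dataset reference again under multiple PTransforms.
public class TranslationResult<T> {
  private final T dataset;                                // stand-in for the Spark Dataset
  private final Set<String> dependents = new HashSet<>(); // transforms consuming this dataset
  private boolean cached = false;                         // caching decision, tracked once

  public TranslationResult(T dataset) {
    this.dataset = dataset;
  }

  public T getDataset() {
    return dataset;
  }

  // Record a consuming transform; multiple consumers hint that caching pays off.
  public void addDependent(String transformName) {
    dependents.add(transformName);
  }

  public int dependentCount() {
    return dependents.size();
  }

  public void markCached() {
    cached = true;
  }

  public boolean isCached() {
    return cached;
  }
}
```

With one holder per dataset, a caching heuristic can inspect `dependentCount()` in a single place rather than chasing duplicate references.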
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]