[
https://issues.apache.org/jira/browse/BEAM-10148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17548624#comment-17548624
]
Danny McCormick commented on BEAM-10148:
----------------------------------------
This issue has been migrated to https://github.com/apache/beam/issues/20247
> Data loaded using ReadAllFromText is not projected properly as side input
> -------------------------------------------------------------------------
>
> Key: BEAM-10148
> URL: https://issues.apache.org/jira/browse/BEAM-10148
> Project: Beam
> Issue Type: Bug
> Components: io-py-files
> Affects Versions: 2.19.0
> Environment: Runner: Dataflow Runner
> Beam SDK: Python 2.19
> Reporter: Prathap Kumar Parvathareddy
> Priority: P3
>
> *Context:*
> Data Enrichment Pattern with 2 Sources.
> Source 1: Google Cloud PubSub delivering JSON messages
> Source 2: Google Cloud Storage Files (acts as a Side Input)
> *Steps:*
> 1. Load the data from GCS (Google Cloud Storage) using ReadAllFromText based
> on file path inside a PCollection and convert each record as a tuple with 2
> fields
> 2. Project the Tuple loaded in step 1 as a side input using Pvalue.ASDict to
> the main input that is being loaded from PubSub.
> 3. Expectation is that the side inputs should be available but for some
> reason AsDict is not containing all the data that was loaded from GCS
>
> *Possible Issues:*
> Below are few possible issues that can be ruled out as I already
> validated them.
> # Window Mismatches - Main Input window is within the scope of Side Input
> window.
> # Delay in updating Side Input State - Pipeline has just 1 VM and Side
> Input has only 8 json messages and total size of side input is around 40 KB
>
> *Troubleshooting:*
> # Working fine in a DirectRunner
> # Validated that ReadAllFromText transform is loading all the data properly
> from multiple files with subsequent transform building KV as well. However
> when the output of KV transform is used as a sideinput using ASDict() for
> some reason certain elements are skipped and not available inside look up.
> *Code*
> [Example
> Code|https://github.com/GoogleCloudPlatform/professional-services/blob/master/examples/dataflow-python-examples/streaming-examples/slowlychanging-sideinput/sideinput_refresh/main.py#L234]
> for reference
>
>
>
--
This message was sent by Atlassian Jira
(v8.20.7#820007)