[jira] [Updated] (BEAM-10148) Data loaded using ReadAllFromText is not projected properly as side input

Prathap Kumar Parvathareddy (Jira) Thu, 28 May 2020 16:16:30 -0700


     [ 
https://issues.apache.org/jira/browse/BEAM-10148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Prathap Kumar Parvathareddy updated BEAM-10148:
-----------------------------------------------
    Description: 
*Context:*

Data Enrichment Pattern with 2 Sources. 

Source 1:  Google Cloud PubSub delivering JSON messages

Source 2:  Google Cloud Storage Files (acts as a Side Input) 

*Steps:* 

1. Load the data from GCS (Google Cloud Storage) using ReadAllFromText based on 
file path inside a PCollection and convert each record as a tuple with 2 fields

2. Project the Tuple loaded in step 1 as a side input using Pvalue.ASDict to 
the main input that is being loaded from PubSub.

3. Expectation is that the side inputs should be available but for some reason 
AsDict is not containing all the data that was loaded from GCS

 

*Possible Issues:*

      Below are few possible issues that can be ruled out as I already 
validated them.
 # Window Mismatches -  Main Input window is within the scope of Side Input 
window.
 # Delay in updating Side Input State -  Pipeline has just 1 VM and Side Input 
has only 8 json messages and total size of side input is around 40 KB

 

*Troubleshooting:*
 # Working fine in a DirectRunner   
 # Validated that ReadAllFromText transform is loading all the data properly 
from multiple files with subsequent transform building KV as well. However when 
the output of KV transform is used as a sideinput using ASDict() for some 
reason certain elements are skipped and not available inside look up.

*Code* 

  Will update the link pointing to complete code shortly which helps in 
providing more visibility.

 

 

 

  was:
*Context:*

Data Enrichment Pattern with 2 Sources. 

Source 1:  Google Cloud PubSub delivering JSON messages

Source 2:  Google Cloud Storage Files (acts as a Side Input) 

*Steps:* 

1. Load the data from GCS (Google Cloud Storage) using ReadAllFromText based on 
file path inside a PCollection and convert each record as a tuple with 2 fields

2. Project the Tuple loaded in step 1 as a side input using Pvalue.ASDict to 
the main input that is being loaded from PubSub.

3. Expectation is that the side inputs should be available but for some reason 
AsDict is not containing all the data that was loaded from GCS

*Possible Issues:*
 # Window Mismatches -  Main Input window is within the scope of Side Input 
window.
 # Delay in updating Side Input State -  Pipeline has just 1 VM and Side Input 
has only 8 json messages and total size of side input is around 40 KB

 

*Troubleshooting:*
 # Working fine in a DirectRunner   
 # Validated that ReadAllFromText transform is loading all the data properly 
from multiple files with subsequent transform building KV as well. However when 
the output of KV transform is used as a sideinput using ASDict() for some 
reason certain elements are skipped and not available inside look up.

*Code* 

  Will update the link pointing to complete code shortly which helps in 
providing more visibility.

 

 

 


> Data loaded using ReadAllFromText is not projected properly as side input
> -------------------------------------------------------------------------
>
>                 Key: BEAM-10148
>                 URL: https://issues.apache.org/jira/browse/BEAM-10148
>             Project: Beam
>          Issue Type: Bug
>          Components: io-py-files
>    Affects Versions: 2.19.0
>         Environment: Runner: Dataflow Runner
> Beam SDK:  Python 2.19
>            Reporter: Prathap Kumar Parvathareddy
>            Priority: P2
>
> *Context:*
> Data Enrichment Pattern with 2 Sources. 
> Source 1:  Google Cloud PubSub delivering JSON messages
> Source 2:  Google Cloud Storage Files (acts as a Side Input) 
> *Steps:* 
> 1. Load the data from GCS (Google Cloud Storage) using ReadAllFromText based 
> on file path inside a PCollection and convert each record as a tuple with 2 
> fields
> 2. Project the Tuple loaded in step 1 as a side input using Pvalue.ASDict to 
> the main input that is being loaded from PubSub.
> 3. Expectation is that the side inputs should be available but for some 
> reason AsDict is not containing all the data that was loaded from GCS
>  
> *Possible Issues:*
>       Below are few possible issues that can be ruled out as I already 
> validated them.
>  # Window Mismatches -  Main Input window is within the scope of Side Input 
> window.
>  # Delay in updating Side Input State -  Pipeline has just 1 VM and Side 
> Input has only 8 json messages and total size of side input is around 40 KB
>  
> *Troubleshooting:*
>  # Working fine in a DirectRunner   
>  # Validated that ReadAllFromText transform is loading all the data properly 
> from multiple files with subsequent transform building KV as well. However 
> when the output of KV transform is used as a sideinput using ASDict() for 
> some reason certain elements are skipped and not available inside look up.
> *Code* 
>   Will update the link pointing to complete code shortly which helps in 
> providing more visibility.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Updated] (BEAM-10148) Data loaded using ReadAllFromText is not projected properly as side input

Reply via email to