Olivier NGUYEN QUOC created BEAM-2490:
-----------------------------------------

             Summary: ReadFromText function not taking all data with glob 
operator (*) 
                 Key: BEAM-2490
                 URL: https://issues.apache.org/jira/browse/BEAM-2490
             Project: Beam
          Issue Type: Bug
          Components: sdk-py
    Affects Versions: 2.0.0
         Environment: Usage with Google Cloud Platform: Dataflow runner
            Reporter: Olivier NGUYEN QUOC
            Assignee: Ahmet Altay


I run a very simple pipeline:
* Read my files from storage
* Split with '\n' char
* Write in on a Google Cloud Storage

I have 6 files matching with the pattern:
* my_files_2016090116_20160902_060051_xxxxxxxxxx.csv.gz (229.25 MB)
* my_files_2016090117_20160902_060051_xxxxxxxxxx.csv.gz (184.1 MB)
* my_files_2016090118_20160902_060051_xxxxxxxxxx.csv.gz (171.73 MB)
* my_files_2016090119_20160902_060051_xxxxxxxxxx.csv.gz (151.34 MB)
* my_files_2016090120_20160902_060051_xxxxxxxxxx.csv.gz (129.69 MB)
* my_files_2016090121_20160902_060051_xxxxxxxxxx.csv.gz (151.7 MB)
* my_files_2016090122_20160902_060051_xxxxxxxxxx.csv.gz (346.46 MB)
* my_files_2016090122_20160902_060051_xxxxxxxxxx.csv.gz (222.57 MB)

It run well but there is only a 288.62 MB file in output of this pipeline 
(instead of a 1.5 GB file).

{code:python}
data = (p | 'ReadMyFiles' >> beam.io.ReadFromText(
          "gs://XXXX_folder1/my_files_20160901*.csv.gz",
          skip_header_lines=1,
          compression_type=beam.io.filesystem.CompressionTypes.GZIP
          )
                       | 'SplitLines' >> beam.FlatMap(lambda x: x.split('\n'))
                    )
output = (
          data| "Write" >> beam.io.WriteToText('gs://XXX_folder2/test.csv', 
num_shards=1)
            )
{code}

Dataflow indicate me that the estimated size    of the output after the 
ReadFromText step is 602.29 MB only, which not correspond to any unique input 
file size nor the overall file size matching with the pattern.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to