[ https://issues.apache.org/jira/browse/BEAM-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olivier NGUYEN QUOC updated BEAM-2490:
--------------------------------------
    Description: 
I run a very simple pipeline:
* Read my files from Google Cloud Storage
* Split on the '\n' character
* Write the result to Google Cloud Storage

I have 8 files matching the pattern:
* my_files_2016090116_20160902_060051_xxxxxxxxxx.csv.gz (229.25 MB)
* my_files_2016090117_20160902_060051_xxxxxxxxxx.csv.gz (184.1 MB)
* my_files_2016090118_20160902_060051_xxxxxxxxxx.csv.gz (171.73 MB)
* my_files_2016090119_20160902_060051_xxxxxxxxxx.csv.gz (151.34 MB)
* my_files_2016090120_20160902_060051_xxxxxxxxxx.csv.gz (129.69 MB)
* my_files_2016090121_20160902_060051_xxxxxxxxxx.csv.gz (151.7 MB)
* my_files_2016090122_20160902_060051_xxxxxxxxxx.csv.gz (346.46 MB)
* my_files_2016090122_20160902_060051_xxxxxxxxxx.csv.gz (222.57 MB)

This code should read them all:

{code:python}
beam.io.ReadFromText(
    "gs://XXXX_folder1/my_files_20160901*.csv.gz",
    skip_header_lines=1,
    compression_type=beam.io.filesystem.CompressionTypes.GZIP)
{code}

It runs without errors, but the output of this pipeline is only a 288.62 MB file
(instead of the expected ~1.5 GB).
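
One way to check whether the glob expansion itself is the problem is to list what the pattern actually resolves to from the same SDK. A minimal sketch, assuming the apache_beam FileSystems API is available in this SDK version (the pattern string is the one from above):

{code:python}
from apache_beam.io.filesystems import FileSystems

# List every file the glob resolves to, with its (compressed) size in bytes.
pattern = "gs://XXXX_folder1/my_files_20160901*.csv.gz"
for metadata in FileSystems.match([pattern])[0].metadata_list:
    print(metadata.path, metadata.size_in_bytes)
{code}

If all the expected paths show up here with their expected sizes, the glob expansion is fine and the data loss happens later in the read.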

The whole pipeline code:

{code:python}
data = (
    p
    | 'ReadMyFiles' >> beam.io.ReadFromText(
        "gs://XXXX_folder1/my_files_20160901*.csv.gz",
        skip_header_lines=1,
        compression_type=beam.io.filesystem.CompressionTypes.GZIP)
    | 'SplitLines' >> beam.FlatMap(lambda x: x.split('\n')))

output = (
    data
    | "Write" >> beam.io.WriteToText('gs://XXX_folder2/test.csv',
                                     num_shards=1))
{code}
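
As a diagnostic/workaround, the same data can be read with one ReadFromText per matched file and a Flatten, bypassing the glob handling inside a single source. A rough sketch under the same assumptions as above (FileSystems.match available, the 'Read %d' labels are just illustrative, and `p` is the pipeline object from the code above):

{code:python}
import apache_beam as beam
from apache_beam.io.filesystems import FileSystems

pattern = "gs://XXXX_folder1/my_files_20160901*.csv.gz"
paths = [m.path for m in FileSystems.match([pattern])[0].metadata_list]

# One ReadFromText per file, then Flatten the per-file PCollections into one.
per_file = [
    p | 'Read %d' % i >> beam.io.ReadFromText(
        path,
        skip_header_lines=1,
        compression_type=beam.io.filesystem.CompressionTypes.GZIP)
    for i, path in enumerate(paths)]
data = per_file | 'FlattenFiles' >> beam.Flatten()
{code}

If this version writes the full ~1.5 GB while the glob version writes only 288.62 MB, that points at the glob handling of the text source rather than at the decompression itself.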

Dataflow indicates that the estimated output size after the ReadFromText step is
only 602.29 MB, which corresponds neither to any single input file size nor to
the total size of the files matching the pattern.
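
A quick way to quantify what the read step actually emits is to count the records on the `data` PCollection from the pipeline above and compare that with the expected total line count of the input files; a minimal sketch (the output path is only illustrative):

{code:python}
# Count the records produced by the read, to compare with the expected total.
count = (
    data
    | 'CountRecords' >> beam.combiners.Count.Globally()
    | 'FormatCount' >> beam.Map(lambda c: 'records read: %d' % c)
    | 'WriteCount' >> beam.io.WriteToText('gs://XXX_folder2/record_count'))
{code}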


> ReadFromText function is not taking all data with glob operator (*) 
> --------------------------------------------------------------------
>
>                 Key: BEAM-2490
>                 URL: https://issues.apache.org/jira/browse/BEAM-2490
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py
>    Affects Versions: 2.0.0
>         Environment: Usage with Google Cloud Platform: Dataflow runner
>            Reporter: Olivier NGUYEN QUOC
>            Assignee: Ahmet Altay
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
