[
https://issues.apache.org/jira/browse/BEAM-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16103624#comment-16103624
]
Guillermo Rodríguez Cano edited comment on BEAM-2490 at 7/27/17 6:09 PM:
-------------------------------------------------------------------------
Thanks [~chamikara] for the tip about the option, and yes, I meant that the
performance was so slow with the DirectRunner that I would never see the job
finish (actually my test with just one compressed file took 'only' four hours
after all...).
The mentioned parameter is quite a discovery, as I haven't seen it documented
anywhere, and when I tried to package Beam the documented way it was not being
accepted. Thanks again.
Now, with Beam's latest HEAD (as of this writing), I can safely confirm that
the glob operator works.
I tried the same compressed files I mentioned before (first just one, though
still using the glob operator, and then the seven aforementioned files) and
they were fully processed, and extremely fast (I am not sure why there is such
a performance difference with my laptop, as Dataflow is autoscaling to only two
workers of n1-standard-4 type).
Also, for the sake of the upcoming Beam release (2.1.0), I ran the same tests
with the current release candidate (RC2), with the same successful results.
So this issue can be closed :)
The performance issue with the DirectRunner can then be addressed in the
corresponding issue (I will help with that if needed; I am not sure whether I
can share the data, [~altay], even with some anonymization on our end, but I
will ask...).
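As a side note for anyone verifying results like this: a quick local sanity check (just a sketch, assuming local copies of the gzipped files; `count_data_lines` is a hypothetical helper, not a Beam API) is to expand the same glob with the standard library and count the data lines that ReadFromText should emit:

```python
import glob
import gzip

def count_data_lines(pattern, skip_header_lines=1):
    """Expand a glob pattern and count lines across all matching gzipped
    files, skipping the given number of header lines per file to mirror
    ReadFromText's skip_header_lines behaviour."""
    total = 0
    for path in sorted(glob.glob(pattern)):
        with gzip.open(path, "rt") as handle:
            for index, _line in enumerate(handle):
                if index >= skip_header_lines:
                    total += 1
    return total
```

If the total here differs from the element count the pipeline reports after the read step, then not every file matching the glob was actually read.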
> ReadFromText function is not taking all data with glob operator (*)
> --------------------------------------------------------------------
>
> Key: BEAM-2490
> URL: https://issues.apache.org/jira/browse/BEAM-2490
> Project: Beam
> Issue Type: Bug
> Components: sdk-py
> Affects Versions: 2.0.0
> Environment: Usage with Google Cloud Platform: Dataflow runner
> Reporter: Olivier NGUYEN QUOC
> Assignee: Chamikara Jayalath
> Fix For: Not applicable
>
>
> I run a very simple pipeline:
> * Read my files from Google Cloud Storage
> * Split with the '\n' char
> * Write it to Google Cloud Storage
> I have 8 files that match with the pattern:
> * my_files_2016090116_20160902_060051_xxxxxxxxxx.csv.gz (229.25 MB)
> * my_files_2016090117_20160902_060051_xxxxxxxxxx.csv.gz (184.1 MB)
> * my_files_2016090118_20160902_060051_xxxxxxxxxx.csv.gz (171.73 MB)
> * my_files_2016090119_20160902_060051_xxxxxxxxxx.csv.gz (151.34 MB)
> * my_files_2016090120_20160902_060051_xxxxxxxxxx.csv.gz (129.69 MB)
> * my_files_2016090121_20160902_060051_xxxxxxxxxx.csv.gz (151.7 MB)
> * my_files_2016090122_20160902_060051_xxxxxxxxxx.csv.gz (346.46 MB)
> * my_files_2016090122_20160902_060051_xxxxxxxxxx.csv.gz (222.57 MB)
> This code should take them all:
> {code:python}
> beam.io.ReadFromText(
>     "gs://XXXX_folder1/my_files_20160901*.csv.gz",
>     skip_header_lines=1,
>     compression_type=beam.io.filesystem.CompressionTypes.GZIP
> )
> {code}
> It runs well, but the output of this pipeline is only a 288.62 MB file
> (instead of a 1.5 GB file).
> The whole pipeline code:
> {code:python}
> data = (p
>         | 'ReadMyFiles' >> beam.io.ReadFromText(
>             "gs://XXXX_folder1/my_files_20160901*.csv.gz",
>             skip_header_lines=1,
>             compression_type=beam.io.filesystem.CompressionTypes.GZIP)
>         | 'SplitLines' >> beam.FlatMap(lambda x: x.split('\n'))
> )
> output = (
>     data | "Write" >> beam.io.WriteToText('gs://XXX_folder2/test.csv',
>                                           num_shards=1)
> )
> {code}
> Dataflow indicates that the estimated size of the output after the
> ReadFromText step is only 602.29 MB, which does not correspond to any single
> input file's size, nor to the overall size of the files matching the pattern.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)