[jira] [Comment Edited] (BEAM-2490) ReadFromText function is not taking all data with glob operator (*)

JIRA Thu, 27 Jul 2017 03:35:18 -0700

    [ 
https://issues.apache.org/jira/browse/BEAM-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16103045#comment-16103045
 ]


Guillermo Rodríguez Cano edited comment on BEAM-2490 at 7/27/17 10:34 AM:
--------------------------------------------------------------------------

Hello again, and a quick update,

* OS: Mac OS X Sierra 10.12.6
* Apache Beam: 2.2.0dev (aka HEAD at master branch as of 8 hours ago...)
* Python: 2.7.13 
* Runner: DirectRunner (so far given the "results")

I ran pretty much the same experiment as at the end of June (described here: 
https://issues.apache.org/jira/browse/BEAM-2490?focusedCommentId=16063224), this 
time with the HEAD of the master branch of the Apache Beam repository. 
Unfortunately the outcome is the same so far: no results.

My laptop ran this all night, and after 8 hours the job is still not finished 
(a 'job' of 8 gzipped JSON files, 200-300 MB each when compressed) and has 
produced no output. I also ran the same experiment with only one file in the 
subdirectory where I use the glob operator. It is still running, and although I 
got some output, I don't think it is OK that it takes more than 3 hours to 
process just one file...
Since these tests haven't finished, I couldn't test on Dataflow yet. Besides, I 
still haven't figured out how to package the HEAD, or a tag for that matter, of 
Beam for Dataflow. No matter how I try, I always get something along these lines: 
{{Could not find a version that satisfies the requirement apache-beam==2.1.0 
(from versions: 0.6.0, 2.0.0)}} Suggestions?

So unfortunately I can't confirm that this issue is really resolved. I don't 
think it is related to https://issues.apache.org/jira/browse/BEAM-2497; it 
looks more related to https://issues.apache.org/jira/browse/BEAM-2531.
I suspect all files are read (so the glob operator likely works), but because 
decompression is so slow we can't know that for sure.
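
One way to separate the two suspicions above (glob matching versus 
decompression speed) is to count lines outside of Beam and compare the totals 
with what the pipeline eventually writes. The following is only a minimal 
sketch that assumes local copies of the gzipped files; the function names 
count_lines_per_file and total_lines are hypothetical helpers, not part of 
Beam, and the sketch uses only the Python standard library.

```python
import glob
import gzip
import time


def count_lines_per_file(pattern):
    """Count decompressed lines in every gzip file matching a local glob.

    Returns a dict mapping path -> (line_count, seconds_taken).
    Assumption: the files are available on the local filesystem.
    """
    results = {}
    for path in sorted(glob.glob(pattern)):
        start = time.time()
        lines = 0
        # gzip.open streams the file, so memory use stays small even for
        # archives of a few hundred MB.
        with gzip.open(path, "rb") as handle:
            for _ in handle:
                lines += 1
        results[path] = (lines, time.time() - start)
    return results


def total_lines(results):
    """Sum the per-file line counts returned by count_lines_per_file."""
    return sum(lines for lines, _ in results.values())
```

Calling count_lines_per_file with the same pattern used in the pipeline (on a 
local copy of the data) gives a per-file baseline for both the expected number 
of records and the time plain gzip decompression takes on the same machine.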



> ReadFromText function is not taking all data with glob operator (*) 
> --------------------------------------------------------------------
>
>                 Key: BEAM-2490
>                 URL: https://issues.apache.org/jira/browse/BEAM-2490
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py
>    Affects Versions: 2.0.0
>         Environment: Usage with Google Cloud Platform: Dataflow runner
>            Reporter: Olivier NGUYEN QUOC
>            Assignee: Chamikara Jayalath
>             Fix For: Not applicable
>
>
> I run a very simple pipeline:
> * Read my files from Google Cloud Storage
> * Split on the '\n' character
> * Write the result to Google Cloud Storage
> I have 8 files that match the pattern:
> * my_files_2016090116_20160902_060051_xxxxxxxxxx.csv.gz (229.25 MB)
> * my_files_2016090117_20160902_060051_xxxxxxxxxx.csv.gz (184.1 MB)
> * my_files_2016090118_20160902_060051_xxxxxxxxxx.csv.gz (171.73 MB)
> * my_files_2016090119_20160902_060051_xxxxxxxxxx.csv.gz (151.34 MB)
> * my_files_2016090120_20160902_060051_xxxxxxxxxx.csv.gz (129.69 MB)
> * my_files_2016090121_20160902_060051_xxxxxxxxxx.csv.gz (151.7 MB)
> * my_files_2016090122_20160902_060051_xxxxxxxxxx.csv.gz (346.46 MB)
> * my_files_2016090122_20160902_060051_xxxxxxxxxx.csv.gz (222.57 MB)
> This code should take them all:
> {code:python}
> beam.io.ReadFromText(
>       "gs://XXXX_folder1/my_files_20160901*.csv.gz",
>       skip_header_lines=1,
>       compression_type=beam.io.filesystem.CompressionTypes.GZIP
>       )
> {code}
> It runs without errors, but the pipeline outputs only a 288.62 MB file 
> (instead of a 1.5 GB file).
> The whole pipeline code:
> {code:python}
> import apache_beam as beam
>
> # p is assumed to be an existing beam.Pipeline
> data = (
>     p
>     | 'ReadMyFiles' >> beam.io.ReadFromText(
>         "gs://XXXX_folder1/my_files_20160901*.csv.gz",
>         skip_header_lines=1,
>         compression_type=beam.io.filesystem.CompressionTypes.GZIP)
>     | 'SplitLines' >> beam.FlatMap(lambda x: x.split('\n'))
> )
> output = (
>     data
>     | 'Write' >> beam.io.WriteToText(
>         'gs://XXX_folder2/test.csv', num_shards=1)
> )
> {code}
> Dataflow tells me that the estimated size of the output after the 
> ReadFromText step is only 602.29 MB, which does not correspond to any single 
> input file size nor to the total size of the files matching the pattern.
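
As a quick check of the sizes quoted in the description above, the snippet 
below sums the eight compressed file sizes and compares the two reported 
figures against that total. It is only an arithmetic sketch: the numbers are 
copied from the description, the variable names are made up for illustration, 
and the ratios compare uncompressed output with compressed input, so they are 
a rough indication at best.

```python
# Compressed sizes in MB, copied from the file list in the issue description.
compressed_mb = [229.25, 184.1, 171.73, 151.34, 129.69, 151.7, 346.46, 222.57]

# Figures reported in the description (both refer to uncompressed data).
observed_output_mb = 288.62
estimated_after_read_mb = 602.29

total_mb = sum(compressed_mb)
print("files: %d, total compressed: %.2f MB" % (len(compressed_mb), total_mb))
# prints: files: 8, total compressed: 1586.84 MB

# Neither reported figure equals the size of any single input file.
print("output matches a single file size: %s"
      % (observed_output_mb in compressed_mb))
print("estimate matches a single file size: %s"
      % (estimated_after_read_mb in compressed_mb))

# Share of the compressed total that each reported figure represents.
print("output / total compressed: %.2f" % (observed_output_mb / total_mb))
print("estimate / total compressed: %.2f" % (estimated_after_read_mb / total_mb))
```

The compressed total is consistent with the "1.5 GB" mentioned in the 
description. Since the pipeline writes uncompressed text, a complete run would 
normally be expected to produce more than that, not less.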



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
