[ https://issues.apache.org/jira/browse/BEAM-7998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16910622#comment-16910622 ]
Jerome MASSOT commented on BEAM-7998:
-------------------------------------

Hi Pablo,

thanks for taking care of this apparent issue.

For confidentiality reasons, I cannot share the entire code, but this is the snippet where I detected the behavior.

I have a bucket in GCP, and these are the arguments for the runner. The strange behavior occurs with both DirectRunner and DataflowRunner.

    # imports used by the snippets below
    import apache_beam as beam
    from apache_beam.io import fileio

    gcs_folder = 'gs://{}/chunks-dashboard/'.format(bucket_id)
    argv = [
        '--project={}'.format(project_id),
        '--job_name=chunk-dashboard',
        '--save_main_session',
        '--staging_location=' + gcs_folder + 'staging',
        '--temp_location=' + gcs_folder + 'temp',
        '--max_num_workers=10',
        '--autoscaling_algorithm=THROUGHPUT_BASED',
        '--runner=DirectRunner',
        '--setup_file=./setup.py',
        '--machine_type=n1-standard-4'
    ]
    pipeline_options = beam.options.pipeline_options.PipelineOptions(argv)

I use the *.json wildcard as follows:

    # retrieve the path of the chunk folder in the bucket and wildcard the chunk files
    chunks_folder = 'gs://{}/{}'.format(bucket_id, folder_id)
    chunk_files = chunks_folder + '/*.json'

    with beam.Pipeline(options=pipeline_options) as p:
        chunk_content = (p
                         | fileio.MatchFiles(chunk_files)
                         | fileio.ReadMatches()
                         )

When I run this pipeline on a folder that contains a single JSON file, the pipeline reports the match for this one file twice. Strange...

Thanks for your help, and good luck.

Best regards
Jerome

> MatchesFiles or MatchAll seems to return seveval time the same element
> ----------------------------------------------------------------------
>
> Key: BEAM-7998
> URL: https://issues.apache.org/jira/browse/BEAM-7998
> Project: Beam
> Issue Type: Bug
> Components: io-py-files
> Affects Versions: 2.14.0
> Environment: GCP for storage, DirectRunner and DataflowRunner both
> have the problem. PyCharm on Win10 for IDE and dev environment.
> Reporter: Jerome MASSOT
> Assignee: Pablo Estrada
> Priority: Major
>
> Hi team,
> when I use MatchFiles with a wildcard on files located in a GCP bucket, the
> MatchFiles transform returns the same file several times (at least twice).
> I have tried to follow the stack, and I can see that MatchAll is called
> twice when I run the pipeline on a debug project where a single element is
> present in the bucket.
> But I am not good enough to say more than that. Sorry.
> Best regards
> Jerome

--
This message was sent by Atlassian Jira
(v8.3.2#803003)