julien richard created BEAM-10573:
-------------------------------------

             Summary: Large CSV files are loaded several times
                 Key: BEAM-10573
                 URL: https://issues.apache.org/jira/browse/BEAM-10573
             Project: Beam
          Issue Type: Bug
          Components: io-py-files
    Affects Versions: 2.22.0
            Reporter: julien richard


I have this small sample:

 
{code:java}
import csv

import apache_beam as beam
import apache_beam.io.filebasedsource


class CsvFileSource(apache_beam.io.filebasedsource.FileBasedSource):
    def read_records(self, file_name, range_tracker):
        with open(file_name, 'r') as file:
            reader = csv.DictReader(file)
            # Printed once per call to read_records, i.e. once per load of the file.
            print("Load CSV file")
            for rec in reader:
                yield rec


if __name__ == '__main__':
    with beam.Pipeline() as p:
        count_feature = (
            p
            | 'create' >> beam.io.Read(CsvFileSource("myFile.csv"))
            | 'count element' >> beam.combiners.Count.Globally()
            | 'Print' >> beam.Map(print)
        )
{code}
 

 

For some reason, if the CSV file is large, it is loaded several times.

For example, a file with 80,000 rows (18.5 MB) is loaded 5 times.
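
A possible explanation, offered as an assumption rather than a confirmed diagnosis: FileBasedSource is splittable by default, so a runner may divide a large file into several byte ranges and call read_records once per range. A read_records that ignores range_tracker would then re-read the whole file on every call. The sketch below simulates that effect with the standard library only; split_ranges, read_ignoring_range and read_honoring_range are hypothetical helpers written for illustration, not Beam APIs.

```python
import csv
import io


def split_ranges(total_size, bundle_size):
    """Hypothetical stand-in for a runner cutting a file into ranges."""
    return [(start, min(start + bundle_size, total_size))
            for start in range(0, total_size, bundle_size)]


def read_ignoring_range(data, start, stop):
    """Mimics the reported read_records: the range is ignored, so every
    call yields every row of the file."""
    yield from csv.DictReader(io.StringIO(data))


def read_honoring_range(data, start, stop):
    """Yields only the rows whose first character falls in [start, stop).

    Simplification: assumes no quoted field contains a newline, and uses
    character offsets in place of byte offsets.
    """
    lines = data.splitlines(keepends=True)
    header = next(csv.reader([lines[0]]))
    offset = len(lines[0])  # position of the first data row
    for line in lines[1:]:
        if start <= offset < stop:
            yield dict(zip(header, next(csv.reader([line]))))
        offset += len(line)


# Small in-memory "file": one header line and ten data rows.
data = "id,name\n" + "".join(f"{i},row{i}\n" for i in range(10))
ranges = split_ranges(len(data), bundle_size=20)

naive = [row for s, e in ranges for row in read_ignoring_range(data, s, e)]
careful = [row for s, e in ranges for row in read_honoring_range(data, s, e)]

print("ranges:", len(ranges))
print("rows when the range is ignored:", len(naive))
print("rows when the range is honored:", len(careful))
```

If this explanation applies, passing splittable=False to the FileBasedSource constructor, for example CsvFileSource("myFile.csv", splittable=False), should make the runner read the file as a single bundle. This has not been checked against 2.22.0.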

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
