julien richard created BEAM-10573:
-------------------------------------
Summary: Size CSV file are loaded several times
Key: BEAM-10573
URL: https://issues.apache.org/jira/browse/BEAM-10573
Project: Beam
Issue Type: Bug
Components: io-py-files
Affects Versions: 2.22.0
Reporter: julien richard
I have this small sample:
{code:python}
import apache_beam as beam
import apache_beam.io.filebasedsource
import csv


class CsvFileSource(apache_beam.io.filebasedsource.FileBasedSource):
    def read_records(self, file_name, range_tracker):
        with open(file_name, 'r') as file:
            reader = csv.DictReader(file)
            print("Load CSV file")
            for rec in reader:
                yield rec


if __name__ == '__main__':
    with beam.Pipeline() as p:
        count_feature = (p
            | 'create' >> beam.io.Read(CsvFileSource("myFile.csv"))
            | 'count element' >> beam.combiners.Count.Globally()
            | 'Print' >> beam.Map(print)
        )
{code}
For some reason, if the CSV file is large, it is loaded several times.
For example, a file with 80,000 rows (18.5 MB) is loaded 5 times.
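One possible explanation, offered as a guess rather than a confirmed diagnosis: FileBasedSource is splittable by default, so a runner may divide a large file into several byte ranges and call read_records once per range. The sample never consults range_tracker, so each such call would read the whole file again. The stdlib-only sketch below imitates that behaviour. The helpers split_ranges, read_ignoring_range, read_honoring_range and demo are hypothetical names made up for illustration, not Beam APIs, and the plain line-based reading assumes no quoted newlines inside CSV fields.

```python
import os
import tempfile


def split_ranges(size, num_splits):
    """Divide [0, size) into contiguous byte ranges (hypothetical helper)."""
    step = max(1, -(-size // num_splits))  # ceiling division, never zero
    return [(start, min(start + step, size)) for start in range(0, size, step)]


def read_ignoring_range(path, start, stop):
    """Mimics the sample: the range is ignored, so every call yields all lines."""
    with open(path, 'rb') as f:
        for line in f:
            yield line


def read_honoring_range(path, start, stop):
    """Yield only the lines whose first byte lies in [start, stop)."""
    with open(path, 'rb') as f:
        if start > 0:
            # Step back one byte and skip to the end of that line. If the
            # previous byte is a newline, this lands exactly on `start`;
            # otherwise it skips a line that belongs to the previous range.
            f.seek(start - 1)
            f.readline()
        while True:
            pos = f.tell()
            if pos >= stop:
                break  # the next line starts in a later range
            line = f.readline()
            if not line:
                break  # end of file
            yield line


def demo(num_rows=1000, num_splits=5):
    """Return (number of ranges, lines read naively, lines read per range)."""
    with tempfile.NamedTemporaryFile('wb', suffix='.csv', delete=False) as tmp:
        for i in range(num_rows):
            tmp.write(b'%d,value%d\n' % (i, i))
        path = tmp.name
    try:
        ranges = split_ranges(os.path.getsize(path), num_splits)
        naive = sum(1 for r in ranges for _ in read_ignoring_range(path, *r))
        ranged = sum(1 for r in ranges for _ in read_honoring_range(path, *r))
        return len(ranges), naive, ranged
    finally:
        os.remove(path)


if __name__ == '__main__':
    print(demo())
```

If this guess is right, the simplest workaround in Beam is probably to pass splittable=False to the FileBasedSource constructor, for example CsvFileSource("myFile.csv", splittable=False), so that the file is read as a single unit. The alternative is to make read_records claim record positions through range_tracker, which is harder to do correctly with csv.DictReader because it buffers its input.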
--
This message was sent by Atlassian Jira
(v8.3.4#803005)