New submission from Marc Garcia:
I'm using the csv module from the Python standard library to read a 1.4 GB
file with 11,157,064 rows. The file is the GeoNames dataset for all
countries, which can be freely downloaded [1].
I'm using this code to read it:
import csv

with open('allCountries.txt', 'r') as fd:
    reader = csv.reader(fd, delimiter='\t')
    for i, row in enumerate(reader):
        pass
    print(i + 1)            # prints 10381963
    print(reader.line_num)  # prints 11157064
For some reason, around 7% of the rows in the file are skipped. The skipped
rows don't have anything special about them (most of them are pure ASCII,
even though the file is UTF-8).
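One way to check whether those rows are merged rather than dropped is to
track line_num across iterations: every record in this dataset should
occupy exactly one physical line, so a jump of more than one means the
reader consumed several lines for a single record. A minimal sketch (the
prev/merged names are my own):

import csv

with open('allCountries.txt', 'r') as fd:
    reader = csv.reader(fd, delimiter='\t')
    prev = 0
    merged = 0
    for row in reader:
        # line_num counts physical lines, so a jump greater than one
        # means this record swallowed extra lines of the file
        if reader.line_num - prev > 1:
            merged += 1
        prev = reader.line_num
    print(merged)  # number of records spanning multiple lines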
If I create a new file containing only the skipped rows and read it in the
same way, around 30% of its rows are skipped. So many rows that weren't
returned by the iterator as part of the bigger file are returned now.
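For reference, a sketch of how such a file of skipped rows can be built,
by comparing the physical lines the reader accounts for against the full
set of lines in the file (the seen set and the skipped.txt name are
arbitrary choices of mine):

import csv

# Record the physical line on which each returned record ends, then
# write out every line the reader never accounted for.
seen = set()
with open('allCountries.txt', 'r') as fd:
    reader = csv.reader(fd, delimiter='\t')
    for row in reader:
        seen.add(reader.line_num)

with open('allCountries.txt', 'r') as src, \
        open('skipped.txt', 'w') as dst:
    for n, line in enumerate(src, start=1):
        if n not in seen:
            dst.write(line)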
Note that the line_num attribute reports the right number of physical
lines. Also note that if I remove the delimiter parameter (tab) so the
reader falls back to the default comma, the iteration doesn't skip any
rows.
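My guess (unverified) is that quote handling is involved: with the default
dialect, a stray double quote in a field could make the reader join
several physical lines into one record, which would explain why line_num
is right while the row count is short. Disabling quote processing should
confirm or rule that out:

import csv

# If quoting is the culprit, the two printed numbers should match
# once quote processing is turned off (just a guess on my part).
with open('allCountries.txt', 'r') as fd:
    reader = csv.reader(fd, delimiter='\t', quoting=csv.QUOTE_NONE)
    for i, row in enumerate(reader):
        pass
    print(i + 1)
    print(reader.line_num)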
I checked what I think is the relevant part of the code [2], but I
couldn't see anything that could cause this bug.
1. http://download.geonames.org/export/dump/allCountries.zip
2. https://hg.python.org/cpython/file/tip/Modules/_csv.c#l787
----------
components: Library (Lib)
messages: 280323
nosy: datapythonista
priority: normal
severity: normal
status: open
title: csv reader losing rows with big files and tab delimiter
versions: Python 3.5
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue28642>
_______________________________________