jiale0402 opened a new issue, #38878:
URL: https://github.com/apache/arrow/issues/38878
### Describe the usage question you have. Please include as many useful
details as possible.
platform:
```
NAME="Ubuntu"
VERSION="23.04 (Lunar Lobster)"
```
pyarrow version: `pyarrow 14.0.1`, `pyarrow-hotfix 0.5`
python version: `Python 3.11.4 (main, Jun 9 2023, 07:59:55) [GCC 12.3.0] on linux`
I have a very large single-column CSV file (about 63 million rows). I was
hoping to create a lazy file streamer that reads one entry from the CSV file at
a time. I know each entry in my file is 12 characters long, so I tried
setting `block_size` to 13 (+1 for the `\n`) with the `pyarrow.csv.open_csv` function:
```python
import pyarrow as pa
import pyarrow.csv as csv

# file: path to the single-column CSV described above
c_options = csv.ConvertOptions(column_types={'dne': pa.float32()})
r_options = csv.ReadOptions(skip_rows_after_names=8200, use_threads=True,
                            column_names=["dne"], block_size=13)
stream = csv.open_csv(file, convert_options=c_options,
                      read_options=r_options)
```
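The stream is then consumed batch by batch, roughly like the sketch below (`process` is just a placeholder for the actual per-batch work, not part of the real code):

```python
# Sketch of how the stream is consumed; `process` is a hypothetical
# placeholder for the per-batch work.
for batch in stream:              # each item is a pyarrow.RecordBatch
    process(batch.column("dne"))  # with block_size=13, roughly one row per batch
```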
This code works as expected, but when I change the
`skip_rows_after_names` parameter of `ReadOptions` to 8300 I start to get
segmentation faults. How do I fix this (or am I using it wrong)? I want to be
able to use only a portion of the file (like from row 98885 to 111200).
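To make that intent concrete, here is a rough sketch (untested at this scale; the `start`/`stop` bounds are just the example rows above) of selecting such a row window by slicing batches while iterating with a larger `block_size`, instead of relying on `skip_rows_after_names`:

```python
import pyarrow as pa
import pyarrow.csv as csv

start, stop = 98885, 111200        # example row window from above
c_options = csv.ConvertOptions(column_types={"dne": pa.float32()})
r_options = csv.ReadOptions(column_names=["dne"], block_size=1 << 20)

stream = csv.open_csv(file, convert_options=c_options, read_options=r_options)

rows_seen = 0
pieces = []
for batch in stream:
    # Keep only the part of this batch that falls inside [start, stop).
    lo = max(start - rows_seen, 0)
    hi = min(stop - rows_seen, batch.num_rows)
    if lo < hi:
        pieces.append(batch.slice(lo, hi - lo))
    rows_seen += batch.num_rows
    if rows_seen >= stop:
        break

window = pa.Table.from_batches(pieces)   # rows start..stop-1
```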
I was able to reproduce this error on another computer with the exact same
platform and versions. The file was created with:
```python
import random

# FILE_LEN is roughly 63 million (the row count mentioned above);
# i is the index of the feature file being written.
with open(f"feature_{i}.csv", "w+") as f:
    for _ in range(FILE_LEN):
        n = random.uniform(-0.5, 0.5)
        nn = str(n)[:12]
        f.write(f"{nn}\n")
```
### Component(s)
Python