larry77 opened a new issue, #39857:
URL: https://github.com/apache/arrow/issues/39857

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Hello,
   Unfortunately, reproducing this requires a large dataset: according to my tests, the error appears once the number of lines read exceeds 1.6 million.
   
   The data can be downloaded as a compressed file from the link below (there is nothing dangerous in it):
   
   https://e.pcloud.link/publink/show?code=XZqHIeZokLxWCpx940hw3y45fsKqJPAVK0X
   
   I am using a script I have had for quite some time. I want to open the TSV (tab-separated) file I get after decompressing the download and save it as a Parquet file without holding it entirely in memory.
   
   ``` r
   library(arrow)
   #> 
   #> Attaching package: 'arrow'
   #> The following object is masked from 'package:utils':
   #> 
   #>     timestamp
   
   data <- open_dataset("export.tsv",
     format = "tsv",
     skip_rows = 1, 
     schema = schema(
       AID_MEASURE_ID = string(), 
       DATE_CREATED = string(), 
       DATE_GRANTED = string(), 
       AA_PUBLISHED_DATE = string(), 
       SERVER_REF = string(), 
       AM_TITLE = string(), 
       AM_TITLE_EN = string(), 
       STATUS = string(), 
       AM_PROC_TYPE_CD = string(), 
       COFINANCE = string(), 
       OBJECTIVE = string(), 
       OTHER_OBJECTIVE_EN = string(), 
       AID_INSTRUMENT = string(), 
       OTHER_AID_INSTRUMENT_EN = string(), 
       BENEFICIARY_NAME = string(), 
       BENEFICIARY_NAME_ENGLISH = string(), 
       BENEFICIARY_NATIONAL_ID = string(), 
       BENEFICIARY_NAT_ID_TYPE_SD = string(), 
       BENEFICIARY_TYPE_SD = string(), 
       COUNTRY_SD = string(), 
       REGION_SD = string(), 
       SECTOR_SD = string(), 
       GRANTED_AMOUNT_FROM_EUR = double(), 
       NOMINAL_AMOUNT_EUR_FROM = double(), 
       GRANT_RANGE = string(),
       GRANTED_AMOUNT_RANGE_DESC = string(),
       GRANTING_AUTHORITY_NAME = string(), 
       GRANTING_AUTHORITY_NAME_EN = string(), 
       NUTS_CD = string(), 
       GRANTING_AUTHORITY_COUNTRY = string()
     )
   )
   
   write_dataset(
     data,
     format = "parquet",
     path = ".",
     max_rows_per_file = 1e7
   )
   #> Error: Invalid: CSV parser got out of sync with chunker
       
   
   sessionInfo()
   #> R version 4.3.2 (2023-10-31)
   #> Platform: x86_64-pc-linux-gnu (64-bit)
   #> Running under: Debian GNU/Linux 12 (bookworm)
   #> 
   #> Matrix products: default
   #> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.11.0 
   #> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.11.0
   #> 
   #> locale:
   #>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
   #>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
   #>  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
   #>  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
   #>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
   #> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
   #> 
   #> time zone: Europe/Brussels
   #> tzcode source: system (glibc)
   #> 
   #> attached base packages:
   #> [1] stats     graphics  grDevices utils     datasets  methods   base     
   #> 
   #> other attached packages:
   #> [1] arrow_14.0.0.2
   #> 
   #> loaded via a namespace (and not attached):
   #>  [1] vctrs_0.6.4       cli_3.6.1         knitr_1.45        rlang_1.1.2
   #>  [5] xfun_0.41         purrr_1.0.2       styler_1.10.2     generics_0.1.3
   #>  [9] assertthat_0.2.1  glue_1.6.2        bit_4.0.5         htmltools_0.5.7
   #> [13] fansi_1.0.5       rmarkdown_2.25    R.cache_0.16.0    tibble_3.2.1
   #> [17] evaluate_0.23     fastmap_1.1.1     yaml_2.3.7        lifecycle_1.0.4
   #> [21] compiler_4.3.2    dplyr_1.1.3       fs_1.6.3          pkgconfig_2.0.3
   #> [25] R.oo_1.25.0       R.utils_2.12.2    digest_0.6.33     R6_2.5.1
   #> [29] utf8_1.2.4        reprex_2.0.2      tidyselect_1.2.0  pillar_1.9.0
   #> [33] magrittr_2.0.3    R.methodsS3_1.8.2 tools_4.3.2       withr_2.5.2
   #> [37] bit64_4.0.5
   ```
   
   <sup>Created on 2024-01-30 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>
   
   Any idea of what the issue may be? Thanks! 
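   
   In case it helps with triage, here is a minimal base R sketch of a check that could be run on the decompressed file. It rests on an assumption that this report does not confirm: that the error is related to rows with an unexpected number of tab-separated fields or an unbalanced double quote. `find_suspect_lines` is a hypothetical helper written for illustration and is not part of arrow. It reads the file in chunks, so the whole file is never held in memory.
   
   ```r
   # Hypothetical helper (not part of arrow): return the line numbers of a
   # tab-separated file whose field count differs from `n_fields`, or which
   # contain an odd number of double quotes.
   find_suspect_lines <- function(path, n_fields, chunk_size = 100000L) {
     con <- file(path, open = "r")
     on.exit(close(con))
     suspects <- integer(0)
     offset <- 0L
     repeat {
       # Read the next chunk of lines; stop at end of file.
       lines <- readLines(con, n = chunk_size, warn = FALSE)
       if (length(lines) == 0L) break
       # Count tabs and double quotes per line (byte-wise, fixed patterns).
       n_tabs <- lengths(regmatches(
         lines, gregexpr("\t", lines, fixed = TRUE, useBytes = TRUE)
       ))
       n_quotes <- lengths(regmatches(
         lines, gregexpr("\"", lines, fixed = TRUE, useBytes = TRUE)
       ))
       # A line is suspect if fields = tabs + 1 is wrong or quotes are unbalanced.
       bad <- which(n_tabs + 1L != n_fields | n_quotes %% 2L == 1L)
       suspects <- c(suspects, offset + bad)
       offset <- offset + length(lines)
     }
     suspects
   }
   ```
   
   With the schema above, which declares 30 columns, the call would be `find_suspect_lines("export.tsv", n_fields = 30L)`. The returned numbers count physical lines, including the header line.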
   
   
   ### Component(s)
   
   R

