lgaborini commented on issue #38804:
URL: https://github.com/apache/arrow/issues/38804#issuecomment-1824165163
Updating the issue: the problem appears to be related to partitioning on a field whose values differ only by case, on a case-insensitive filesystem.
The behavior also changes depending on compression and dictionary encoding.
I am using this data frame, partitioned on `var_2`:
``` r
tbl_input <- tibble::tibble(
  var_1 = c("arrow", "arrow", "arrow"),
  var_2 = c("arrow", "arrow", "arroW")
)
tbl_input
#> # A tibble: 3 × 2
#>   var_1 var_2
#>   <chr> <chr>
#> 1 arrow arrow
#> 2 arrow arrow
#> 3 arrow arroW
```
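For context, here is a minimal base-R sketch of the collision I suspect (no arrow involved): the two partition values produce Hive-style directory names that a case-insensitive filesystem treats as the same path.

```r
# Sketch only: simulate the two partition directories that
# write_dataset() would create. On a case-insensitive filesystem
# (Windows NTFS, macOS default), the second dir.create() hits the
# directory created by the first.
d <- tempfile()
dir.create(file.path(d, "var_2=arrow"), recursive = TRUE)
dir.create(file.path(d, "var_2=arroW"))  # warns: directory already exists
list.dirs(d, recursive = FALSE)          # only one directory survives
```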
### No compression, dictionary encoding
Writing and re-reading with these options:
```r
f_dataset_merged <- tempfile()  # dataset destination (a temporary directory)

arrow::write_dataset(
  dataset = tbl_input,
  path = f_dataset_merged,
  partitioning = "var_2",
  compression = "uncompressed",
  use_dictionary = TRUE
)
tbl_written <- arrow::open_dataset(
  sources = f_dataset_merged,
  partitioning = arrow::hive_partition()
)
```
No errors while writing or reading, but results change *from run to run*.
This is the most common output:
```r
tbl_written |>
  dplyr::collect()
#> # A tibble: 1 × 2
#>   var_1 var_2
#>   <chr> <chr>
#> 1 arrow arroW
```
Sometimes I get:
```r
tbl_written |>
  dplyr::collect()
#> # A tibble: 2 × 2
#>   var_1 var_2
#>   <chr> <chr>
#> 1 arrow arroW
#> 2 arrow arroW
```
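A quick diagnostic of my own (plain base R, not an arrow API) makes the loss visible on disk. Since `write_dataset()` names the file inside each partition directory `part-0.parquet` by default, both partitions should end up targeting what the OS considers the same file, which would also explain the corrupted-file errors in the scenarios below.

```r
# List what actually landed on disk after the write above.
# On my Windows machine I expect a single partition directory,
# whose capitalization depends on which partition was created first.
list.files(f_dataset_merged, recursive = TRUE)
```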
### No compression, no dictionary encoding
Writing with these options:
```r
arrow::write_dataset(
  dataset = tbl_input,
  path = f_dataset_merged,
  partitioning = "var_2",
  compression = "uncompressed",
  use_dictionary = FALSE
)
```
Reading sometimes fails outright:
```r
tbl_written <- arrow::open_dataset(
  sources = f_dataset_merged,
  partitioning = arrow::hive_partition()
)
#> Error in `arrow::open_dataset()`:
#> ! IOError: Error creating dataset. Could not read schema from
#>   '/Temp/RtmpEfdPRA/file808821056880/var_2=arroW/part-0.parquet'.
#>   Is this a 'parquet' file?: Could not open Parquet input source
#>   '/Temp/RtmpEfdPRA/file808821056880/var_2=arroW/part-0.parquet':
#>   Couldn't deserialize thrift: TProtocolException: Invalid data
```
Or succeeds, but the next `collect()` call fails:
```r
tbl_written |>
  dplyr::collect()
#> Error in `compute.Dataset()`:
#> ! IOError: Unexpected end of stream
```
Or everything succeeds:
```r
tbl_written |>
  dplyr::collect()
#> # A tibble: 2 × 2
#>   var_1 var_2
#>   <chr> <chr>
#> 1 arrow arroW
#> 2 arrow arroW
```
### Snappy compression, no dictionary encoding
Writing with these options:
```r
arrow::write_dataset(
  dataset = tbl_input,
  path = f_dataset_merged,
  partitioning = "var_2",
  compression = "snappy",
  use_dictionary = FALSE
)
```
This is the most common error:
```r
#> Error in `arrow::open_dataset()`:
#> ! IOError: Error creating dataset. Could not read schema from
#>   '/Temp/RtmpSMXExe/file5d645de7595e/var_2=arroW/part-0.parquet'.
#>   Is this a 'parquet' file?: Could not open Parquet input source
#>   '/Temp/RtmpSMXExe/file5d645de7595e/var_2=arroW/part-0.parquet':
#>   Couldn't deserialize thrift: don't know what type:
```
I can also get:
```r
tbl_written |>
  dplyr::collect()
#> Error in `compute.Dataset()`:
#> ! IOError: Couldn't deserialize thrift: No more data to read.
#> Deserializing page header failed.
```
or:
```r
tbl_written |>
  dplyr::collect()
#> Error in `compute.Dataset()`:
#> ! IOError: Unexpected end of stream
```
Sometimes it even succeeds.
### Snappy compression, dictionary encoding
Writing with these options:
```r
arrow::write_dataset(
  dataset = tbl_input,
  path = f_dataset_merged,
  partitioning = "var_2",
  compression = "snappy",
  use_dictionary = TRUE
)
```
Reading always succeeds, but which rows are read is unpredictable:
```r
tbl_written |>
  dplyr::collect()
#> # A tibble: 2 × 2
#>   var_1 var_2
#>   <chr> <chr>
#> 1 arrow arroW
#> 2 arrow arroW
```
or
```r
tbl_written |>
  dplyr::collect()
#> # A tibble: 1 × 2
#>   var_1 var_2
#>   <chr> <chr>
#> 1 arrow arroW
```
or even
```r
tbl_written |>
  dplyr::collect()
#> # A tibble: 1 × 2
#>   var_1 var_2
#>   <chr> <chr>
#> 1 arrow arrow
```
### Proposals
I would expect:
1. a warning when partitions collide due to a difference in capitalization, or silent merging of the colliding partitions
2. since the partitioning variable is also stored in the Parquet files (as of 14.0.1, right?), the partitioning could serve only to speed up queries, and reading back should still be lossless
3. correct behavior on Linux/macOS, since case sensitivity is not an issue there
If everything must be OS-independent, probably the easiest mitigation is to avoid partition values that differ only in capitalization; a pre-write check like the sketch below would surface them. (I also wonder what happens when characters that are illegal in file paths are used.)
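This is only a sketch; `check_partition_collisions()` is a hypothetical helper of mine, not an arrow function.

```r
# Warn if any values of the partitioning column differ only by case,
# since those would map to colliding directories on a
# case-insensitive filesystem.
check_partition_collisions <- function(data, partitioning) {
  vals   <- unique(as.character(data[[partitioning]]))
  folded <- tolower(vals)
  dupes  <- vals[folded %in% folded[duplicated(folded)]]
  if (length(dupes) > 0) {
    warning(
      "Values of `", partitioning, "` differ only by case: ",
      paste(dupes, collapse = ", "), call. = FALSE
    )
  }
  invisible(dupes)
}

check_partition_collisions(tbl_input, "var_2")
# warns that "arrow" and "arroW" collide
```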