Re: [I] `COPY ... PARTITIONED BY` with parquet causes "out of bounds" panic [datafusion]

via GitHub Thu, 30 May 2024 05:32:36 -0700


devinjdangelo commented on issue #10712:
URL: https://github.com/apache/datafusion/issues/10712#issuecomment-2139454150


   I was unable to reproduce this panic on main or v38. @samuelcolvin, if you 
can share more details about the parquet files that trigger the issue 
(schema, sanitized values, directory structure), that may help us 
reproduce it.  
   
   Here is what I tried to reproduce:
   
   ```bash
   datafusion-cli
   DataFusion CLI v38.0.0
   > COPY (values ('1', 'a', 'x'), ('2', 'b', 'y'), ('3', 'c', 'z'))
   TO 'test_files/scratch/copy/partitioned_table3/'
   STORED AS parquet
   PARTITIONED BY (column1, column3);
   +-------+
   | count |
   +-------+
   | 3     |
   +-------+
   1 row(s) fetched. 
   Elapsed 0.003 seconds.
   
   > CREATE EXTERNAL TABLE validate_partitioned_parquet3 STORED AS PARQUET 
   LOCATION 'test_files/scratch/copy/partitioned_table3/' PARTITIONED BY (column1, column3);
   0 row(s) fetched. 
   Elapsed 0.001 seconds.
   
   > COPY validate_partitioned_parquet3
   TO 'test_files/scratch/copy/partitioned_table3_rewrite'
   PARTITIONED BY (column1, column3)
   STORED AS PARQUET;
   +-------+
   | count |
   +-------+
   | 3     |
   +-------+
   1 row(s) fetched. 
   Elapsed 0.003 seconds.
   ```
   Forcing dictionary encoding where possible also runs without a panic:
   
   ```bash
   DataFusion CLI v38.0.0
   > COPY (values 
   ('c', arrow_cast('foo', 'Dictionary(Int32, Utf8)'), arrow_cast('foo2', 'Dictionary(Int32, Utf8)')), 
   ('d', arrow_cast('bar', 'Dictionary(Int32, Utf8)'), arrow_cast('bar2', 'Dictionary(Int32, Utf8)'))) 
   to 'test_files/scratch/copy/part_dict_test' STORED AS PARQUET PARTITIONED BY (column2, column3);
   +-------+
   | count |
   +-------+
   | 2     |
   +-------+
   1 row(s) fetched. 
   Elapsed 0.004 seconds.
   
   > CREATE EXTERNAL TABLE dict_partitioned_test STORED AS PARQUET 
   LOCATION 'test_files/scratch/copy/part_dict_test/' PARTITIONED BY (column2, column3);
   0 row(s) fetched. 
   Elapsed 0.001 seconds.
   
   > select * from dict_partitioned_test;
   +---------+---------+---------+
   | column1 | column2 | column3 |
   +---------+---------+---------+
   | d       | bar     | bar2    |
   | c       | foo     | foo2    |
   +---------+---------+---------+
   2 row(s) fetched. 
   Elapsed 0.002 seconds.
   
   > COPY (select column1, arrow_cast(column2, 'Dictionary(Int32, Utf8)') as column2, 
   arrow_cast(column3, 'Dictionary(Int32, Utf8)') as column3 from dict_partitioned_test)
   TO 'test_files/scratch/copy/part_dict_test_rewrite'
   PARTITIONED BY (column2, column3)
   STORED AS PARQUET;
   +-------+
   | count |
   +-------+
   | 2     |
   +-------+
   1 row(s) fetched. 
   Elapsed 0.003 seconds.
   ```
   
   The [line that is panicking](https://github.com/apache/datafusion/blob/c775e4d6ea6dfe9c26a772b676552b9711004a3d/datafusion/core/src/datasource/file_format/write/demux.rs#L381)
 with an index out of bounds can only do so if one of the partition arrays 
extracted from the `RecordBatch` holds fewer values than 
`RecordBatch::num_rows()`.
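
To make that failure mode concrete, here is a minimal std-only sketch. It is not the code in `demux.rs`: `partition_keys` and `checked_partition_keys` are hypothetical names, and plain `Vec<&str>` columns stand in for the extracted partition arrays. It only shows that indexing every partition column by row number panics as soon as one column is shorter than the row count.

```rust
/// Hypothetical stand-in for the per-row partition key construction.
/// `partition_columns[j][i]` is the value of partition column `j` at row `i`.
fn partition_keys(partition_columns: &[Vec<&str>], num_rows: usize) -> Vec<Vec<String>> {
    let mut keys = Vec::with_capacity(num_rows);
    for i in 0..num_rows {
        let mut key = Vec::with_capacity(partition_columns.len());
        for column in partition_columns {
            // Direct indexing panics with "index out of bounds" when the
            // column holds fewer values than `num_rows`.
            key.push(column[i].to_string());
        }
        keys.push(key);
    }
    keys
}

/// Checked variant (an assumption for illustration): report the short
/// column as an error instead of panicking.
fn checked_partition_keys(
    partition_columns: &[Vec<&str>],
    num_rows: usize,
) -> Result<Vec<Vec<String>>, String> {
    for (j, column) in partition_columns.iter().enumerate() {
        if column.len() != num_rows {
            return Err(format!(
                "partition column {} has {} values, expected {}",
                j,
                column.len(),
                num_rows
            ));
        }
    }
    Ok(partition_keys(partition_columns, num_rows))
}
```

A length check like the one in `checked_partition_keys` would turn the panic into an error naming the offending column, which might also help narrow down which partition column is short.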
   
   For plain Utf8 arrays, this seems impossible, because the following code 
runs first and `array.value(i)` would panic there instead:
   
   
https://github.com/apache/datafusion/blob/c775e4d6ea6dfe9c26a772b676552b9711004a3d/datafusion/core/src/datasource/file_format/write/demux.rs#L339-L341
   
   For dictionary-encoded arrays, I could imagine something in the downcast / 
iteration code here producing fewer values. I expected iteration over the 
downcast array to always yield exactly `RecordBatch::num_rows()` values, but 
perhaps there is a case where it does not.
   
   
https://github.com/apache/datafusion/blob/c775e4d6ea6dfe9c26a772b676552b9711004a3d/datafusion/core/src/datasource/file_format/write/demux.rs#L346-L353
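
As a purely illustrative sketch of how a dictionary path could come up short, consider a simplified dictionary column. `DictColumn`, `decode_by_keys`, and `dictionary_values` are hypothetical and do not mirror the arrow or DataFusion APIs, and this is not a claim about what the linked code does. Decoding through the keys yields one value per row, while walking the dictionary's distinct values yields one entry per distinct value.

```rust
/// Hypothetical, simplified dictionary-encoded string column:
/// `keys[i]` is the index into `values` for row `i`.
struct DictColumn {
    keys: Vec<usize>,
    values: Vec<&'static str>,
}

/// Decoding through the keys yields exactly one value per row.
fn decode_by_keys(col: &DictColumn) -> Vec<&'static str> {
    col.keys.iter().map(|&k| col.values[k]).collect()
}

/// Returning the dictionary's distinct values yields one entry per
/// distinct value, which can be fewer than the number of rows.
fn dictionary_values(col: &DictColumn) -> Vec<&'static str> {
    col.values.clone()
}
```

With four rows and two distinct values, `decode_by_keys` returns four entries and `dictionary_values` returns two, so any later per-row indexing into the second result would go out of bounds.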
   
   Finally, the arrays constructed above are accessed like this:
   
   
https://github.com/apache/datafusion/blob/c775e4d6ea6dfe9c26a772b676552b9711004a3d/datafusion/core/src/datasource/file_format/write/demux.rs#L378-L385
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

