devinjdangelo commented on issue #10712:
URL: https://github.com/apache/datafusion/issues/10712#issuecomment-2139454150
I was unable to reproduce this panic on main or v38. @samuelcolvin, if you
are able to provide more details about the parquet files that are triggering
the issue (schema, sanitized values, directory structure), that may help
reproduce it.
Here is what I tried:
```bash
datafusion-cli
DataFusion CLI v38.0.0
> COPY (values ('1', 'a', 'x'), ('2', 'b', 'y'), ('3', 'c', 'z')) TO
'test_files/scratch/copy/partitioned_table3/' STORED AS parquet PARTITIONED BY
(column1, column3)
;
+-------+
| count |
+-------+
| 3 |
+-------+
1 row(s) fetched.
Elapsed 0.003 seconds.
> CREATE EXTERNAL TABLE validate_partitioned_parquet3 STORED AS PARQUET
LOCATION 'test_files/scratch/copy/partitioned_table3/' PARTITIONED BY
(column1, column3);
0 row(s) fetched.
Elapsed 0.001 seconds.
> COPY validate_partitioned_parquet3
TO 'test_files/scratch/copy/partitioned_table3_rewrite'
PARTITIONED BY (column1, column3)
STORED AS PARQUET;
+-------+
| count |
+-------+
| 3 |
+-------+
1 row(s) fetched.
Elapsed 0.003 seconds.
```
Trying to force dictionary encoding where possible also runs without a
panic:
```bash
DataFusion CLI v38.0.0
> COPY (values
('c', arrow_cast('foo', 'Dictionary(Int32, Utf8)'), arrow_cast('foo2',
'Dictionary(Int32, Utf8)')),
('d', arrow_cast('bar', 'Dictionary(Int32, Utf8)'), arrow_cast('bar2',
'Dictionary(Int32, Utf8)')))
to 'test_files/scratch/copy/part_dict_test' STORED AS PARQUET PARTITIONED BY
(column2, column3);
+-------+
| count |
+-------+
| 2 |
+-------+
1 row(s) fetched.
Elapsed 0.004 seconds.
> CREATE EXTERNAL TABLE dict_partitioned_test STORED AS PARQUET
LOCATION 'test_files/scratch/copy/part_dict_test/' PARTITIONED BY (column2,
column3);
0 row(s) fetched.
Elapsed 0.001 seconds.
> select * from dict_partitioned_test;
+---------+---------+---------+
| column1 | column2 | column3 |
+---------+---------+---------+
| d | bar | bar2 |
| c | foo | foo2 |
+---------+---------+---------+
2 row(s) fetched.
Elapsed 0.002 seconds.
> COPY (select column1, arrow_cast(column2, 'Dictionary(Int32, Utf8)') as
column2,
arrow_cast(column3, 'Dictionary(Int32, Utf8)') as column3 from
dict_partitioned_test)
TO 'test_files/scratch/copy/part_dict_test_rewrite'
PARTITIONED BY (column2, column3)
STORED AS PARQUET;
+-------+
| count |
+-------+
| 2 |
+-------+
1 row(s) fetched.
Elapsed 0.003 seconds.
```
The [line that is
panicking](https://github.com/apache/datafusion/blob/c775e4d6ea6dfe9c26a772b676552b9711004a3d/datafusion/core/src/datasource/file_format/write/demux.rs#L381)
with an index out of bounds could only be reached if one of the partition
arrays extracted from the `RecordBatch` had fewer values than
`RecordBatch::num_rows()`. For plain Utf8 arrays this seems impossible, since
the following code executes first and `array.value(i)` would panic before we
ever got there:
https://github.com/apache/datafusion/blob/c775e4d6ea6dfe9c26a772b676552b9711004a3d/datafusion/core/src/datasource/file_format/write/demux.rs#L339-L341
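To illustrate why, here is a minimal standalone sketch (plain arrow-rs, not
the actual demux code) of that extraction pattern for a Utf8 column:

```rust
use arrow::array::{Array, StringArray};

fn main() {
    let array = StringArray::from(vec!["x", "y", "z"]);
    let num_rows = array.len();

    // Looping over 0..num_rows either pushes exactly num_rows values
    // or panics inside value() on an out-of-bounds index, so the
    // extracted Vec can never silently end up shorter than num_rows.
    let mut partition_values = Vec::with_capacity(num_rows);
    for i in 0..num_rows {
        partition_values.push(array.value(i).to_string());
    }
    assert_eq!(partition_values.len(), num_rows);
}
```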
For dictionary-encoded arrays, I could imagine something in the downcast /
iteration code here producing fewer values. I would have thought that
iterating the downcast array always produces exactly
`RecordBatch::num_rows()` values, but perhaps there is a case where that
assumption does not hold.
https://github.com/apache/datafusion/blob/c775e4d6ea6dfe9c26a772b676552b9711004a3d/datafusion/core/src/datasource/file_format/write/demux.rs#L346-L353
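As a concrete example of how a dictionary column *could* come up short, here
is a minimal standalone sketch (plain arrow-rs, not the demux code itself):
a `DictionaryArray`'s underlying `values()` holds each distinct entry only
once, so iterating it yields fewer items than the logical row count whenever
values repeat, while iterating the downcast typed dictionary yields one value
per row:

```rust
use arrow::array::{Array, DictionaryArray, StringArray};
use arrow::datatypes::Int32Type;

fn main() {
    // Four logical rows, but only two distinct dictionary values.
    let dict: DictionaryArray<Int32Type> =
        vec!["a", "b", "a", "b"].into_iter().collect();
    assert_eq!(dict.len(), 4);

    // The underlying values array holds each distinct entry once, so
    // iterating it yields fewer items than num_rows -- exactly the
    // "short partition array" condition described above.
    assert_eq!(dict.values().len(), 2);

    // Iterating the downcast typed dictionary instead yields one
    // value per logical row.
    let typed = dict.downcast_dict::<StringArray>().unwrap();
    let per_row: Vec<_> = typed.into_iter().flatten().collect();
    assert_eq!(per_row, vec!["a", "b", "a", "b"]);
}
```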
Finally, the arrays constructed above are accessed like this:
https://github.com/apache/datafusion/blob/c775e4d6ea6dfe9c26a772b676552b9711004a3d/datafusion/core/src/datasource/file_format/write/demux.rs#L378-L385
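If one of those arrays were short, the failure would look like this
(a hypothetical, simplified sketch of that access pattern, not the actual
function):

```rust
fn main() {
    let num_rows = 3;
    // Hypothetical inputs: one partition column extracted correctly,
    // one extracted short (e.g. only the distinct dictionary values).
    let all_partition_values = vec![
        vec!["a", "b", "c"],
        vec!["x", "y"],
    ];

    for i in 0..num_rows {
        let mut part_key = Vec::new();
        for vals in &all_partition_values {
            // Panics at i == 2 with "index out of bounds", matching
            // the reported panic.
            part_key.push(vals[i].to_owned());
        }
        let _ = part_key;
    }
}
```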