theelderbeever opened a new issue, #7860:
URL: https://github.com/apache/arrow-datafusion/issues/7860

   ### Describe the bug
   
After registering a parquet table with a partition column, I write a dataframe to that table and then attempt to read the data back. This fails with the error `index out of bounds: the len is 0 but the index is 0`. Inspecting the table path shows that a parquet file was written. That file is readable, but when read directly it still contains the partition column, which to my understanding should not be there if hive partitioning is used. Additionally, the table directory contains no partition subdirectories; the parquet file sits at the top level.
   
   ```console
   ./data
   └── metrics
       └── hUnp6bKNqkJv4VHK_0.parquet
   ```
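
   For contrast, with hive-style partitioning on `time_bucket` I would have expected a layout along these lines (directory and file names are illustrative):
   
   ```console
   ./data
   └── metrics
       ├── time_bucket=2023-10-01T00:00:00
       │   └── <file>.zstd.parquet
       └── time_bucket=2023-10-02T00:00:00
           └── <file>.zstd.parquet
   ```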
   
   Also, two things of note:
   - the compression codec is not included in the file suffix, which is generally expected in the Spark world
   - I believe the standard for Spark is that the filename is actually a uuid
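
   For reference, Spark's default writer produces names along these lines (the uuid shown is a placeholder):
   
   ```console
   part-00000-<uuid>-c000.zstd.parquet
   ```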
   
   ### To Reproduce
   
   This code omits some supporting methods (the `VectorMetric` type and its helpers), but it captures the gist of where the problem is.
   
   ```rust
    // Imports assumed for this snippet; `VectorMetric` is a user-defined type
    // (not shown) that provides the `fields()` and `to_record_batch()` helpers.
    use datafusion::arrow::datatypes::{DataType, Schema, TimeUnit};
    use datafusion::arrow::util::pretty::pretty_format_batches;
    use datafusion::common::parsers::CompressionTypeVariant;
    use datafusion::dataframe::DataFrameWriteOptions;
    use datafusion::prelude::*;

    let records: Vec<VectorMetric> = serde_json::from_str(json).unwrap();
    let batch = VectorMetric::to_record_batch(records);

    // Register the target table with `time_bucket` as a hive partition column.
    let schema = Schema::new(VectorMetric::fields());
    let ctx = SessionContext::new();
    ctx.register_parquet(
        "metrics",
        "data/metrics",
        ParquetReadOptions::default()
            .table_partition_cols(vec![(
                "time_bucket".to_string(),
                DataType::Timestamp(TimeUnit::Second, None),
            )])
            .schema(&schema),
    )
    .await?;
    ctx.register_batch("batch", batch)?;

    let write_options =
        DataFrameWriteOptions::default().with_compression(CompressionTypeVariant::ZSTD);

    // Derive the partition column and write into the registered table.
    ctx.sql("SELECT *, DATE_TRUNC('DAY', timestamp) AS time_bucket FROM batch")
        .await?
        .write_table("metrics", write_options)
        .await?;

    // Reading back fails with: index out of bounds: the len is 0 but the index is 0
    let df = ctx.sql("SELECT * FROM metrics").await?.collect().await?;
    println!("{}", pretty_format_batches(&df).unwrap());
   ```
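
   As a sanity check, pointing `read_parquet` directly at the emitted file (the path from my run above) shows that `time_bucket` was materialized into the file itself rather than encoded in the directory path. A minimal sketch of that check, reusing the `ctx` from the repro:
   
   ```rust
    // Hypothetical verification: read the written file directly.
    // Its output still includes the `time_bucket` column, which hive
    // partitioning would normally encode in the directory name instead.
    let direct = ctx
        .read_parquet(
            "data/metrics/hUnp6bKNqkJv4VHK_0.parquet",
            ParquetReadOptions::default(),
        )
        .await?;
    direct.show().await?;
   ```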
   
   ### Expected behavior
   
   - Tables registered with partition columns should be automatically partitioned when written to
   - The compression codec should be included in the parquet file suffix
   
   ### Additional context
   
   I can sanitize data and send the rest of the example if needed.

