nchammas commented on issue #1894:
URL: https://github.com/apache/iceberg/issues/1894#issuecomment-774284421


   Hmm, I guess my mental model of how Iceberg works is incorrect.
   
   What does it mean for the `partitionBy()` columns to "not match the table"? 
I thought that `partitionBy()` was an output directive, allowing me to read in 
a table partitioned one way and write it out partitioned a different way.
   
   I'm not using any catalogs here. The `partition_balanced_data` DataFrame was 
read directly from an S3 path. There are 3 partitioning columns on the original 
data. I am attempting to write this data out to a separate location on S3 with 
only 2 partitioning columns.
   
   So the partitioning columns on the output _do_ differ from those of the source 
table, but I'm confused as to why this should matter to Iceberg.
   
   In other words, what's wrong with this contrived example?
   
   ```python
   # Create a small DataFrame and write it out partitioned by three columns.
   data = spark.createDataFrame([(1, 2, 3, 4)], schema=['a', 'b', 'c', 'd'])
   data.write.partitionBy('a', 'b', 'c').parquet('./data')

   # Read it back and write it as an Iceberg table partitioned by only two.
   data = spark.read.parquet('./data')
   data.write.format('iceberg').partitionBy('a', 'b').save('./data-iceberg')
   ```
   
   This example code actually works, but I gather from your comment that there 
may be something incorrect about it. It is roughly what I was trying to do when 
I got the stack trace posted above.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
