[GitHub] [arrow] zijie0 opened a new issue #11718: [CPP] Arrow does not decode partition column name from directory path

GitBox Tue, 16 Nov 2021 18:42:32 -0800


zijie0 opened a new issue #11718:
URL: https://github.com/apache/arrow/issues/11718



   When Delta writes out a partitioned table, both the partition column name 
and its value are URI-encoded in the directory path. For example: 
   
   ```
   sdf = spark.createDataFrame([[1, 10], [2, 20]], schema=['x%20x', 'y'])
   sdf.write.format("delta").partitionBy(["x%20x"]).save("partition_table")
   ```
   
   The above code generates a directory with the following structure:
   
   ```
   ├─x%2520x=1
   ├─x%2520x=2
   └─_delta_log
   ```
   
   When we use PyArrow to read this table, it reports the column name as 
`x%2520x` instead of the original `x%20x`:
   
   ```
   In [3]: ds = dataset(paths, format="parquet", partitioning=partitioning(flavor="hive"))
   
   In [4]: ds.to_table().to_pandas()
   Out[4]:
       y  x%2520x
   0  20        2
   1  10        1
   ```
   
   It seems that the column name is not decoded here: 
https://github.com/apache/arrow/blob/e5f3e04b4b80c9b9c53f1f0f71f39d9f8308dced/cpp/src/arrow/dataset/partition.cc#L593-L596
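
   To illustrate the expected behavior, here is a minimal Python sketch that 
decodes both the key and the value of each hive-style path segment. It is not 
Arrow's C++ implementation; `parse_hive_segment` and `parse_hive_path` are 
hypothetical names, and applying a single `urllib.parse.unquote` pass is an 
assumption about what the decoding should do:

```python
from urllib.parse import unquote


def parse_hive_segment(segment):
    """Split one 'key=value' directory name and URI-decode both halves.

    Returns None for segments that are not hive-style partitions
    (for example '_delta_log' or a data file name).
    """
    key, sep, value = segment.partition("=")
    if not sep:
        return None
    # Decode the key as well as the value; the issue reports that
    # the key is currently left encoded.
    return unquote(key), unquote(value)


def parse_hive_path(path):
    """Collect decoded partition fields from every segment of a path."""
    fields = {}
    for segment in path.strip("/").split("/"):
        parsed = parse_hive_segment(segment)
        if parsed is not None:
            key, value = parsed
            fields[key] = value
    return fields


if __name__ == "__main__":
    print(parse_hive_path("x%2520x=1/part-0.parquet"))
```

   With this decoding, the directory `x%2520x=1` maps back to the original 
column name `x%20x`. Values are returned as strings here; type inference is 
left out of the sketch.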
   
   More context from this delta-rs issue: 
https://github.com/delta-io/delta-rs/issues/495
   
   Could anyone take a look please? Thanks.


