I'm working on ARROW-1974
<https://issues.apache.org/jira/browse/ARROW-1974> right now, and it's
turning out to be quite complex because both Arrow and Parquet allow
duplicate column names. Apparently you can also write duplicate column
names to Parquet by way of Spark.

In my opinion, allowing duplicate columns leads to lots of unnecessary
complexity. Pandas allows this, and there are lots of hacks and heuristics
to make it work. For example, if I ask for the "a" column in a Parquet
file that has two columns named "a", which one do I mean?
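
To make the ambiguity concrete, here is a minimal sketch in plain Python. It
does not use the Arrow or Parquet APIs; `find_columns` and `column_names` are
hypothetical names invented for this illustration.

```python
def find_columns(names, wanted):
    """Return every position at which `wanted` appears in `names`."""
    return [i for i, name in enumerate(names) if name == wanted]


# Hypothetical schema in which the name "a" appears twice.
column_names = ["a", "b", "a"]

matches = find_columns(column_names, "a")
print(matches)  # [0, 2]
```

A lookup by name finds two candidates, so the reader has to pick one by some
heuristic, return both, or raise an error.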

I'm not convinced there are use cases that justify the additional
complexity, but I am definitely willing to be persuaded.

Are there any use cases that justify the additional complexity?

If not, I propose that we disallow duplicate column names in the Arrow spec
and enforce this in all supported languages.
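
As a rough sketch of what such a check could look like (plain Python, not an
existing Arrow API; `check_unique_names` is a hypothetical helper), a schema
constructor might reject repeated names up front:

```python
from collections import Counter


def check_unique_names(names):
    """Raise ValueError if any column name occurs more than once."""
    duplicates = sorted(n for n, count in Counter(names).items() if count > 1)
    if duplicates:
        raise ValueError(f"duplicate column names: {duplicates}")


check_unique_names(["a", "b", "c"])  # unique names, so no error is raised
```

Whether the check belongs at schema construction or only at read/write
boundaries is a design choice this sketch leaves open.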

-Phillip
