I'm working on ARROW-1974 <https://issues.apache.org/jira/browse/ARROW-1974> right now, and it's turning out to be quite complex because both Arrow and Parquet allow duplicate column names. Apparently you can also write duplicate column names to Parquet from Spark.
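To make the problem concrete, here is a minimal sketch in plain Python of what a by-name lookup has to deal with once names can repeat. It does not use the Arrow or Parquet APIs; `find_column` and the list of names are hypothetical, and raising on ambiguity is only one possible policy, not what any existing implementation does.

```python
from typing import List


def find_column(names: List[str], wanted: str) -> int:
    """Return the position of `wanted`, refusing to guess if it is ambiguous.

    Hypothetical helper for illustration only; not part of Arrow or Parquet.
    """
    # Collect every position whose name matches, not just the first one.
    matches = [i for i, name in enumerate(names) if name == wanted]
    if not matches:
        raise KeyError(f"no column named {wanted!r}")
    if len(matches) > 1:
        # With duplicates there is no single right answer, so fail loudly.
        raise ValueError(f"column name {wanted!r} is ambiguous: positions {matches}")
    return matches[0]


# A schema with a duplicated name, as Arrow and Parquet currently permit.
names = ["a", "b", "a"]

print(find_column(names, "b"))  # unique name, resolves to one position
try:
    find_column(names, "a")
except ValueError as exc:
    print(exc)
```

Any other policy (first match, last match, return all matches) has to be chosen, documented, and kept consistent across every language binding, which is where the extra complexity comes from.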
In my opinion, allowing duplicate column names leads to a lot of unnecessary complexity. Pandas allows this, and there are lots of hacks and heuristics to make it work. For example, if I ask for the "a" column in a Parquet file, which one do I mean? I'm not convinced there are use cases that justify the additional complexity, but I am definitely willing to be convinced. Does anyone have such a use case? If not, I propose that we disallow duplicate column names in the Arrow spec and implement this behavior in all supported languages.

-Phillip
