Hi Dongjoon,

Thanks for the info. Unfortunately I did not find a way to fix the issue without forcing CONVERT_METASTORE_ORC (sketched below) or changing the ORC reader implementation. I'm closing the PR, as it was only used to demonstrate the root cause.

Best regards,
Mark
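P.S. For anyone following along: the workaround boils down to a single flag. A minimal spark-shell sketch; `orc_order_test` is a placeholder table name, not from the actual PR tests:

    // Force Spark to convert metastore ORC tables to the native
    // OrcFileFormat read path (RelationConversions / convertToLogicalRelation)
    // instead of going through the hive-exec 1.2.1 reader.
    spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")

    // With the flag on, the columns come back in the declared order.
    spark.sql("SELECT * FROM orc_order_test").show()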
On Tue, Nov 14, 2017 at 6:58 PM, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
> Hi, Mark.
>
> That is one of the reasons why I left it out of the previous PR (below);
> I'm focusing on the second approach: use OrcFileFormat with
> convertMetastoreOrc.
>
> https://github.com/apache/spark/pull/19470
> [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark schema to read ORC
> table instead of ORC file schema
>
> With `convertMetastoreOrc=true`, Spark 2.3 will become more stable and
> faster. It is also the default way Spark handles Parquet.
>
> BTW, thank you for looking at SPARK-22267. So far, I'm not looking at that
> issue. If we can get a fix for SPARK-22267 into Spark 2.3, it would be
> great!
>
> Bests,
> Dongjoon.
>
>
> On Tue, Nov 14, 2017 at 3:46 AM, Mark Petruska <petruska.m...@gmail.com>
> wrote:
>
>> Hi,
>> I'm very new to Spark development and would like to get guidance from
>> more experienced members. Sorry this email will be long, as I try to
>> explain the details.
>>
>> I started to investigate SPARK-22267
>> <https://issues.apache.org/jira/browse/SPARK-22267> and added some test
>> cases to highlight the problem in the PR
>> <https://github.com/apache/spark/pull/19744>. Here are my findings:
>>
>> - For Parquet the test case succeeds as expected.
>>
>> - The SQL test case for ORC:
>>   - When CONVERT_METASTORE_ORC is set to "true", the data fields are
>>     presented in the desired order.
>>   - When it is "false", the columns are read in the wrong order.
>>   - Reason: when `isConvertible` returns true in `RelationConversions`,
>>     the plan executes `convertToLogicalRelation`, which in turn uses
>>     `OrcFileFormat` to read the data; otherwise it uses the classes in
>>     "hive-exec:1.2.1".
>>
>> - The HadoopRDD test case was added to further investigate the parameter
>>   values and discover a working combination, but unfortunately no
>>   combination of "serialization.ddl" and "columns" results in success.
>>   Those fields seem to have no effect on the order of the resulting data
>>   fields.
>>
>> At this point I do not see any option to fix this issue without risking
>> backward-compatibility problems.
>> The possible actions (as I see them):
>> - Link against a newer version of "hive-exec": surely this bug has been
>>   fixed in a newer release.
>> - Use `OrcFileFormat` for reading ORC data regardless of the
>>   CONVERT_METASTORE_ORC setting.
>> - There is also an `OrcNewInputFormat` class in "hive-exec", but it
>>   implements the InputFormat interface from a different package
>>   (mapreduce instead of mapred), so it is incompatible with HadoopRDD at
>>   the moment.
>>
>> Please help me: did I miss any viable options?
>>
>> Thanks,
>> Mark
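For completeness, the HadoopRDD probe mentioned in the thread above had roughly this shape. This is only a spark-shell sketch with illustrative names, path, and property values (the real test cases are in the PR):

    import org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
    import org.apache.hadoop.io.{NullWritable, Writable}
    import org.apache.hadoop.mapred.{FileInputFormat, InputFormat, JobConf}

    val jobConf = new JobConf(sc.hadoopConfiguration)
    // Hypothetical warehouse path of the ORC table under test.
    FileInputFormat.setInputPaths(jobConf, "/warehouse/orc_order_test")
    // The two table properties probed in the test; neither appears to
    // influence the order of the fields that come back.
    jobConf.set("columns", "c2,c1")
    jobConf.set("serialization.ddl", "struct orc_order_test { string c2, string c1}")

    // OrcStruct is package-private in hive-exec 1.2.1, so the values are
    // read as plain Writables; the cast bridges Java's invariant generics.
    val rdd = sc.hadoopRDD(
      jobConf,
      classOf[OrcInputFormat].asInstanceOf[Class[InputFormat[NullWritable, Writable]]],
      classOf[NullWritable],
      classOf[Writable])
    rdd.map(_._2.toString).collect().foreach(println)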