Hi Dongjoon,

Thanks for the info. Unfortunately I did not find a way to fix the issue without forcing CONVERT_METASTORE_ORC (sketched below) or changing the ORC reader implementation. I'm closing the PR, as it was only used to demonstrate the root cause.

Best regards,
Mark
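P.S. For anyone following along: the workaround boils down to a single flag. A minimal spark-shell sketch; `orc_order_test` is a placeholder table name, not from the actual PR tests:

    // Force Spark to convert metastore ORC tables to the native
    // OrcFileFormat read path (RelationConversions / convertToLogicalRelation)
    // instead of going through the hive-exec 1.2.1 reader.
    spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")

    // With the flag on, the columns come back in the declared order.
    spark.sql("SELECT * FROM orc_order_test").show()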
On Tue, Nov 14, 2017 at 6:58 PM, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
> Hi, Mark.
>
> That is one of the reasons why I left it out of the previous PR (below);
> I'm focusing on the second approach: use OrcFileFormat with
> convertMetastoreOrc.
>
> https://github.com/apache/spark/pull/19470
> [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark schema to read ORC
> table instead of ORC file schema
>
> With `convertMetastoreOrc=true`, Spark 2.3 will become more stable and
> faster. It is also the default way Spark handles Parquet.
>
> BTW, thank you for looking at SPARK-22267. So far, I'm not looking at that
> issue. If we can get a fix for SPARK-22267 into Spark 2.3, it would be
> great!
>
> Bests,
> Dongjoon.
>
>
> On Tue, Nov 14, 2017 at 3:46 AM, Mark Petruska <petruska.m...@gmail.com>
> wrote:
>
>> Hi,
>> I'm very new to Spark development and would like to get guidance from
>> more experienced members. Sorry this email will be long, as I try to
>> explain the details.
>>
>> I started to investigate SPARK-22267
>> <https://issues.apache.org/jira/browse/SPARK-22267> and added some test
>> cases to highlight the problem in the PR
>> <https://github.com/apache/spark/pull/19744>. Here are my findings:
>>
>> - For Parquet the test case succeeds as expected.
>>
>> - The SQL test case for ORC:
>>   - When CONVERT_METASTORE_ORC is set to "true", the data fields are
>>     presented in the desired order.
>>   - When it is "false", the columns are read in the wrong order.
>>   - Reason: when `isConvertible` returns true in `RelationConversions`,
>>     the plan executes `convertToLogicalRelation`, which in turn uses
>>     `OrcFileFormat` to read the data; otherwise it uses the classes in
>>     "hive-exec:1.2.1".
>>
>> - The HadoopRDD test case was added to further investigate the parameter
>>   values and discover a working combination, but unfortunately no
>>   combination of "serialization.ddl" and "columns" results in success.
>>   Those fields seem to have no effect on the order of the resulting data
>>   fields.
>>
>> At this point I do not see any option to fix this issue without risking
>> backward-compatibility problems.
>> The possible actions (as I see them):
>> - Link against a newer version of "hive-exec": surely this bug has been
>>   fixed in a newer release.
>> - Use `OrcFileFormat` for reading ORC data regardless of the
>>   CONVERT_METASTORE_ORC setting.
>> - There is also an `OrcNewInputFormat` class in "hive-exec", but it
>>   implements the InputFormat interface from a different package
>>   (mapreduce instead of mapred), so it is incompatible with HadoopRDD at
>>   the moment.
>>
>> Please help me: did I miss any viable options?
>>
>> Thanks,
>> Mark
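For completeness, the HadoopRDD probe mentioned in the thread above had roughly this shape. This is only a spark-shell sketch with illustrative names, path, and property values (the real test cases are in the PR):

    import org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
    import org.apache.hadoop.io.{NullWritable, Writable}
    import org.apache.hadoop.mapred.{FileInputFormat, InputFormat, JobConf}

    val jobConf = new JobConf(sc.hadoopConfiguration)
    // Hypothetical warehouse path of the ORC table under test.
    FileInputFormat.setInputPaths(jobConf, "/warehouse/orc_order_test")
    // The two table properties probed in the test; neither appears to
    // influence the order of the fields that come back.
    jobConf.set("columns", "c2,c1")
    jobConf.set("serialization.ddl", "struct orc_order_test { string c2, string c1}")

    // OrcStruct is package-private in hive-exec 1.2.1, so the values are
    // read as plain Writables; the cast bridges Java's invariant generics.
    val rdd = sc.hadoopRDD(
      jobConf,
      classOf[OrcInputFormat].asInstanceOf[Class[InputFormat[NullWritable, Writable]]],
      classOf[NullWritable],
      classOf[Writable])
    rdd.map(_._2.toString).collect().foreach(println)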