taiyang-li commented on issue #6588: URL: https://github.com/apache/incubator-gluten/issues/6588#issuecomment-2264790703
Cause:

1. Insert stage

   When `SET spark.sql.storeAssignmentPolicy = LEGACY` is in effect, the insert operator (for the ORC format, `ORCBlockOutputFormat`) uses the CH Block output by FakeRow as its header. As a result, the column types in the ORC file that `ORCBlockOutputFormat` writes differ from those in the CREATE TABLE SQL.

   `old_header` is the schema of the CH Block output by FakeRow, in which field `c` has type `Tuple(String, Nullable(String))`.
   `new_header` is the schema from the table metadata, in which field `c` has type `Tuple(d Nullable(String), e Nullable(String))`.

   ```
   old_header:a#27 Int32 Int32(size = 0), b#28 Map(String, String) Map(size = 0, Array(size = 0, UInt64(size = 0), Tuple(size = 0, String(size = 0), String(size = 0)))), c#29 Tuple(String, Nullable(String)) Tuple(size = 0, String(size = 0), Nullable(size = 0, String(size = 0), UInt8(size = 0)))
   xxx
   new_header:a Int32 Int32(size = 0), b Map(String, Nullable(String)) Map(size = 0, Array(size = 0, UInt64(size = 0), Tuple(size = 0, String(size = 0), Nullable(size = 0, String(size = 0), UInt8(size = 0))))), c Tuple(d Nullable(String), e Nullable(String)) Tuple(size = 0, Nullable(size = 0, String(size = 0), UInt8(size = 0)), Nullable(size = 0, String(size = 0), UInt8(size = 0)))
   ```

2. Select stage

   Following from 1, when the ORC file is read, the field stored in the file as `Tuple(String, Nullable(String))` is read with the target type `Tuple(d Nullable(String), e Nullable(String))`, so the correct data cannot be read. The overall symptom is that the data returned by select does not match the data that was inserted.

-- 
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
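The header mismatch described in the comment can be made concrete with a small comparison. The sketch below is illustrative only: the type strings are copied from the `old_header` / `new_header` dump (with the `#27`-style suffixes dropped from the column names), while `OLD_HEADER`, `NEW_HEADER`, and `diff_headers` are hypothetical names that do not exist in Gluten or ClickHouse.

```python
# Column types copied from the old_header / new_header dump in the comment.
# The "#27"-style suffixes on the old_header column names are dropped here
# so that columns can be compared by their plain names (an assumption made
# for this sketch only).
OLD_HEADER = {  # schema of the CH Block output by FakeRow (what gets written)
    "a": "Int32",
    "b": "Map(String, String)",
    "c": "Tuple(String, Nullable(String))",
}
NEW_HEADER = {  # schema from the table metadata (what the reader expects)
    "a": "Int32",
    "b": "Map(String, Nullable(String))",
    "c": "Tuple(d Nullable(String), e Nullable(String))",
}


def diff_headers(written, expected):
    """Return {column: (written_type, expected_type)} for columns whose types differ."""
    return {
        name: (written.get(name), expected_type)
        for name, expected_type in expected.items()
        if written.get(name) != expected_type
    }


if __name__ == "__main__":
    for name, (w, e) in diff_headers(OLD_HEADER, NEW_HEADER).items():
        print(f"{name}: file has {w!r}, table expects {e!r}")
```

Run on the two headers from the dump, this reports column `c` (the one the comment calls out) and also column `b`, whose value type is `String` in `old_header` but `Nullable(String)` in `new_header`.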
