[
https://issues.apache.org/jira/browse/HUDI-5807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764420#comment-17764420
]
Lin Liu edited comment on HUDI-5807 at 9/12/23 11:10 PM:
---------------------------------------------------------
So far I have found the cause of the bug:
# During the write, no partition_path value is added to the base file or the log files.
# During the read, the base file reader can append the partition path to each
record, but the log file reader has no mechanism to add the partition_path
field to the payload.
# During merging, the record from the log file is output with its
partition_path field set to NULL.
I have tried adding the partition_path value to the underlying payload, that
is, the InternalRow. HoodieInternalRowUtils.RowUpdater can set a field via
`update(field_index, value)`. However, there is no efficient way to find the
field index other than a linear search through the fields, which is too costly
for a large number of records. We need to find, or build, a mechanism that
makes this lookup cheap.
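One possible direction, sketched below under stated assumptions: resolve the field index once per schema and reuse it for every record, so the O(#fields) scan is paid only on first use rather than per record. The class and field names here (SimpleSchema, FieldIndexCache) are illustrative stand-ins, not actual Hudi APIs.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FieldIndexCache {
    // Stand-in for a struct schema: an ordered list of field names.
    static class SimpleSchema {
        final List<String> fieldNames;
        SimpleSchema(String... names) { this.fieldNames = Arrays.asList(names); }
    }

    // Cache keyed by schema identity; the scan runs at most once per schema.
    private final Map<SimpleSchema, Integer> indexCache = new HashMap<>();
    private final String fieldName;

    FieldIndexCache(String fieldName) { this.fieldName = fieldName; }

    int fieldIndex(SimpleSchema schema) {
        // computeIfAbsent performs the linear indexOf scan only on first
        // use; subsequent lookups for the same schema are O(1).
        return indexCache.computeIfAbsent(schema,
            s -> s.fieldNames.indexOf(fieldName));
    }

    public static void main(String[] args) {
        SimpleSchema schema = new SimpleSchema("key", "partition_path", "value");
        FieldIndexCache cache = new FieldIndexCache("partition_path");
        // First call scans the field list; later calls (one per record)
        // hit the cache.
        System.out.println(cache.fieldIndex(schema)); // prints 1
        System.out.println(cache.fieldIndex(schema)); // prints 1
    }
}
```

Since all records in a given log block share one schema, the per-record cost of `RowUpdater.update(field_index, value)` would then be dominated by the update itself, not the index lookup.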
We should discuss the possible solutions.
> HoodieSparkParquetReader is not appending partition-path values
> ---------------------------------------------------------------
>
> Key: HUDI-5807
> URL: https://issues.apache.org/jira/browse/HUDI-5807
> Project: Apache Hudi
> Issue Type: Bug
> Components: spark
> Affects Versions: 0.13.0
> Reporter: Alexey Kudinkin
> Assignee: Lin Liu
> Priority: Blocker
> Fix For: 1.0.0
>
>
> The current implementation of HoodieSparkParquetReader does not support the
> case when "hoodie.datasource.write.drop.partition.columns" is set to true.
> In that case, partition-path values are expected to be parsed from the
> partition path and injected within the File Reader (this is the behavior of
> Spark's own readers).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)