jonded94 commented on issue #9370:
URL: https://github.com/apache/arrow-rs/issues/9370#issuecomment-3861327615

   I have to go now, but this is what Claude spat out. Gonna open a PR later
   
   ● The test passes. Let me update the todo list and provide a summary of the fix.

   ● The fix works. Here's a summary:
                                             
     Bug Summary

     The issue was in
     /home/anrs/workspace/arrow-rs/parquet/src/column/reader.rs:287-382
     (skip_records function).
                                             
     Root cause: When skip_records completed successfully after crossing page
     boundaries, the has_partial state in the RepetitionLevelDecoder could
     incorrectly remain true. This happened when:

     1. The skip operation exhausted a page where has_record_delimiter was false
     2. The skip found the remaining records on the next page by counting a
        delimiter at index 0

     When a subsequent read_records(1) was called, the stale has_partial=true
     state caused count_records to incorrectly interpret the first repetition
     level (0) at index 0 as ending a "phantom" partial record, returning
     (1 record, 0 levels, 0 values) instead of properly reading the
     actual record data.
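     The effect of the stale flag can be reproduced with a small toy model.
     This is only a sketch: count_records_toy and its signature are
     hypothetical and are not the arrow-rs implementation. It assumes that a
     repetition level of 0 starts a new record, and that a 0 at index 0
     completes a record only when has_partial is set.

```rust
/// Toy record counter over repetition levels (hypothetical, for illustration
/// only; not the arrow-rs code).
///
/// Assumption: a level of 0 starts a new record. A 0 at index 0 completes a
/// record only if an earlier call left one open (`has_partial`).
/// Returns (records_counted, levels_consumed).
fn count_records_toy(levels: &[i16], has_partial: bool, records_to_read: usize) -> (usize, usize) {
    if records_to_read == 0 {
        return (0, 0);
    }
    let mut records = 0;
    for (idx, &level) in levels.iter().enumerate() {
        if level == 0 && (idx != 0 || has_partial) {
            records += 1;
            if records == records_to_read {
                // Stop at the delimiter; levels before it belong to the counted records.
                return (records, idx);
            }
        }
    }
    // Ran out of levels before reaching the requested record count.
    (records, levels.len())
}
```

     With levels [0, 1, 1, 0, 1] and has_partial = true, asking for one
     record returns after consuming zero levels, which is the "phantom"
     record described above. With has_partial = false, the same call
     consumes the three levels of the first real record.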
                                                                                
     The Fix

     Added code at the end of skip_records to reset the partial record state
     when all requested records have been successfully skipped:
                                             
     // After successfully skipping all requested records, we are positioned at a record
     // boundary. Reset any partial record state to ensure subsequent read operations
     // start fresh. This handles edge cases where the skip operation crossed page
     // boundaries and the has_partial state was not properly reset.
     if remaining_records == 0 {
         if let Some(decoder) = self.rep_level_decoder.as_mut() {
             decoder.flush_partial();
         }
     }
                                             
     This ensures that after skip_records completes, we're at a clean record
     boundary with no lingering partial record state, fixing the array length
     mismatch in StructArrayReader.
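     The shape of the reset can be exercised in isolation with a stand-in
     decoder. This is a sketch under assumptions: ToyRepDecoder and
     finish_skip are hypothetical names, and flush_partial is assumed to
     clear the flag and report whether a partial record was pending.

```rust
/// Hypothetical stand-in for the decoder's partial-record flag.
struct ToyRepDecoder {
    has_partial: bool,
}

impl ToyRepDecoder {
    /// Assumed semantics: clear the flag, return whether it was set.
    fn flush_partial(&mut self) -> bool {
        std::mem::take(&mut self.has_partial)
    }
}

/// Mirrors the added block: once nothing is left to skip, clear the flag.
/// A column without repetition levels has no decoder, hence the Option.
fn finish_skip(remaining_records: usize, decoder: Option<&mut ToyRepDecoder>) {
    if remaining_records == 0 {
        if let Some(decoder) = decoder {
            decoder.flush_partial();
        }
    }
}
```

     In this model the flag is cleared only when the skip fully completed.
     If records remain to be skipped, the flag is left alone so the open
     record can still be closed by later levels.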
                                                       
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
