asayers opened a new issue, #1708:
URL: https://github.com/apache/arrow-rs/issues/1708

   **Describe the bug**
   
   As far as I can tell, the DELTA_BINARY_PACKED decoder can sometimes get into 
a "runaway" state, where the values it produces drift further and further from 
the correct ones.
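
   For anyone unfamiliar with the encoding, here is a minimal sketch (not the 
parquet crate's actual implementation) of how delta decoding reconstructs values 
by accumulation. It shows why this failure mode is "runaway": one mis-decoded 
delta shifts every subsequent value.

   ```rust
   /// Toy delta decoder: each output value is the running sum of a base
   /// value plus the decoded deltas (wrapping, as INT64 arithmetic does).
   fn decode_deltas(first_value: i64, deltas: &[i64]) -> Vec<i64> {
       let mut out = Vec::with_capacity(deltas.len() + 1);
       let mut acc = first_value;
       out.push(acc);
       for &d in deltas {
           acc = acc.wrapping_add(d);
           out.push(acc);
       }
       out
   }

   fn main() {
       let good = decode_deltas(100, &[1, 2, 3]);
       // One corrupted delta (2 -> 1_000_002) shifts this value and every
       // value after it -- the error never self-corrects.
       let bad = decode_deltas(100, &[1, 1_000_002, 3]);
       println!("{:?}", good); // [100, 101, 103, 106]
       println!("{:?}", bad);  // [100, 101, 1000103, 1000106]
   }
   ```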
   
   **To Reproduce**
   
   I've uploaded two small parquet files:
   
   * http://www.asayers.com/files/good.parquet (7 KiB)
   * http://www.asayers.com/files/bad.parquet (4 KiB)
   
   They both contain the same data: a bunch of numbers in a single INT64 
column.  They were both produced using the parquet crate (HEAD, 3135a53e).
   
   * In good.parquet they're encoded as PLAIN RLE.
   * In bad.parquet they're encoded as DELTA_BINARY_PACKED RLE.
   
   Here's a program which reads the data back and prints it:
   
   ```rust
   fn main() {
       // Usage: <program> <path-to-parquet-file>
       let path = std::env::args().nth(1).unwrap();
       let file = std::fs::File::open(path).unwrap();
       let rdr = parquet::file::serialized_reader::SerializedFileReader::new(file).unwrap();
       for row in rdr.into_iter() {
           use parquet::record::RowAccessor;
           // Print the first (and only) column of each row as an i64.
           println!("{}", row.get_long(0).unwrap());
       }
   }
   ```
   
   Running this program on good.parquet gives back the correct numbers, ending 
like so:
   
   ```
   1468797175075281492
   1468797175076709400
   1468797175076902975
   1468797175077250207
   1468797175077829776
   1468797175077980647
   1468797175078872591
   1468797175083351616
   1468797175804941300
   1468797178324785080
   ```
   
   Running the same program on bad.parquet produces correct output at first, 
but towards the end it starts to go wacky:
   
   ```
   1468797175059406818
   1468797175059715342
   1468797175059831514
   1468797175060873770
   2937593677245033068
   4406390179428921343
   5875186685908006897
   7343983196682139470
   8812779707456502586
   -8165167851184027777
   -6696371327525264784
   -5227574803866540077
   ```
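
   One observation (mine, not verified against the decoder internals): once the 
output goes wrong, consecutive values jump by a near-constant ~1.4688e18 per 
row, which is roughly the magnitude of the correct values themselves. The 
snippet below computes the wrapping differences between the values printed 
above:

   ```rust
   fn main() {
       // The last eight values printed from bad.parquet above.
       let bad: [i64; 8] = [
           1468797175060873770,
           2937593677245033068,
           4406390179428921343,
           5875186685908006897,
           7343983196682139470,
           8812779707456502586,
           -8165167851184027777,
           -6696371327525264784,
       ];
       // Each printed difference is close to 1.4688e18, even across the
       // point where the values wrap around i64::MAX.
       for w in bad.windows(2) {
           println!("{}", w[1].wrapping_sub(w[0]));
       }
   }
   ```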
   
   **More context**
   
   I spent some time yesterday minimising this reproducer (it started out at 
1.4 GiB).  While I was able to reproduce the bug with a variety of data, I was 
never able to do so with fewer than 1000 rows.  I hope that gives you a clue to 
the cause!

