asayers opened a new issue, #1708: URL: https://github.com/apache/arrow-rs/issues/1708
**Describe the bug**

As far as I can tell, the DELTA_BINARY_PACKED decoder can sometimes get into a "runaway" state, where the values it produces get more and more different from the correct value.

**To Reproduce**

I've uploaded two small parquet files:

* http://www.asayers.com/files/good.parquet (7 KiB)
* http://www.asayers.com/files/bad.parquet (4 KiB)

They both contain the same data: a bunch of numbers in a single INT64 column. They were both produced using the parquet crate (HEAD, 3135a53e).

* In good.parquet they're encoded as PLAIN RLE.
* In bad.parquet they're encoded as DELTA_BINARY_PACKED RLE.

Here's a program which reads the data back and prints it:

```rust
fn main() {
    let path = std::env::args().nth(1).unwrap();
    let file = std::fs::File::open(path).unwrap();
    let rdr = parquet::file::serialized_reader::SerializedFileReader::new(file).unwrap();
    for row in rdr.into_iter() {
        use parquet::record::RowAccessor;
        println!("{}", row.get_long(0).unwrap());
    }
}
```

Running this program on good.parquet gives back the correct numbers, ending like so:

```
1468797175075281492
1468797175076709400
1468797175076902975
1468797175077250207
1468797175077829776
1468797175077980647
1468797175078872591
1468797175083351616
1468797175804941300
1468797178324785080
```

Running on bad.parquet looks correct at first, but then towards the end it starts to go wacky:

```
1468797175059406818
1468797175059715342
1468797175059831514
1468797175060873770
2937593677245033068
4406390179428921343
5875186685908006897
7343983196682139470
8812779707456502586
-8165167851184027777
-6696371327525264784
-5227574803866540077
```

**More context**

I spent some time yesterday minimising this reproducer (it started at 1.4 GiB). However, while I was able to reproduce the bug with a variety of data, I was never able to do it with fewer than 1000 rows. I hope that gives you a clue to the cause!
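As a rough illustration (not part of the original reproducer), the "runaway" can be made visible by computing the step between successive values in the bad.parquet output quoted in the report. The sketch below uses only the standard library. The names `BAD_TAIL` and `wrapping_deltas` are hypothetical helpers introduced here, and the numbers are copied from the report rather than read from the file.

```rust
/// The twelve values printed for bad.parquet in the report.
const BAD_TAIL: [i64; 12] = [
    1468797175059406818,
    1468797175059715342,
    1468797175059831514,
    1468797175060873770,
    2937593677245033068,
    4406390179428921343,
    5875186685908006897,
    7343983196682139470,
    8812779707456502586,
    -8165167851184027777,
    -6696371327525264784,
    -5227574803866540077,
];

/// Successive differences, using wrapping arithmetic so that a step which
/// crosses i64::MAX is reported as the step that was actually added,
/// instead of overflowing.
fn wrapping_deltas(values: &[i64]) -> Vec<i64> {
    values.windows(2).map(|w| w[1].wrapping_sub(w[0])).collect()
}

fn main() {
    for (i, d) in wrapping_deltas(&BAD_TAIL).iter().enumerate() {
        println!("step {:2}: {:+}", i + 1, d);
    }
}
```

The first three steps are small (under about a million), while every later step is close to 1.4688e18, including the one where the printed value turns negative. That pattern would be consistent with some large quantity being re-added on every row and the running total eventually wrapping around `i64`, but that reading is only a guess from the printed numbers, not a confirmed diagnosis of the decoder.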
