wjones1 commented on a change in pull request #6979:
URL: https://github.com/apache/arrow/pull/6979#discussion_r415130068
##########
File path: python/pyarrow/tests/test_parquet.py
##########

@@ -179,6 +179,99 @@ def alltypes_sample(size=10000, seed=0, categorical=False):

 @pytest.mark.pandas
+def test_iter_batches_columns_reader(tempdir):
+    df = alltypes_sample(size=10000, categorical=True)
+
+    filename = tempdir / 'pandas_roundtrip.parquet'
+    arrow_table = pa.Table.from_pandas(df)
+    _write_table(arrow_table, filename, version="2.0",
+                 coerce_timestamps='ms', chunk_size=1000)
+
+    columns = df.columns[4:15]
+
+    file_ = pq.ParquetFile(filename)
+
+    batches = file_.iter_batches(
+        batch_size=500,
+        columns=columns
+    )
+
+    tm.assert_frame_equal(
+        next(batches).to_pandas(),
+        df.iloc[:500, :].loc[:, columns]
+    )
+
+
+@pytest.mark.pandas
+@pytest.mark.parametrize('chunk_size', [1000])
+def test_iter_batches_reader(tempdir, chunk_size):

Review comment:
   Strangely, after I merged the latest changes from master, I am no longer seeing this issue with dictionary arrays. I definitely saw it in the original fork, so I think it may actually have been fixed (though I am not sure where). I've removed the dictionary-array correction from the test, and hopefully CI will confirm what I am seeing.
