[GitHub] [arrow] wjones1 commented on pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader
wjones1 commented on pull request #6979: URL: https://github.com/apache/arrow/pull/6979#issuecomment-705052002

@emkornfield Thanks for catching that. I've fixed the formatting issues. It looks like it's not just the Windows R checks that are failing.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
wjones1 commented on pull request #6979: URL: https://github.com/apache/arrow/pull/6979#issuecomment-703043604

Got the actions to rerun, but some are still failing. As far as I can tell, these failures aren't due to the changes in this PR.
wjones1 commented on pull request #6979: URL: https://github.com/apache/arrow/pull/6979#issuecomment-695343480

I'm back on this for the weekend and will be back as needed the week after next.

@jorisvandenbossche I can confirm that once I merge in the latest changes from apache master, I get batches of the expected size (spanning across the parquet chunks). I have updated the `test_iter_batches_columns_reader` unit test accordingly. However, I have found that if the selected columns include a categorical column, it reverts to the behavior I was seeing before. You should be able to reproduce this by editing line 178 of the parquet tests to include `categorical=True`; the `test_iter_batches_columns_reader` test will then fail. I'm fine leaving that fix for a future story.
wjones1 commented on pull request #6979: URL: https://github.com/apache/arrow/pull/6979#issuecomment-659710775

The unit tests pass locally, but they seem to be failing in CI. I probably need to rebase and test again. I'll do so when I get some time this weekend.
wjones1 commented on pull request #6979: URL: https://github.com/apache/arrow/pull/6979#issuecomment-659157816

On second thought, I think that if users want consistent batch sizes, they can probably add that functionality in a wrapping generator. I have adjusted my tests to the new behavior and edited the docstring to indicate that the `batch_size` parameter is a maximum.
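For what it's worth, a wrapping generator along those lines might look like the sketch below. It is written against a generic slice/concat interface so it stays library-agnostic; with pyarrow, one would presumably pass `lambda b: b.num_rows`, `lambda b, o, n: b.slice(o, n)`, and something built on `pa.Table.from_batches` for the three callables (that wiring is my assumption, not code from this PR):

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

B = TypeVar("B")

def rebatch(
    batches: Iterable[B],
    batch_size: int,
    num_rows: Callable[[B], int],
    slice_batch: Callable[[B, int, int], B],  # (batch, offset, length) -> slice
    concat: Callable[[List[B]], B],
) -> Iterator[B]:
    """Re-chunk a stream of variable-size batches into batches of exactly
    `batch_size` rows (the final batch may be shorter)."""
    buffer: List[B] = []   # slices accumulated toward the next output batch
    buffered = 0           # total rows currently held in `buffer`
    for batch in batches:
        offset, n = 0, num_rows(batch)
        while offset < n:
            take = min(batch_size - buffered, n - offset)
            buffer.append(slice_batch(batch, offset, take))
            buffered += take
            offset += take
            if buffered == batch_size:
                yield concat(buffer)
                buffer, buffered = [], 0
    if buffer:  # emit the short trailing batch, if any
        yield concat(buffer)
```

Feeding it the 900, 100, 900, 100, ... pattern described below would yield full 900-row batches until the data runs out.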
wjones1 commented on pull request #6979: URL: https://github.com/apache/arrow/pull/6979#issuecomment-659143356

It appears there were changes to the underlying implementation of RecordBatchReader. Prior to these changes, it would yield record batches with the exact batch size (where possible). So for a `batch_size` of 900 on a file written with a `chunk_size` of 1,000, it would yield batches of 900, 900, 900, 900, ... rows. Now it yields slices that are aligned with the row groups, so the same parameters yield batches with row counts of 900, 100, 900, 100, and so on.

I'm not 100% sure we care about the exact number of rows returned, but for now I'm leaning towards yes. Open to feedback on that. In the meantime, I will soon push changes that stitch the batches together to yield consistent row counts.

A cool side effect of these changes is that they get around the bug I mentioned earlier that would have blocked support for categorical columns in this method.
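To make the new behavior concrete, here is a small model of it (my own sketch, not the C++ implementation): because slices never cross row-group boundaries, each row group is simply cut into pieces of at most `batch_size` rows.

```python
from typing import Iterable, List

def rowgroup_aligned_batch_sizes(
    row_group_sizes: Iterable[int], batch_size: int
) -> List[int]:
    """Row counts of the batches yielded when slices are aligned with
    row groups: each group is cut into at-most-`batch_size` pieces."""
    sizes: List[int] = []
    for remaining in row_group_sizes:
        while remaining > 0:
            take = min(batch_size, remaining)
            sizes.append(take)
            remaining -= take
    return sizes
```

With four 1,000-row row groups and `batch_size=900`, this gives `[900, 100, 900, 100, 900, 100, 900, 100]`, matching the pattern described above.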
wjones1 commented on pull request #6979: URL: https://github.com/apache/arrow/pull/6979#issuecomment-653687403

Looking at the code, I no longer think this `batch_size` parameter would actually affect those other read methods. There are a few different "batch_size" parameters floating around `reader.cc`, but there's only one reference to the one in `reader_properties_` (`ArrowReaderProperties`): https://github.com/apache/arrow/blob/edf24290046d95967d620104c5238f30ff032b6d/cpp/src/parquet/arrow/reader.cc#L797-L799

As far as I can tell, that's exclusively on the code path for the RecordBatchReader, not the other readers. So I don't think we need to add the parameter to those other methods.
wjones1 commented on pull request #6979: URL: https://github.com/apache/arrow/pull/6979#issuecomment-651140872

Actually @jorisvandenbossche, I agree we should probably just add the batch_size argument (with a sensible default) to those other methods. It took me a while to understand what you were getting at, but by providing those parameters we avoid the issue where `iter_batches()` calls could mess with the underlying parameters used by the other methods. I will implement that soon.

I would especially welcome input on what the docstring for the `batch_size` parameter should say for those other methods (what is the user-facing impact of changing it?) and what the default value should be.
wjones1 commented on pull request #6979: URL: https://github.com/apache/arrow/pull/6979#issuecomment-650474759

RE: @jorisvandenbossche

> Same question as in the other PR: does setting the batch size also influence existing methods like `read` or `read_row_group`? Should we add that keyword there as well?

My one hesitation about adding it there is that it's not clear to me what the effect on execution would be. For `iter_batches()` the effect of `batch_size` is quite obvious, but I'm not sure about these other methods. After a quick search of the Apache Arrow docs, the only explanation I found for the batch size parameter was this:

> [The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Scanner.html?highlight=batch_size)

If I can find a good explanation for it, or if you have one, I'd be happy to add the `batch_size` parameter to the `read()` and `read_row_group()` methods.
wjones1 commented on pull request #6979: URL: https://github.com/apache/arrow/pull/6979#issuecomment-650472498

Apologies, I've been away for a bit. I thought I had invited @sonthonaxrk as a collaborator on my fork, but perhaps that didn't go through.

I've addressed the minor feedback. I also retested the issue I was having where tests would fail if the `batch_size` and `chunk_size` weren't aligned, and I can't reproduce that error any more. Hopefully that means it was fixed and I'm not forgetting something important.
wjones1 commented on pull request #6979: URL: https://github.com/apache/arrow/pull/6979#issuecomment-619463693

I found the cause of the test failure: if the `batch_size` isn't aligned with the `chunk_size`, categorical columns fail with the error:

```
pyarrow.lib.ArrowNotImplementedError: This class cannot yet iterate chunked arrays
```

I think this means categorical/`DictionaryArray` columns aren't supported by this method for now, unless you can align the `batch_size` with the `chunk_size`. Is it possible, or even common, for `chunk_size` to vary within a file?

(The reason we were seeing the error in Python 3.5 and not in later Python versions is that I was selecting a subset of columns by index, and the ordering of columns changed between Python versions, I think because of the change in dictionary ordering in 3.6+. I've instead changed the offending test to run on all columns.)
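For concreteness, "aligned" here means no batch spans a chunk boundary, which (assuming uniform chunk sizes) holds exactly when the chunk size is a multiple of the batch size. A tiny helper to pin that condition down (the function name is hypothetical, just for illustration):

```python
def is_aligned(chunk_size: int, batch_size: int) -> bool:
    """True when every batch of `batch_size` rows falls entirely inside one
    chunk of `chunk_size` rows, i.e. no batch crosses a chunk boundary
    (the case where categorical columns still work, per the error above)."""
    return chunk_size % batch_size == 0
```

So `chunk_size=1000` with `batch_size=500` is fine, but `batch_size=900` puts the first chunk boundary in the middle of the second batch.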
wjones1 commented on pull request #6979: URL: https://github.com/apache/arrow/pull/6979#issuecomment-619437378

Two checks are failing right now. The linting one seems to be alarmed by some Rust code that I didn't touch; am I missing something in that output? The Python tests are getting [this error](https://github.com/apache/arrow/pull/6979/checks?check_run_id=618464479#step:7:2128):

```
pyarrow.lib.ArrowNotImplementedError: This class cannot yet iterate chunked arrays
```

When I run the same test locally, it succeeds.