[GitHub] [arrow] wjones1 commented on pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader

2020-10-07 Thread GitBox


wjones1 commented on pull request #6979:
URL: https://github.com/apache/arrow/pull/6979#issuecomment-705052002


   @emkornfield Thanks for catching that. I've fixed the formatting issues. 
Looks like it's not just the Windows R checks that are failing.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wjones1 commented on pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader

2020-10-02 Thread GitBox


wjones1 commented on pull request #6979:
URL: https://github.com/apache/arrow/pull/6979#issuecomment-703043604


   Got the actions to rerun, but some are still failing. As far as I can tell, 
these failures aren't due to the changes in this PR. 



[GitHub] [arrow] wjones1 commented on pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader

2020-09-19 Thread GitBox


wjones1 commented on pull request #6979:
URL: https://github.com/apache/arrow/pull/6979#issuecomment-695343480


   I'm back on this for the weekend and will be back as needed the week after 
next.
   
   @jorisvandenbossche I can confirm that once I merge in the latest changes 
from apache master, I am getting the batches to be the expected size (and 
spanning across the parquet chunks). I have updated the 
`test_iter_batches_columns_reader` unit test accordingly.
   
   However, I have found that if the columns selected include a categorical 
column, then it reverts to the behavior I was seeing before. You should be 
able to reproduce this by editing line 178 of the parquet tests to include 
`categorical=True`; the `test_iter_batches_columns_reader` test will then 
fail. I'm fine leaving this fix for a future story.



[GitHub] [arrow] wjones1 commented on pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader

2020-07-16 Thread GitBox


wjones1 commented on pull request #6979:
URL: https://github.com/apache/arrow/pull/6979#issuecomment-659710775


   I have the unit tests passing locally, but they seem to be failing in CI. I 
probably need to rebase again and retest. Will do so when I get some time this 
weekend.



[GitHub] [arrow] wjones1 commented on pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader

2020-07-15 Thread GitBox


wjones1 commented on pull request #6979:
URL: https://github.com/apache/arrow/pull/6979#issuecomment-659157816


   On second thought, I think if users want consistent batch sizes, they can 
probably add that functionality in a wrapping generator.
   
   I have adjusted my tests to the new behavior and edited the docstring to 
indicate that the `batch_size` parameter is a maximum.
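   A minimal sketch of the resulting behavior (the file path, column name, and 
sizes are illustrative, and this assumes a pyarrow build that includes this 
PR's `iter_batches()`):

   ```python
   import os
   import tempfile

   import pyarrow as pa
   import pyarrow.parquet as pq

   # Write 2,000 rows in row groups of 1,000.
   table = pa.table({"x": pa.array(range(2000))})
   path = os.path.join(tempfile.mkdtemp(), "example.parquet")
   pq.write_table(table, path, row_group_size=1000)

   # batch_size is an upper bound: batches at row-group boundaries
   # can come back smaller than 900.
   pf = pq.ParquetFile(path)
   sizes = [b.num_rows for b in pf.iter_batches(batch_size=900)]
   ```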



[GitHub] [arrow] wjones1 commented on pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader

2020-07-15 Thread GitBox


wjones1 commented on pull request #6979:
URL: https://github.com/apache/arrow/pull/6979#issuecomment-659143356


   So it appears there were changes to the underlying implementation of 
RecordBatchReader. Prior to these changes, it would yield record batches with 
the exact batch size (if possible). So for a `batch_size` of 900 on a file 
written with `chunk_size` of 1,000, it would yield batches of: 900, 900, 900, 
900, ... rows. Now it yields slices that are aligned with the row groups, so 
the same parameters would yield batches with row counts: 900, 100, 900, 100, 
and so on.
   
   I'm not 100% sure if we care about the exact number of rows returned, but 
for now I'm leaning towards yes. Open to feedback on that. In the meantime, I 
will push changes soon that will stitch together the batches to yield 
consistent rowcounts.
   
   A cool side effect of these changes is that it gets around the bug I 
mentioned earlier that would have blocked support for categorical columns in 
this method.
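   One way to stitch the row-group-aligned batches back into consistent row 
counts is a wrapping generator along these lines (a sketch, not the code in 
this PR; `rebatch` is a hypothetical helper):

   ```python
   import pyarrow as pa

   def rebatch(batches, batch_size):
       """Re-slice a stream of record batches into batches of exactly
       batch_size rows (except possibly the last one)."""
       buffer = None
       for batch in batches:
           t = pa.Table.from_batches([batch])
           buffer = t if buffer is None else pa.concat_tables([buffer, t])
           while buffer.num_rows >= batch_size:
               # combine_chunks() flattens a slice that spans row groups
               chunk = buffer.slice(0, batch_size).combine_chunks()
               buffer = buffer.slice(batch_size)
               yield chunk.to_batches()[0]
       if buffer is not None and buffer.num_rows > 0:
           yield buffer.combine_chunks().to_batches()[0]
   ```

   With the 900/1,000 example above, this turns the 900, 100, 900, 100, ... 
stream into 900, 900, 900, ... batches.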



[GitHub] [arrow] wjones1 commented on pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader

2020-07-03 Thread GitBox


wjones1 commented on pull request #6979:
URL: https://github.com/apache/arrow/pull/6979#issuecomment-653687403


   Looking at the code, I no longer think this `batch_size` parameter would 
actually affect those other read methods.
   
   There are a few different "batch_size" parameters floating around the 
`reader.cc`, but there's only one reference to the one in `reader_properties_` 
(`ArrowReaderProperties`): 
   
https://github.com/apache/arrow/blob/edf24290046d95967d620104c5238f30ff032b6d/cpp/src/parquet/arrow/reader.cc#L797-L799
   
   As far as I can tell, that's exclusively on the code path for the 
RecordBatchReader, and not the other readers. So I don't think we need to add 
the parameter to those other methods.



[GitHub] [arrow] wjones1 commented on pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader

2020-06-29 Thread GitBox


wjones1 commented on pull request #6979:
URL: https://github.com/apache/arrow/pull/6979#issuecomment-651140872


   Actually @jorisvandenbossche, I agree we should probably just add in the 
batch_size argument (with a sensible default) to those other methods. Took me a 
while to understand what you were getting at, but by providing those parameters 
we avoid the issue where `iter_batches()` calls could mess with the underlying 
parameters when calling the other methods.
   
   I will implement those soon. I would especially welcome input on what the 
docstring should be for the `batch_size` parameter for those other methods 
(what is the user-facing impact of changing it?) and what the default value 
should be.



[GitHub] [arrow] wjones1 commented on pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader

2020-06-26 Thread GitBox


wjones1 commented on pull request #6979:
URL: https://github.com/apache/arrow/pull/6979#issuecomment-650474759


   RE: @jorisvandenbossche 
   > Same question as in the other PR: does setting the batch size also 
influence existing methods like `read` or `read_row_group` ? Should we add that 
keyword there as well?
   
   My one hesitation about adding them is that it's not clear to me what the 
effect would be on execution. For `iter_batches()` the effect of 
`batch_size` is quite obvious, but I'm not sure about these other methods. 
   
   After a quick search of the Apache Arrow docs, the only explanation I saw of 
the batch size parameter was this:
   
   >  [The maximum row count for scanned record batches. If scanned record 
batches are overflowing memory then this method can be called to reduce their 
size.](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Scanner.html?highlight=batch_size)
   
   
   
   If I can find a good explanation for it or if you have one, I'd be happy to 
add the `batch_size` parameter to the `read()` and `read_row_group()` methods.



[GitHub] [arrow] wjones1 commented on pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader

2020-06-26 Thread GitBox


wjones1 commented on pull request #6979:
URL: https://github.com/apache/arrow/pull/6979#issuecomment-650472498


   Apologies, I've been away for a bit. I thought I had invited @sonthonaxrk as 
a collaborator on my fork, but perhaps that didn't go through.
   
   Addressed the minor feedback.
   
   Also retested the issue I was having where tests would fail if the 
`batch_size` and `chunk_size` weren't aligned, and I can't reproduce that error 
any more. Hopefully that means it was fixed and I'm not forgetting something 
important.



[GitHub] [arrow] wjones1 commented on pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader

2020-04-25 Thread GitBox


wjones1 commented on pull request #6979:
URL: https://github.com/apache/arrow/pull/6979#issuecomment-619463693


   I found the cause of the test failure: If the `batch_size` isn't aligned 
with the `chunk_size`, categorical columns will fail with the error:
   ```
   pyarrow.lib.ArrowNotImplementedError: This class cannot yet iterate chunked 
arrays
   ```
   
   I think this means categorical columns/DictionaryArray columns aren't 
supported by this method for now, except if you are able to align the 
`batch_size` with `chunk_size`. 
   
   Is it possible or even common that `chunk_size` might be variable within a 
file?
   
   (The reason we were seeing the error in Python 3.5 and not in later Python 
versions is I was selecting a subset of columns using indices, and the ordering 
of columns changed between Python versions. I think because of the change in 
dictionary ordering in 3.6+. I've instead moved to have the offending test run 
on all columns.)
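   A sketch of the misaligned case (the file path, column name, and sizes are 
illustrative; in later pyarrow releases, where dictionary columns are handled, 
this iteration completes without the error above):

   ```python
   import os
   import tempfile

   import pyarrow as pa
   import pyarrow.parquet as pq

   # A dictionary-encoded ("categorical") column, 2,000 rows,
   # written in row groups of 1,000.
   col = pa.array(["a", "b", "c", "d"] * 500).dictionary_encode()
   path = os.path.join(tempfile.mkdtemp(), "cat.parquet")
   pq.write_table(pa.table({"cat": col}), path, row_group_size=1000)

   # read_dictionary keeps the column dictionary-encoded on read;
   # batch_size=900 is deliberately misaligned with the 1,000-row groups.
   pf = pq.ParquetFile(path, read_dictionary=["cat"])
   sizes = [b.num_rows for b in pf.iter_batches(batch_size=900)]
   ```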



[GitHub] [arrow] wjones1 commented on pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader

2020-04-25 Thread GitBox


wjones1 commented on pull request #6979:
URL: https://github.com/apache/arrow/pull/6979#issuecomment-619437378


   Two failing checks right now.
   
   For the linting one, it seems to be alarmed by some Rust code that I didn't 
touch. Am I missing something in that output?
   
   For the Python tests, getting [this 
error](https://github.com/apache/arrow/pull/6979/checks?check_run_id=618464479#step:7:2128):
   ```
   pyarrow.lib.ArrowNotImplementedError: This class cannot yet iterate chunked 
arrays
   ```
   When I run the same test locally, it succeeds. 


