[GitHub] [arrow] jorisvandenbossche commented on pull request #13409: ARROW-16616: [Python] Lazy datasets filtering

GitBox Wed, 26 Oct 2022 03:13:20 -0700


jorisvandenbossche commented on PR #13409:
URL: https://github.com/apache/arrow/pull/13409#issuecomment-1291807667


   From https://github.com/apache/arrow/pull/13409#discussion_r914281708
   
   >> If `self._filter` can be `None` then what is the advantage of creating a 
separate `FilteredDataset` instead of just adding `_filter` to the existing 
`Dataset`?
   >
   > That was the original implementation, and I was asked to explicitly move 
it in a dedicated class. Which I think in the end makes sense, better have a 
single responsibility per class.
   
   Could you expand on this a bit? I don't know where or why it was asked to 
move this to a dedicated class; the only reference I can find in the other PR 
is the question of whether this should live on the Scanner instead.
   
   It seems to me that if we want to expose a helper `filter()` method 
(although it doesn't add much value compared to passing the filter to 
the method that actually does the scanning, i.e. `to_table(..)`, 
`to_batches(..)`, etc.), adding it directly to the main Dataset class exposes 
the least new API that we "lock in" (it avoids having to decide now whether we 
want some "Query"-like class).
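   
   To make the comparison concrete, here is a minimal, library-free sketch of 
the "just add `_filter` to the existing class" option. `ToyDataset`, its 
row-of-dicts storage and the plain-callable predicates are hypothetical 
stand-ins chosen for illustration; this is not pyarrow's implementation and 
not the code in this PR.

```python
from typing import Callable, Dict, Iterable, List, Optional

Row = Dict[str, int]
Predicate = Callable[[Row], bool]


class ToyDataset:
    """Hypothetical stand-in for a dataset class (not pyarrow's Dataset)."""

    def __init__(self, rows: Iterable[Row], _filter: Optional[Predicate] = None):
        self._rows = list(rows)
        self._filter = _filter  # None means no filter has been stored

    def filter(self, predicate: Predicate) -> "ToyDataset":
        # Lazy: only record the predicate, AND-ing it with any stored one.
        previous = self._filter
        if previous is None:
            combined = predicate
        else:
            combined = lambda row: previous(row) and predicate(row)
        return ToyDataset(self._rows, combined)

    def to_table(self, filter: Optional[Predicate] = None) -> List[Row]:
        # Filtering happens only here, at scan time. A stored filter and a
        # filter passed as an argument are both applied.
        result = []
        for row in self._rows:
            if self._filter is not None and not self._filter(row):
                continue
            if filter is not None and not filter(row):
                continue
            result.append(row)
        return result


dataset = ToyDataset([{"a": 1}, {"a": 2}, {"a": 3}])

# Option 1: pass the filter to the method that does the scanning.
via_argument = dataset.to_table(filter=lambda row: row["a"] > 1)

# Option 2: the helper method, which only stores the filter until the scan.
via_helper = dataset.filter(lambda row: row["a"] > 1).to_table()
```

   In this sketch both spellings return the same rows, and the helper adds 
only one method and one private attribute to the existing class rather than a 
new public type.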


