n3world edited a comment on pull request #10255:
URL: https://github.com/apache/arrow/pull/10255#issuecomment-833185925


   > That would be good. Eventually the dataset scanner will probably be 
getting a skip operation of some kind as well so that'll increase the pressure 
on [ARROW-8527](https://issues.apache.org/jira/browse/ARROW-8527). 
[ARROW-12598](https://issues.apache.org/jira/browse/ARROW-12598) is also 
(admittedly tangentially) related since you seem to be on a roll :smile:
   
   The only tricky part about a count(*) implementation with this is that 
skip_rows is documented as skipping header rows, which shouldn't be counted as 
part of a data row count. I feel like the row count operation would have to be 
a little different, and maybe report the line on which the actual data rows 
start, so that the header rows before that point can be skipped.
   
   Maybe a simpler solution would be a set of two indexes: column names and 
first data row. While this doesn't allow arbitrary row skipping in the middle, 
it would cover the most common use cases, including skipping over valid rows to 
reach the first desired row. Another option or operation could then be used to 
count the number of data rows starting at the first data row. The defaults 
would be 0, 1 when column names are part of the CSV, or -1, 0 when they are not.
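   
   As a rough sketch of the two-index idea, here is what it could look like in plain Python using the standard csv module. This is not Arrow's reader or its API: `read_with_indexes`, `count_data_rows`, `column_names_row` and `first_data_row` are hypothetical names, the indexes are assumed to refer to parsed rows rather than physical lines, and the `f0`, `f1`, ... names used when there is no header row are just placeholders.

```python
import csv
import io


def read_with_indexes(text, column_names_row=0, first_data_row=1):
    """Hypothetical helper: split CSV text into column names and data rows.

    column_names_row: index of the row holding column names, or -1 if none.
    first_data_row:   index of the first row to treat as data.
    """
    rows = list(csv.reader(io.StringIO(text)))
    data = rows[first_data_row:]
    if column_names_row >= 0:
        names = rows[column_names_row]
    else:
        # No header row: make up placeholder names from the first data row.
        width = len(data[0]) if data else 0
        names = [f"f{i}" for i in range(width)]
    return names, data


def count_data_rows(text, first_data_row=1):
    """Hypothetical helper: count rows at or after first_data_row."""
    reader = csv.reader(io.StringIO(text))
    return sum(1 for index, _ in enumerate(reader) if index >= first_data_row)


# A file with one junk line before the header: names on row 1, data from row 2.
sample = "exported by some tool\na,b\n1,2\n3,4\n"
names, data = read_with_indexes(sample, column_names_row=1, first_data_row=2)
print(names, len(data), count_data_rows(sample, first_data_row=2))
```

   Because the count starts at the first data row, the header rows before it never enter the total, which is the concern with reusing skip_rows for this.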


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

