alamb commented on issue #8824: URL: https://github.com/apache/datafusion/issues/8824#issuecomment-2079908068
> I came across this issue since I was looking for a way to deal with CSV files containing comments (in my case: lines starting with `# `). Please let me know if I should open a new issue for that.

I think a new issue would be good, as this one describes something different.

> I was originally looking for a way to replace the reader or hook after the builtin one, but from reading above comments this probably does not fit `datafusion`'s input handling.

I think you would have to provide your own `TableProvider` / format for the `ListingTable` provider if you wanted to do this today.

> _Skip first n rows_ wouldn't help in my case, but if there was a way to hook into the CSV reader by providing a line-based filter function it might (naively: a `CsvOptions` field like `filter: FnOnce(&str) -> bool`). Maybe something like this could even be generalized to a line-based transformer function a la `filter_map` (probably would need to return some `Option<Cow>` to not penalize use cases which only filter, but do not transform). If OP's lines to skip can be clearly identified this might be able to address their use case as well.

Since the readers are based on streams of data, I do wonder if we could implement some sort of filtering that happens prior to parsing and lets you transform the input however you want 🤔
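
To make the "filter before parsing" idea concrete, here is a minimal sketch using only the Rust standard library. Nothing in it is a DataFusion or arrow-csv API: `filter_map_lines` and `skip_comments` are hypothetical names for illustration. It uses `FnMut` rather than the `FnOnce` floated above, since the callback runs once per line, and it returns `Option<Cow<str>>` so that filter-only callers never allocate.

```rust
use std::borrow::Cow;
use std::io::{self, BufRead, Write};

/// Hypothetical helper (not part of DataFusion or arrow-csv): applies a
/// line-based `filter_map` to `input` and writes the surviving lines to
/// `output`, which could then be handed to any CSV reader.
///
/// - `None` drops the line.
/// - `Some(Cow::Borrowed(..))` passes the line through without allocating.
/// - `Some(Cow::Owned(..))` substitutes a transformed line.
///
/// Caveats of this naive approach: it assumes UTF-8 input, normalizes line
/// endings to `\n`, and is not quote-aware, so a quoted CSV field that
/// contains a newline would be seen as two separate lines.
pub fn filter_map_lines<R, W, F>(input: R, mut output: W, mut f: F) -> io::Result<()>
where
    R: BufRead,
    W: Write,
    F: for<'a> FnMut(&'a str) -> Option<Cow<'a, str>>,
{
    for line in input.lines() {
        let line = line?;
        if let Some(kept) = f(&line) {
            output.write_all(kept.as_bytes())?;
            output.write_all(b"\n")?;
        }
    }
    Ok(())
}

/// Example filter: drop lines that start with `#`, keep the rest unchanged.
pub fn skip_comments(line: &str) -> Option<Cow<'_, str>> {
    if line.starts_with('#') {
        None
    } else {
        Some(Cow::Borrowed(line))
    }
}
```

Calling `filter_map_lines(reader, &mut buf, skip_comments)` would leave only the non-comment lines in `buf`. A real integration would presumably do this incrementally on the byte stream rather than buffering the whole input, and how that would plug into the existing readers is exactly the open question here.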
