thisisnic commented on pull request #12083: URL: https://github.com/apache/arrow/pull/12083#issuecomment-1012889685
Yeah, spot on. There is an extra level of complexity here because the way we infer column names from a schema is a bit incorrect (but this is being dealt with as part of another issue).

Are you happy with leaving the existing code that derives column names from a schema as it was before any of the changes in this PR and, as you say, raising an error if CsvReadOptions$create() is used for read_options but is not consistent with the schema? A PR that was just merged does a similar thing for partitioning, and is a great example of this kind of check being done really nicely: https://github.com/apache/arrow/blob/99f7c3cf3e6c2a9555ceff3d48ef73e485ede546/r/R/dataset-factory.R#L85-L95

What you said about where the code for this should go sounds right, and thanks for volunteering to update the docs too!

Thanks for sticking with this even though we've drastically changed what we're doing to resolve this. Even though we're not using the code from your previous solutions, the process of testing them out and the surrounding discussion have been really helpful for identifying some serious enhancements that can be made to how the schema and column name components interact, and also to how we direct our users to work with `open_dataset()`.
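
For illustration only, the consistency check discussed in this comment (erroring when the column names given via read_options disagree with the schema) could look roughly like the sketch below. The helper name `check_column_names()` is hypothetical and is not taken from the PR. It works on plain character vectors so that it runs without arrow; the assumption is that in the real code these would come from the schema's names and from the column names set on the read options.

```r
# Hypothetical helper (not part of arrow): compare the column names a user
# supplied via read options with the names defined in the schema.
# Both arguments are plain character vectors so the sketch is self-contained.
check_column_names <- function(schema_names, column_names) {
  # An empty vector is treated as "column names were not set", so there is
  # nothing to validate (this is an assumption of the sketch).
  if (length(column_names) == 0) {
    return(invisible(TRUE))
  }

  # Require the same names in the same order.
  if (!identical(schema_names, column_names)) {
    stop(
      "Column names in `read_options` do not match the schema.\n",
      "Schema: ", paste(schema_names, collapse = ", "), "\n",
      "read_options: ", paste(column_names, collapse = ", "),
      call. = FALSE
    )
  }

  invisible(TRUE)
}
```

Whether a mismatch in order alone should also be an error is a design choice; the sketch takes the strict option by using `identical()`.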
