avantgardnerio commented on code in PR #2885:
URL: https://github.com/apache/arrow-datafusion/pull/2885#discussion_r925850259
########## datafusion/core/tests/sql/mod.rs: ##########
@@ -499,6 +537,77 @@ async fn register_tpch_csv(ctx: &SessionContext, table: &str) -> Result<()> {
     Ok(())
 }

+async fn register_tpch_csv_data(
+    ctx: &SessionContext,
+    table_name: &str,
+    data: &str,
+) -> Result<()> {
+    let schema = Arc::new(get_tpch_table_schema(table_name));

Review Comment:
   I started with the TPC-H `.csv`s that were already checked in, added some of my own, then added data to the existing ones and updated failing tests with new expected results. At that point I recognized a familiar feeling: I was going down the road to hell that is shared test data. I see it being particularly bad for aggregates (what is the correct expected result for the sum of all sales of parts from the Middle East?). I started breaking the `.csv`s out by folder, but that seemed cumbersome, so I finally concluded that this function might be a useful tool for keeping the data close to the test itself, like this: https://github.com/spaceandtimelabs/arrow-datafusion/blob/563d87d1c6413e611894619c2bc472b396d75c3d/datafusion/core/tests/sql/subqueries.rs#L171

   I don't have strong opinions about the implementation, but I would like to avoid sharing one set of data between all integration tests.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org
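The "keep the data next to the test" pattern discussed above can be sketched without any DataFusion dependency. This is a hedged, simplified illustration, not the PR's actual code: in the real helper, the inline CSV string would be handed to `register_tpch_csv_data(ctx, table_name, data)` together with a `SessionContext` so the table can be queried with SQL; here a hypothetical `parse_inline_csv` stands in to show the shape of the idea.

```rust
// Minimal, dependency-free sketch of keeping each test's data inline with
// the test, instead of sharing one big checked-in CSV across all tests.
// `parse_inline_csv` is a hypothetical stand-in for the PR's
// `register_tpch_csv_data` helper.

fn parse_inline_csv(data: &str) -> Vec<Vec<String>> {
    data.lines()
        .map(|line| line.trim())
        .filter(|line| !line.is_empty())
        .map(|line| {
            line.split(',')
                .map(|field| field.trim().to_string())
                .collect()
        })
        .collect()
}

fn main() {
    // Each test owns a tiny, purpose-built data set, so changing it cannot
    // break the expected results of unrelated tests.
    let orders = "
        1,alice,100
        2,bob,250
    ";
    let rows = parse_inline_csv(orders);
    assert_eq!(rows.len(), 2);
    assert_eq!(rows[1][2], "250");
    println!("rows parsed: {}", rows.len());
}
```

The trade-off the comment describes follows directly: with shared fixtures, adding one row to satisfy a new test silently changes the expected aggregates of every existing test that reads the same file; with inline data, each expected result is derivable by eye from the literal a few lines above it.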