alamb commented on issue #7698: URL: https://github.com/apache/arrow-datafusion/issues/7698#issuecomment-1808638503
> For instance, in our case the schema changes much more rarely than the data, and we can cache it for a long period of time

We do something similar to this in IOx (cache schemas that we know don't change rather than recomputing them).

It is my opinion that in order to make DFSchema behave well and not be a bottleneck, we will need to restructure how it works more fundamentally. Right now the amount of copying required is substantial, as has been pointed out several times on this thread. I think with sufficient diligence we could avoid almost all copies when manipulating DFSchema, and then the extra complexity of adding a cache or other techniques would become unnecessary.

> I've additionally changed the type of the precomputed qualified_name from String to Arc<String> after the above tests. Total planning time reduced to 75% of the previous iteration. But I think it is still far from optimal.

I think this is a great idea. Optimizing for the case of the same, reused qualifier is a very good idea.

What do people think about the approach described in https://github.com/apache/arrow-datafusion/pull/7944? I (admittedly biasedly) think that approach would eliminate almost all allocations (instead they would be ref count updates). We can extend it to incorporate ideas like pre-caching qualified names and hash sets for column checks, and I think it could be pretty fast.
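To make the ref-counting idea concrete, here is a minimal, hypothetical sketch (the `QualifiedField` type and its fields are invented for illustration, not the actual DFSchema/DFField code) of how sharing the qualifier and a precomputed qualified name via `Arc<str>` turns field clones into reference-count bumps instead of string allocations:

```rust
use std::sync::Arc;

/// Hypothetical, simplified stand-in for a field with a table qualifier.
#[derive(Clone, Debug)]
struct QualifiedField {
    /// Shared qualifier: cloning the field only bumps a reference count
    /// instead of copying the string.
    qualifier: Option<Arc<str>>,
    name: Arc<str>,
    /// Precomputed "qualifier.name", computed once and then shared by
    /// every clone made during plan rewrites.
    qualified_name: Arc<str>,
}

impl QualifiedField {
    fn new(qualifier: Option<&str>, name: &str) -> Self {
        let qualifier: Option<Arc<str>> = qualifier.map(Arc::from);
        let name: Arc<str> = Arc::from(name);
        let qualified_name: Arc<str> = match &qualifier {
            Some(q) => Arc::from(format!("{q}.{name}")),
            None => Arc::clone(&name),
        };
        Self { qualifier, name, qualified_name }
    }
}

fn main() {
    let field = QualifiedField::new(Some("my_table"), "col_a");

    // Cloning no longer allocates new Strings: each clone shares the
    // same underlying buffers with the original.
    let copy = field.clone();
    assert!(Arc::ptr_eq(&field.qualified_name, &copy.qualified_name));
    println!("{}", copy.qualified_name);
}
```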
