Julian,

How do you plan to use Gandiva in Apache Calcite?

On Tue, Apr 20, 2021 at 9:57 PM Julian Hyde <jh...@apache.org> wrote:
> We would love to use Gandiva in Apache Calcite [1], but we are blocked
> because the JAR on Maven Central doesn't work on macOS, Linux or
> Windows [2], and there seems to be no interest in fixing the problem.
> So I doubt whether anyone is using Gandiva in production (unless they
> have built the artifacts for themselves).
>
> Once Gandiva is working for us, we will have an opinion about caching.
>
> Julian
>
> [1] https://issues.apache.org/jira/browse/CALCITE-2040
> [2] https://issues.apache.org/jira/browse/ARROW-11135
>
> On Tue, Apr 20, 2021 at 2:58 AM Vivekanand Vellanki <vi...@dremio.com> wrote:
> >
> > We are considering using an on-disk cache - this is planned for later.
> > Even with an on-disk cache, we still need an eviction policy to ensure
> > that Gandiva doesn't use up the entire disk.
> >
> > For now, we are assuming that we can measure the cost accurately - the
> > assumption is that the query engine would use Gandiva on a thread that
> > is pinned to a core. For other engines, an alternative estimate of the
> > cost could be the complexity of the expression.
> >
> > On Tue, Apr 20, 2021 at 2:46 PM Antoine Pitrou <anto...@python.org> wrote:
> > >
> > > Hi Projjal,
> > >
> > > The main issue here is to compute the cost accurately (is it
> > > computation runtime? memory footprint? can you measure the
> > > computation time accurately, regardless of system noise - e.g. other
> > > threads and processes?).
> > >
> > > Intuitively, if the LRU cache shows too many misses, a simple measure
> > > is to increase its size ;-)
> > >
> > > Last question: have you considered a second-level on-disk cache?
> > > Numba uses such a cache with good results:
> > > https://numba.readthedocs.io/en/stable/developer/caching.html
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > > On 20/04/2021 at 06:28, Projjal Chanda wrote:
> > > > Hi,
> > > > We currently have a cache [1] in Gandiva that caches the built
> > > > projector or filter module with an LRU-based eviction policy.
> > > > However, since the cost of building different expressions is not
> > > > uniform, it makes sense to have a different eviction policy that
> > > > takes into account the cost associated with a cache miss (while
> > > > also discounting items which have not been used recently). We are
> > > > planning to use the GreedyDual-Size algorithm [2], which seems fit
> > > > for the purpose. The algorithm is quite simple:
> > > > - Each item has a cost (build time in our case), and the item with
> > > >   the lowest cost (c_min) is evicted; all other items' costs are
> > > >   reduced by c_min.
> > > > - On a cache hit, the item's cost is restored to its original value.
> > > >
> > > > This can be implemented using a priority queue, and an efficient
> > > > implementation can handle both cache hits and evictions in
> > > > O(log k) time.
> > > >
> > > > Does anybody have any other suggestions or ideas on this?
> > > >
> > > > [1] https://github.com/apache/arrow/blob/master/cpp/src/gandiva/cache.h
> > > > [2] https://www.usenix.org/legacy/publications/library/proceedings/usits97/full_papers/cao/cao_html/node8.html
> > > >
> > > > Regards,
> > > > Projjal
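
For anyone who wants to see the mechanics of the GreedyDual-Size policy
Projjal described, here is a rough C++ sketch. The class and member names
are illustrative only, not Gandiva's actual cache.h API. It uses the usual
"inflation offset" trick, which is equivalent to deducting c_min from every
remaining entry on eviction but keeps both hits and evictions at O(log k):

// Hypothetical sketch of a GreedyDual-Size cache (not Gandiva's cache.h).
// Priority of an entry = inflation_ + cost; evict the lowest priority and
// raise inflation_ to that value instead of touching every other entry.
#include <cstddef>
#include <map>
#include <optional>
#include <unordered_map>
#include <utility>

template <typename Key, typename Value>
class GreedyDualSizeCache {
 public:
  explicit GreedyDualSizeCache(std::size_t capacity) : capacity_(capacity) {}

  // cost: e.g. the measured build time of the projector/filter module.
  void Put(const Key& key, Value value, double cost) {
    auto it = entries_.find(key);
    if (it != entries_.end()) {
      queue_.erase(it->second.queue_pos);
      entries_.erase(it);
    }
    if (entries_.size() >= capacity_) Evict();
    auto qit = queue_.emplace(inflation_ + cost, key);
    entries_.emplace(key, Entry{std::move(value), cost, qit});
  }

  std::optional<Value> Get(const Key& key) {
    auto it = entries_.find(key);
    if (it == entries_.end()) return std::nullopt;
    // On a hit, restore the entry's priority relative to the current offset,
    // i.e. "restore the item's cost to its original value".
    queue_.erase(it->second.queue_pos);
    it->second.queue_pos = queue_.emplace(inflation_ + it->second.cost, key);
    return it->second.value;
  }

 private:
  struct Entry {
    Value value;
    double cost;
    typename std::multimap<double, Key>::iterator queue_pos;
  };

  void Evict() {
    auto victim = queue_.begin();  // lowest priority == cheapest to rebuild
    inflation_ = victim->first;    // same effect as subtracting c_min from all
    entries_.erase(victim->second);
    queue_.erase(victim);
  }

  std::size_t capacity_;
  double inflation_ = 0.0;                // running offset L
  std::multimap<double, Key> queue_;      // priority -> key, O(log k) updates
  std::unordered_map<Key, Entry> entries_;
};

The offset records the priority of the last victim, so giving a hit the
priority inflation_ + cost reinstates its full cost relative to everything
still cached; a query engine would key such a cache by expression signature
and store the compiled module (or a shared pointer to it) as the value.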