Re: [Gandiva] Replacing the LRU cache in gandiva

Projjal Chanda Wed, 21 Apr 2021 00:35:40 -0700

Hi Julian,
I replied on the ARROW Jira. It seems the Gandiva jar pushed to Maven Central
during a release is built differently from the nightly Gandiva jars, which
explains the protobuf linking issue and why it only works on macOS. We need a
nightly build that produces a single jar with all supported native libraries
(instead of the separate Linux and macOS jars we build today), and we need to
push that jar during the release process. We will work on this.


Regards,
Projjal

> On 20-Apr-2021, at 9:57 PM, Julian Hyde <jh...@apache.org> wrote:
> 
> We would love to use Gandiva in Apache Calcite [1] but we are blocked
> because the JAR on Maven Central doesn't work on macOS, Linux or
> Windows [2] and there seems to be no interest in fixing the problem.
> So I doubt whether anyone is using Gandiva in production (unless they
> have built the artifacts for themselves).
> 
> Once Gandiva is working for us we will have an opinion about caching.
> 
> Julian
> 
> [1] https://issues.apache.org/jira/browse/CALCITE-2040
> 
> [2] https://issues.apache.org/jira/browse/ARROW-11135
> 
> On Tue, Apr 20, 2021 at 2:58 AM Vivekanand Vellanki <vi...@dremio.com> wrote:
>> 
>> We are considering using an on-disk cache, but this is planned for later.
>> Even with an on-disk cache, we still need an eviction policy to ensure that
>> Gandiva doesn't use up the entire disk.
>> 
>> For now, we are assuming that we can measure the cost accurately - the
>> assumption is that the query engine would use Gandiva on a thread that is
>> pinned to a core. For other engines, an alternate estimate of cost can be
>> the complexity of the expression.
>> 
>> On Tue, Apr 20, 2021 at 2:46 PM Antoine Pitrou <anto...@python.org> wrote:
>> 
>>> 
>>> Hi Projjal,
>>> 
>>> The main issue here is to compute the cost accurately (is it computation
>>> runtime? memory footprint? can you measure the computation time
>>> accurately, regardless of system noise - e.g. other threads and
>>> processes?).
>>> 
>>> Intuitively, if the LRU cache shows too many misses, a simple measure is
>>> to increase its size ;-)
>>> 
>>> Last question: have you considered a second level on-disk cache?  Numba
>>> uses such a cache with good results:
>>> https://numba.readthedocs.io/en/stable/developer/caching.html
>>> 
>>> Regards
>>> 
>>> Antoine.
>>> 
>>> 
>>> Le 20/04/2021 à 06:28, Projjal Chanda a écrit :
>>>> Hi,
>>>> We currently have a cache [1] in Gandiva that caches the built projector
>>>> or filter module with an LRU-based eviction policy. However, since the
>>>> cost of building different expressions is not uniform, it makes sense to
>>>> use an eviction policy that takes into account the cost of a cache miss
>>>> (while still discounting items that have not been used recently). We are
>>>> planning to use the GreedyDual-Size algorithm [2], which seems fit for the
>>>> purpose. The algorithm is quite simple:
>>>> - Each item has a cost (build time in our case), and the item with the
>>>>   lowest cost (c_min) is evicted. The cost of every other item is then
>>>>   reduced by c_min.
>>>> - On a cache hit, the item's cost is restored to its original value.
>>>>
>>>> This can be implemented using a priority queue, and an efficient
>>>> implementation can handle both a cache hit and an eviction in O(log k)
>>>> time.
>>>> 
>>>> Does anybody have any other suggestions or ideas on this?
>>>> 
>>>> [1] https://github.com/apache/arrow/blob/master/cpp/src/gandiva/cache.h
>>>> [2] https://www.usenix.org/legacy/publications/library/proceedings/usits97/full_papers/cao/cao_html/node8.html
>>>>
>>>> 
>>>> Regards,
>>>> Projjal
>>>> 
>>>
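
For reference, the eviction policy described in the original proposal quoted
above could be sketched roughly as follows. This is only an illustration in
Python, not the Gandiva implementation (which is C++): the class name
GreedyDualCache, its get/put methods and the "inflation" field are all
hypothetical, every entry is assumed to occupy one slot, and the cost is
whatever number the caller supplies (for example a measured build time).
Rather than subtracting c_min from every remaining item on each eviction, the
sketch keeps a running offset and adds it to the priority of items that are
inserted or hit, which preserves the same ordering.

```python
import heapq
import itertools


class GreedyDualCache:
    """Illustrative cost-aware cache; names and API are assumptions."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.inflation = 0.0   # running total of the c_min values "deducted"
        self.entries = {}      # key -> (value, cost, seq of the live heap record)
        self.heap = []         # (priority, seq, key); may contain stale records
        self.counter = itertools.count()  # unique seq, also breaks priority ties

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None:
            return None        # miss: the caller builds the item and calls put()
        value, cost, _ = entry
        # Hit: restore the item's full cost relative to the current offset.
        self._push(key, value, cost)
        return value

    def put(self, key, value, cost):
        if key not in self.entries and len(self.entries) >= self.capacity:
            self._evict()
        self._push(key, value, cost)

    def _push(self, key, value, cost):
        seq = next(self.counter)
        self.entries[key] = (value, cost, seq)
        heapq.heappush(self.heap, (self.inflation + cost, seq, key))
        # Hits leave superseded records behind; drop them once they pile up.
        if len(self.heap) > 4 * self.capacity:
            self._compact()

    def _evict(self):
        while self.heap:
            priority, seq, key = heapq.heappop(self.heap)
            entry = self.entries.get(key)
            if entry is not None and entry[2] == seq:
                del self.entries[key]
                # Equivalent to deducting c_min from every remaining item.
                self.inflation = priority
                return key
            # Otherwise the record was superseded by a later hit or put.
        return None

    def _compact(self):
        live = {key: seq for key, (_, _, seq) in self.entries.items()}
        self.heap = [rec for rec in self.heap if live.get(rec[2]) == rec[1]]
        heapq.heapify(self.heap)
```

With this structure an expensive-to-build item survives several evictions of
cheaper items, but it still ages out if it is never hit again, because the
offset keeps growing while its stored priority does not. The lazy deletion of
stale heap records is just one way to keep hits at O(log k); an indexed heap
with a decrease/increase-key operation would be another.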
