Hello David Knupp, Dan Hecht,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/9626

to look at the new patch set (#3).

Change subject: IMPALA-6488: removes use-after-free bug in lib_cache
......................................................................

IMPALA-6488: removes use-after-free bug in lib_cache

Several recent runs have resulted in a boost mutex
invalid argument exception. The mutex in question is
the one that guards individual lib_cache entries
(LibCacheEntry::lock).

The exception is thrown because the entry is deleted by
another thread before the current thread acquires
the entry lock. This scenario happens when:
1) thread t1 looks up an existing entry e
   a. the lookup is guarded by a cache lock
   b. the cache lock is released (L356)
   c. e's lock is acquired on the next line and propagated
      up the call stack (look for the swaps)
   d. while e is locked, its use-count is incremented.
2) thread t2 deletes entry e
   a. the cache lock is acquired and e is looked up
   b. e's lock is acquired
   c. if e's use-count is 0 and e was marked for removal, e is deleted.

If t2 runs following (1b), then t1 will acquire a lock on a deleted
entry (1c), causing the mutex exception.
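The racy sequence above can be sketched as follows. This is an illustrative reconstruction, not the actual lib-cache code: the names (Entry, GetEntryBuggy, etc.) and the std::mutex types are assumptions standing in for the real implementation.

```cpp
#include <map>
#include <mutex>
#include <string>

// Illustrative stand-in for LibCacheEntry.
struct Entry {
  std::mutex lock;      // per-entry lock (LibCacheEntry::lock)
  int use_count = 0;
};

std::mutex cache_lock;                  // guards the whole cache map
std::map<std::string, Entry*> cache;

// Buggy pattern: the cache lock is dropped before the entry lock is
// taken, leaving a window (between 1b and 1c) in which another thread
// can delete the entry out from under us.
Entry* GetEntryBuggy(const std::string& key) {
  Entry* e = nullptr;
  {
    std::lock_guard<std::mutex> cl(cache_lock);  // 1a: lookup under cache lock
    auto it = cache.find(key);
    if (it == cache.end()) return nullptr;
    e = it->second;
  }                 // 1b: cache lock released here...
  e->lock.lock();   // 1c: ...so this may lock an already-deleted entry
  ++e->use_count;   // 1d: use-count bumped while e is locked
  e->lock.unlock();
  return e;
}
```

Single-threaded, this works fine; the bug only surfaces when a deleting thread (t2) wins the window between 1b and 1c.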

There are two parts to the fix in this change:
(1) don't crash and (2) don't let concurrency regress.

1) remove 1b: keep the cache lock while acquiring e's lock.
The cache lock is then released and all other operations proceed as before.
Note that the current lock ordering (coarse to fine) is still maintained,
but now the cache lock can be held while waiting on an entry lock
held by another thread. Once files have been copied, these operations
are short; while files are loading, accesses to the same entry
can serialize access to the entire cache.
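A minimal sketch of fix (1), with the same illustrative (not actual) names as the lib-cache:

```cpp
#include <map>
#include <mutex>
#include <string>

// Illustrative stand-in for LibCacheEntry.
struct Entry {
  std::mutex lock;
  int use_count = 0;
};

std::mutex cache_lock;
std::map<std::string, Entry*> cache;

// Fixed pattern: the entry lock is acquired while the cache lock is
// still held, so no other thread can delete the entry in between.
Entry* GetEntryFixed(const std::string& key) {
  std::unique_lock<std::mutex> cl(cache_lock);
  auto it = cache.find(key);
  if (it == cache.end()) return nullptr;
  Entry* e = it->second;
  e->lock.lock();   // coarse-to-fine: entry lock taken under the cache lock
  cl.unlock();      // cache lock released only after e is pinned
  ++e->use_count;
  e->lock.unlock();
  return e;
}
```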
2) add the cache entry only after it is loaded.
Currently, on a cache miss, a new entry is created, locked,
added to the cache, and finally the cache is unlocked. The intent
is to allow other threads to concurrently load files for other entries.
However, now that the cache lock is held while acquiring the entry lock,
the loading thread will block other threads from acquiring the same entry,
which in turn blocks access to the cache. The workaround is to release
the cache lock when loading a new entry and add the entry to the cache only
once it is loaded. The workaround avoids expensive work while holding the
cache lock, but may do wasted work if multiple threads miss on the
same entry and load it in parallel.
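The load-then-insert workaround might look like the following sketch (illustrative names only; LoadEntry stands in for the expensive file copy/load):

```cpp
#include <map>
#include <memory>
#include <mutex>
#include <string>

// Illustrative stand-in for a loaded lib-cache entry.
struct Entry {
  std::string lib;  // stands in for the loaded library contents
};

std::mutex cache_lock;
std::map<std::string, Entry*> cache;

Entry* LoadEntry(const std::string& key) {
  // Stands in for the expensive file load; runs with no locks held.
  return new Entry{key + ".so"};
}

Entry* GetOrLoad(const std::string& key) {
  {
    std::lock_guard<std::mutex> cl(cache_lock);
    auto it = cache.find(key);
    if (it != cache.end()) return it->second;  // hit: nothing to load
  }
  // Miss: load outside the cache lock so other threads are not blocked.
  std::unique_ptr<Entry> loaded(LoadEntry(key));
  std::lock_guard<std::mutex> cl(cache_lock);
  auto res = cache.emplace(key, loaded.get());
  if (res.second) loaded.release();  // our copy was inserted; keep it
  // else: another thread inserted the same key first; our load was
  // wasted work and the local copy is discarded.
  return res.first->second;
}
```

The trade-off matches the description above: no expensive work under the cache lock, at the cost of possibly loading the same entry twice under concurrent misses.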

Testing:
- ran core tests + hdfs
- added an end-to-end test that concurrently uses and drops/creates from
  the same lib-cache entry. With current settings, the use-after-free
  error is reliably reproduced.
- manual testing to examine concurrency of the following cases:
  - concurrent function creation from multiple lib files
    (stresses coordinator)
  - concurrent function use from multiple lib files
    (stresses backend)

Change-Id: I1f178a2114cb3969dbb920949ddff4e8220b742e
---
M be/src/runtime/lib-cache.cc
M be/src/runtime/lib-cache.h
M tests/query_test/test_udfs.py
3 files changed, 186 insertions(+), 72 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/9626/3
--
To view, visit http://gerrit.cloudera.org:8080/9626
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1f178a2114cb3969dbb920949ddff4e8220b742e
Gerrit-Change-Number: 9626
Gerrit-PatchSet: 3
Gerrit-Owner: Vuk Ercegovac <[email protected]>
Gerrit-Reviewer: Dan Hecht <[email protected]>
Gerrit-Reviewer: David Knupp <[email protected]>
Gerrit-Reviewer: Vuk Ercegovac <[email protected]>
