[ 
https://issues.apache.org/jira/browse/IMPALA-7265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775404#comment-16775404
 ] 

ASF subversion and git services commented on IMPALA-7265:
---------------------------------------------------------

Commit dce82e4e018d1944ff19bb6f87139b51c1b0287e in impala's branch 
refs/heads/master from Joe McDonnell
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=dce82e4 ]

IMPALA-8178: Disable file handle cache for HDFS erasure coded files

Testing on an erasure coded minicluster has revealed that each
file handle for an erasure coded file uses about 3MB of native
memory. This shows up as "java.nio:type=BufferPool,name=direct"
in the /jmx endpoint (here showing the output when 608 handles
are open):

{
  "name": "java.nio:type=BufferPool,name=direct",
  "modelerType": "sun.management.ManagementFactoryHelper$1",
  "Name": "direct",
  "TotalCapacity": 1921048960,
  "MemoryUsed": 1921048961,
  "Count": 633,
  "ObjectName": "java.nio:type=BufferPool,name=direct"
}
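
As a rough check of the "about 3MB" figure, the reported capacity can be
divided by the number of open handles. This is a minimal sketch, not part
of the commit: the helper name bytes_per_handle is an assumption made for
illustration, and the calculation attributes the whole direct buffer pool
to the 608 handles, which ignores any other direct buffers in the process.

```python
import json

# Fields copied from the /jmx output quoted in the commit message.
jmx_bean = json.loads("""
{
  "name": "java.nio:type=BufferPool,name=direct",
  "TotalCapacity": 1921048960,
  "MemoryUsed": 1921048961,
  "Count": 633
}
""")

# Number of open file handles stated in the commit message.
OPEN_HANDLES = 608


def bytes_per_handle(total_capacity: int, handles: int) -> float:
    """Hypothetical helper: average direct-buffer bytes per open handle."""
    return total_capacity / handles


per_handle = bytes_per_handle(jmx_bean["TotalCapacity"], OPEN_HANDLES)
per_handle_mib = per_handle / (1024 * 1024)

print(f"{per_handle:.0f} bytes per handle")
print(f"{per_handle_mib:.2f} MiB per handle")
```

Under that assumption the result is a little over 3 MiB per handle, in
line with the estimate in the commit message.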

The memory is not released or reduced by a call to unbuffer(),
so these file handles are not suitable for long term caching.
HDFS-14308 tracks the implementation of unbuffer() for
DFSStripedInputStream. This issue showed up when remote
file handle caching was enabled in IMPALA-7265, as erasure
coded files are always scheduled to be remote (IMPALA-7019).

This disables file handle caching for erasure coded files,
which requires plumbing through the information about which
ScanRanges are accessing erasure coded files.
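
The decision described above can be sketched as follows. This is only an
illustration in Python, not Impala's actual C++ code: the ScanRange class
here, its is_erasure_coded field, and should_cache_file_handle are all
hypothetical names chosen to show how a per-scan-range flag could gate
the cache.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanRange:
    """Hypothetical stand-in for a scan range; not Impala's real class."""
    path: str
    is_erasure_coded: bool  # assumed flag plumbed through from file metadata


def should_cache_file_handle(scan_range: ScanRange, cache_enabled: bool = True) -> bool:
    """Return True only if the cache is enabled and the file is not erasure coded."""
    return cache_enabled and not scan_range.is_erasure_coded


# Hypothetical paths, used only to exercise the function.
replicated = ScanRange(path="/warehouse/t1/part-0.parq", is_erasure_coded=False)
erasure_coded = ScanRange(path="/warehouse/t2/part-0.parq", is_erasure_coded=True)

print(should_cache_file_handle(replicated))
print(should_cache_file_handle(erasure_coded))
```

With a flag like this carried on each scan range, the handle for an
erasure coded file would be opened and closed per use instead of being
held in the cache.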

With this change, core tests pass on an erasure coded system.

Change-Id: I8c761e08aacc952de0033a4c91e07f15c8ec96da
Reviewed-on: http://gerrit.cloudera.org:8080/12552
Reviewed-by: Joe McDonnell <joemcdonn...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>


> Cache remote file handles
> -------------------------
>
>                 Key: IMPALA-7265
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7265
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>    Affects Versions: Impala 3.1.0
>            Reporter: Joe McDonnell
>            Assignee: Joe McDonnell
>            Priority: Critical
>             Fix For: Impala 3.2.0
>
>
> The file handle cache currently does not allow caching remote file handles. 
> This means that clusters that have a lot of remote reads can suffer from 
> overloading the NameNode. Impala should be able to cache remote file handles.
> There are some open questions about remote file handles and whether they 
> behave differently from local file handles. In particular:
>  # Is there any resource constraint on the number of remote file handles 
> open? (e.g. do they maintain a network connection?)
>  # Are there any semantic differences in how remote file handles behave when 
> files are deleted, overwritten, or appended?
>  # Are there any extra failure cases for remote file handles? (e.g. if a 
> machine goes down or a remote file handle is left open for an extended period 
> of time)
> The form of caching will depend on the answers, but at the very least, it 
> should be possible to cache a remote file handle at the level of a query so 
> that a Parquet file with multiple columns can share file handles.


