[ 
https://issues.apache.org/jira/browse/HDFS-6698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14192993#comment-14192993
 ] 

Lars Hofhansl commented on HDFS-6698:
-------------------------------------

Now... I am not saying that we do not have work to do in HBase:
* we're using one reader per HFile
* after a major compaction we have a single store file per column family (that 
file can be up to 20GB in size)
* we allow only one thread at a time to use seek+read on that reader; other 
concurrent scanners fall back to pread (see HBASE-7336).
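To make the third point concrete, here is a minimal sketch (not the actual HBASE-7336 code; class and method names are made up for illustration) of the fallback pattern: a scanner that cannot acquire the shared stream's lock uses the stateless positional-read path instead of blocking.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: one reader shared by many scanners. Only one
// scanner at a time may use the stateful seek+read path; the rest
// fall back to pread rather than queueing on the lock.
class SharedReaderSketch {
    private final ReentrantLock streamLock = new ReentrantLock();

    int read(long pos, byte[] buf, int off, int len) {
        if (streamLock.tryLock()) {
            try {
                // stateful path: seeks the shared stream position
                return seekAndRead(pos, buf, off, len);
            } finally {
                streamLock.unlock();
            }
        }
        // lock held by another scanner: stateless positional read
        return pread(pos, buf, off, len);
    }

    private int seekAndRead(long pos, byte[] buf, int off, int len) {
        // placeholder for: seek(pos); read(buf, off, len)
        return len;
    }

    private int pread(long pos, byte[] buf, int off, int len) {
        // placeholder for a positional read that touches no shared state
        return len;
    }
}
```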

For my test I did this:
* my test table had 2^25 (~32m) rows, in two regions, about 1GB on disk
* I tested this with Phoenix, which can break a query into parts and execute 
scans for the parts (that's where the parallel scanning on the same readers 
comes into play)
* I have short circuit reading enabled
* all data in the OS cache (HBase block cache not used)

This is not an uncommon scenario, though: the original poster cited 
scans (seek+read) + gets (pread) as a problem.

In either case, I'll post an updated patch to HDFS-6735 and we can take it from 
there.


> try to optimize DFSInputStream.getFileLength()
> ----------------------------------------------
>
>                 Key: HDFS-6698
>                 URL: https://issues.apache.org/jira/browse/HDFS-6698
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>    Affects Versions: 3.0.0
>            Reporter: Liang Xie
>            Assignee: Liang Xie
>         Attachments: HDFS-6698.txt, HDFS-6698.txt, HDFS-6698v2.txt, 
> HDFS-6698v2.txt, HDFS-6698v3.txt
>
>
> HBase prefers to invoke read() when serving scan requests, and pread() when 
> serving get requests, because pread() holds almost no locks.
> Imagine there's a read() in flight. Because the method is declared:
> {code}
> public synchronized int read
> {code}
> no other read() call can run concurrently. That much is expected, but 
> pread() also cannot run...  because:
> {code}
>   public int read(long position, byte[] buffer, int offset, int length)
>     throws IOException {
>     // sanity checks
>     dfsClient.checkOpen();
>     if (closed) {
>       throw new IOException("Stream closed");
>     }
>     failures = 0;
>     long filelen = getFileLength();
> {code}
> getFileLength() also needs the lock, so we need to figure out a lock-free 
> implementation of getFileLength() before the HBase multi-stream feature is done. 
> [[email protected]]
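One common shape for the lock-free getFileLength() the description asks for is to publish the length through a volatile field instead of relying on the stream's monitor. This is only a sketch of that idea under assumed names (it is not the HDFS-6698/HDFS-6735 patch itself):

```java
// Hypothetical sketch: writers update the length while holding the
// stream lock; readers (e.g. pread's sanity check) see a consistent
// value via the volatile field without acquiring the monitor that a
// concurrent synchronized read() holds.
class LockFreeLengthSketch {
    private volatile long fileLength;

    // called by the writer path, under the stream lock
    void updateLength(long newLength) {
        fileLength = newLength;
    }

    // safe to call from pread() without the stream lock
    long getFileLength() {
        return fileLength;
    }
}
```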



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
