GitHub user helifu opened a pull request:

    https://github.com/apache/incubator-impala/pull/6

    Branch 2.10.0

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/helifu/incubator-impala branch-2.10.0

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-impala/pull/6.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #6
    
----
commit 8049f811379c6f316520934fa7c495a4fc54d45d
Author: Taras Bobrovytsky <[email protected]>
Date:   2017-06-01T00:53:01Z

    Update VERSION to 2.9.0 to begin release candidate testing
    
    Change-Id: I88b03479ae1d73afc9e3f5883ee09ae2f9bcfe09

commit 4086f2c84de754d0a4a0ea87c0ee49b7e6eb469f
Author: Sailesh Mukil <[email protected]>
Date:   2017-04-11T00:08:01Z

    IMPALA-5333: Add support for Impala to work with ADLS
    
    This patch leverages the AdlFileSystem in Hadoop to allow
    Impala to talk to the Azure Data Lake Store. This patch has
    functional changes as well as adds test infrastructure for
    testing Impala over ADLS.
    
    We do not support ACLs on ADLS since the Hadoop ADLS
    connector does not integrate ADLS ACLs with Hadoop users/groups.
    
    For testing, we use the azure-data-lake-store-python client
    from Microsoft. This client seems to have some consistency
    issues. For example, a drop table through Impala will delete
    the files in ADLS; however, listing that directory through
    the Python client immediately after the drop will still show
    the files. This behavior is unexpected since ADLS claims to be
    strongly consistent. Some tests have been skipped due to this
    limitation with the tag SkipIfADLS.slow_client. Tracked by
    IMPALA-5335.
    
    The azure-data-lake-store-python client also only works on CentOS 6.6
    and above, so the Python dependencies for Azure will not be downloaded
    when the TARGET_FILESYSTEM is not "adls". When running ADLS tests,
    the expectation is that they run on a machine that is at least
    running CentOS 6.6.
    Note: This is only a test limitation, not a functional one. Clusters
    with older OSes like CentOS 6.4 will still work with ADLS.
    
    Added another dependency to bootstrap_build.sh for the ADLS Python
    client.
    
    Testing: Ran core tests with and without TARGET_FILESYSTEM as
    'adls' to make sure that all tests pass and that nothing breaks.
    
    Change-Id: Ic56b9988b32a330443f24c44f9cb2c80842f7542
    Reviewed-on: http://gerrit.cloudera.org:8080/6910
    Tested-by: Impala Public Jenkins
    Reviewed-by: Sailesh Mukil <[email protected]>

commit 2ffc86a5b218035cc42fa220f4d33a92b29d3fa6
Author: Sailesh Mukil <[email protected]>
Date:   2017-05-26T00:58:33Z

    IMPALA-5375: Builds on CentOS 6.4 failing with broken python dependencies
    
    Builds on CentOS 6.4 fail due to dependencies not met for the new
    'cryptography' python package.
    
    The ADLS commit states that the new packages are only required for ADLS
    and that ADLS on a dev environment is only supported from CentOS 6.7.
    
    This patch moves the compiled requirements for ADLS from
    compiled-requirements.txt to adls-requirements.txt and passes a
    compiler to the pip environment while installing the ADLS
    requirements.
    
    Testing: Tested on a machine with TARGET_FILESYSTEM='adls'
    and also on a CentOS 6.4 machine with the default
    configuration.
    
    Change-Id: I7d456a861a85edfcad55236aa8b0dbac2ff6fc78
    Reviewed-on: http://gerrit.cloudera.org:8080/6998
    Reviewed-by: Tim Armstrong <[email protected]>
    Tested-by: Impala Public Jenkins

commit 117fc388bff2a754be081eae7667627f84f1b33c
Author: Sailesh Mukil <[email protected]>
Date:   2017-05-30T18:56:43Z

    IMPALA-5383: Fix PARQUET_FILE_SIZE option for ADLS
    
    The PARQUET_FILE_SIZE query option doesn't work with ADLS because the
    AdlFileSystem doesn't have a notion of block sizes, and Impala depends
    on the filesystem remembering the block size, which is then used as the
    target Parquet file size (this is done for HDFS so that the Parquet
    file size and block size match even if the PARQUET_FILE_SIZE isn't a
    valid block size).
    
    We special-case ADLS, just like we do for S3, to bypass the
    FileSystem block size and instead use the requested
    PARQUET_FILE_SIZE as the output partition's block size (and
    consequently the target Parquet file size).
    
    Testing: Re-enabled test_insert_parquet_verify_size() for ADLS.
    
    Also fixed a miscellaneous bug with the ADLS client listing helper function.
    
    Change-Id: I474a913b0ff9b2709f397702b58cb1c74251c25b
    Reviewed-on: http://gerrit.cloudera.org:8080/7018
    Reviewed-by: Sailesh Mukil <[email protected]>
    Tested-by: Impala Public Jenkins

commit b8558506957dbf44b8ceb29c8b7382bfd8180e05
Author: Sailesh Mukil <[email protected]>
Date:   2017-05-30T19:50:13Z

    IMPALA-5378: Disk IO manager needs to understand ADLS
    
    The Disk IO Manager had customized support for S3 and remote HDFS that
    allowed them to use a separate queue and a customized number
    of IO threads. ADLS did not have this support.
    
    Based on the code in DiskIoMgr::Init and DiskIoMgr::AssignQueue, IOs
    for ADLS were previously put in local disk queues. Since local disks
    are considered rotational unless we can confirm otherwise by looking at
    the /sys filesystem, this means that THREADS_PER_ROTATIONAL_DISK=1 was
    being applied as the thread count.
    
    This patch adds customized support for ADLS, similar to how it was done
    for S3. We set 16 threads as the default number of IO threads for ADLS.
    For smaller clusters, setting a higher number like 64 would work better.
    We keep the thread count at a lower default of 16 because there is an
    undocumented per-cluster concurrency limit, around 500-700 connections,
    so on larger clusters higher thread-level parallelism would hurt
    node-level parallelism.
    
    We also set the default maximum chunk size for ADLS to 128k. This is
    because direct reads aren't supported for ADLS, so the JNI array
    allocation and the memcpy add significant overhead for larger buffers.
    128k was chosen empirically for S3 for the same reason, and since that
    reason also holds for ADLS, we keep the same value. A new flag,
    FLAGS_adls_read_chunk_size, is used to control this value.
    
    TODO: Settle on the optimal buffer size empirically.
    
    Change-Id: I067f053fec941e3631610c5cc89a384f257ba906
    Reviewed-on: http://gerrit.cloudera.org:8080/7033
    Reviewed-by: Sailesh Mukil <[email protected]>
    Tested-by: Impala Public Jenkins

commit 5141a10ee1f71945dd5d15000796b7cf717e7928
Author: hzhelifu <[email protected]>
Date:   2017-08-28T07:05:37Z

    Support runtime filters for Kudu.

commit b966906a442c4da9b039c70d96aaee9e39ee37dd
Author: hzhelifu <[email protected]>
Date:   2017-09-20T08:24:29Z

    Tests passed.

commit d0f6041997e1cd91162a527dc5d4c1f1ca526bb6
Author: hzhelifu <[email protected]>
Date:   2017-09-25T06:19:19Z

    Wait for runtime filters.

commit 2e8fe3b33081cdcea05cda1827360a4154e913a0
Author: hzhelifu <[email protected]>
Date:   2017-09-28T05:29:39Z

    Changes complete, but the performance is not good enough.

----

