[ 
https://issues.apache.org/jira/browse/BEAM-840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15652236#comment-15652236
 ] 

ASF GitHub Bot commented on BEAM-840:
-------------------------------------

GitHub user mizitch opened a pull request:

    https://github.com/apache/incubator-beam/pull/1327

    [BEAM-840] Some minor changes and fixes for sorter module. 

    Be sure to do all of the following to help us incorporate your contribution
    quickly and easily:
    
     - [x] Make sure the PR title is formatted like:
       `[BEAM-<Jira issue #>] Description of pull request`
     - [x] Make sure tests pass via `mvn clean verify`. (Even better, enable
           Travis-CI on your fork and ensure the whole test matrix passes).
     - [x] Replace `<Jira issue #>` in the title with the actual Jira issue
           number, if there is one.
     - [x] If this contribution is large, please file an Apache
           [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt).
    
    ---
    Includes:
    * Limit max memory for ExternalSorter and BufferedExternalSorter to
      2047 MB to prevent int overflow within Hadoop's sorting library
    * Fix int overflow for large memory values in InMemorySorter
    * Add note about estimated disk use to README.MD
    * Make Hadoop's sorting library put all temp files under the
      specified directory
    * Have Hadoop clean up the temp directory on exit
    * Stop shading Hadoop dependencies. Some context:
    ** The existing shading is broken (modules that depend on this one
       cannot use it successfully).
    ** Hadoop uses reflection in several places, which makes shading the
       dependency "in a good way" nearly impossible. It requires a couple
       of rather brittle hacks, and for clients that depend on certain
       conflicting versions of Hadoop, those hacks can mean the shading
       fails at its intended goal of preventing conflicts anyway.
    ** From what I can tell, there's no good way to shade this to make it
       universally usable, so leaving it unshaded seems like a reasonable
       default.
    ** Without shading Hadoop, this module can be used successfully from
       Beam's wordcount example (which already has Hadoop dependencies
       of its own).
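
The 2047 MB cap in the first two bullets follows from 32-bit int arithmetic: 2047 MB expressed in bytes still fits in a Java `int`, while 2048 MB does not. The sketch below illustrates that arithmetic only. The class, method, and constant names are hypothetical and are not taken from the pull request or from Hadoop.

```java
// A minimal sketch of the overflow the 2047 MB cap guards against.
// All names here are hypothetical; this is not the code from the pull request.
final class MemoryLimitSketch {
  // Assumed cap, mirroring the 2047 MB limit described in the list above.
  static final int MAX_MEMORY_MB = 2047;

  // Int arithmetic: the product silently wraps once it exceeds Integer.MAX_VALUE.
  static int toBytesUnchecked(int memoryMb) {
    return memoryMb * 1024 * 1024;
  }

  // Long arithmetic: widening before multiplying avoids the wrap-around.
  static long toBytes(int memoryMb) {
    return memoryMb * 1024L * 1024L;
  }

  // Rejects values that a downstream int-based byte count could not represent.
  static int checkMemoryMb(int memoryMb) {
    if (memoryMb <= 0 || memoryMb > MAX_MEMORY_MB) {
      throw new IllegalArgumentException(
          "memoryMb must be in (0, " + MAX_MEMORY_MB + "], got " + memoryMb);
    }
    return memoryMb;
  }

  public static void main(String[] args) {
    System.out.println(toBytesUnchecked(2047)); // 2146435072
    System.out.println(toBytesUnchecked(2048)); // -2147483648
    System.out.println(toBytes(4096));          // 4294967296
  }
}
```

Validating the megabyte value up front, as `checkMemoryMb` does here, is one way to fail fast instead of handing a negative buffer size to a library that computes byte counts as ints.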

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mizitch/incubator-beam sorter-gcs

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-beam/pull/1327.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1327
    
----
commit d07c4ce9349abac4d0c53223072f1c84a1dc98c6
Author: Mitch Shanklin <mshank...@google.com>
Date:   2016-11-09T22:09:49Z

    Some minor changes and fixes for sorter module.

----


> Add Java SDK extension to support non-distributed sorting
> ---------------------------------------------------------
>
>                 Key: BEAM-840
>                 URL: https://issues.apache.org/jira/browse/BEAM-840
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-extensions
>    Affects Versions: 0.4.0-incubating
>            Reporter: Mitch Shanklin
>            Assignee: Mitch Shanklin
>            Priority: Minor
>
> Add an extension that provides a PTransform which performs local
> (non-distributed) sorting. It will sort in memory until the buffer is
> full, then flush to disk and use external sorting.
>
> Consumes a PCollection of KVs from primary key to an iterable of
> (secondary key, value) KVs and sorts each iterable. It would probably be
> applied after a GroupByKey. Uses coders to convert secondary keys and values
> into byte arrays and compares the secondary keys lexicographically.
> Uses Hadoop as the external sorting library.
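
The lexicographical comparison of encoded secondary keys described in the issue can be sketched as follows. This is an illustration under stated assumptions, not the extension's implementation: it assumes bytes are compared as unsigned values (the issue does not say), it uses UTF-8 string encoding as a stand-in for a Beam coder, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// A minimal sketch of sorting keys by their encoded bytes.
// Names are hypothetical; UTF-8 encoding stands in for a coder.
final class LexicographicSortSketch {
  // Compares byte arrays position by position, treating each byte as
  // unsigned (an assumption); on a shared prefix the shorter array sorts first.
  static final Comparator<byte[]> UNSIGNED_LEXICOGRAPHIC = (a, b) -> {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int cmp = Integer.compare(a[i] & 0xff, b[i] & 0xff);
      if (cmp != 0) {
        return cmp;
      }
    }
    return Integer.compare(a.length, b.length);
  };

  // Encodes each key to bytes, sorts the encoded forms, and decodes them back.
  static List<String> sortKeys(List<String> keys) {
    List<byte[]> encoded = new ArrayList<>();
    for (String key : keys) {
      encoded.add(key.getBytes(StandardCharsets.UTF_8));
    }
    encoded.sort(UNSIGNED_LEXICOGRAPHIC);
    List<String> sorted = new ArrayList<>();
    for (byte[] bytes : encoded) {
      sorted.add(new String(bytes, StandardCharsets.UTF_8));
    }
    return sorted;
  }
}
```

Because the order is defined on the encoded bytes, it matches the natural order of the keys only when the encoding preserves order. With this sketch, `"z"` sorts before `"\u00e9"` because the UTF-8 lead byte 0xC3 is larger than 0x7A when read as unsigned.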



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
