[ https://issues.apache.org/jira/browse/BEAM-840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15652236#comment-15652236 ]
ASF GitHub Bot commented on BEAM-840:
-------------------------------------

GitHub user mizitch opened a pull request:

    https://github.com/apache/incubator-beam/pull/1327

    [BEAM-840] Some minor changes and fixes for sorter module.

    Be sure to do all of the following to help us incorporate your
    contribution quickly and easily:

     - [x] Make sure the PR title is formatted like:
           `[BEAM-<Jira issue #>] Description of pull request`
     - [x] Make sure tests pass via `mvn clean verify`. (Even better,
           enable Travis-CI on your fork and ensure the whole test matrix
           passes).
     - [x] Replace `<Jira issue #>` in the title with the actual Jira
           issue number, if there is one.
     - [x] If this contribution is large, please file an Apache
           [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt).

    ---

    Includes:
      * Limit max memory for ExternalSorter and BufferedExternalSorter to
        2047 MB to prevent int overflow within Hadoop's sorting library
      * Fix int overflow for large memory values in InMemorySorter
      * Add note about estimated disk use to README.MD
      * Fix to make Hadoop's sorting library put all temp files under the
        specified directory
      * Have Hadoop clean up the temp directory on exit
      * Stop shading Hadoop dependencies. Some context:
        ** The existing shading is broken (modules that depend on this
           one cannot use it successfully).
        ** Hadoop's use of reflection in several places makes shading the
           dependency "in a good way" nearly impossible. It requires a
           couple of rather brittle hacks, and for clients that depend on
           certain conflicting versions of Hadoop, these hacks can mean
           the shading doesn't meet its intended goal of preventing
           conflicts anyway.
        ** From what I can tell, there's no good way to shade this to
           make it universally usable, so leaving it unshaded seems like
           a reasonable default.
        ** Without shading Hadoop, this module can be used successfully
           from Beam's wordcount example (which already has pre-existing
           Hadoop dependencies).
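For context on the 2047 MB limit, here is a minimal sketch of how a memory size given in megabytes can overflow once it is converted to a byte count held in a Java `int`. It assumes the conversion is a plain `int` multiplication; the pull request description does not say where exactly the overflow happens inside Hadoop's sorting library. The class and method names (`MemoryLimitSketch`, `toBytesUnchecked`, `toBytesChecked`) are hypothetical and are not taken from the sorter module.

```java
// Illustrative sketch only: these names are hypothetical and do not come from the sorter module.
final class MemoryLimitSketch {
  // Largest whole number of megabytes whose byte count still fits in a signed 32-bit int.
  static final int MAX_MEMORY_MB = 2047;

  /** Converts megabytes to bytes with plain int arithmetic, which wraps silently on overflow. */
  static int toBytesUnchecked(int memoryMb) {
    return memoryMb * 1024 * 1024;
  }

  /** Rejects values outside (0, MAX_MEMORY_MB] before converting to bytes. */
  static int toBytesChecked(int memoryMb) {
    if (memoryMb <= 0 || memoryMb > MAX_MEMORY_MB) {
      throw new IllegalArgumentException(
          "memoryMb must be greater than 0 and at most " + MAX_MEMORY_MB + ", got " + memoryMb);
    }
    // multiplyExact throws ArithmeticException instead of wrapping, as a second line of defense.
    return Math.multiplyExact(memoryMb, 1024 * 1024);
  }

  public static void main(String[] args) {
    System.out.println(toBytesUnchecked(2047)); // 2146435072, still below Integer.MAX_VALUE
    System.out.println(toBytesUnchecked(2048)); // -2147483648, wrapped around
  }
}
```

Since 2048 MB is exactly 2^31 bytes and `Integer.MAX_VALUE` is 2^31 - 1, 2047 is the largest whole-megabyte value that survives this kind of conversion, which matches the limit chosen in the pull request.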
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mizitch/incubator-beam sorter-gcs

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-beam/pull/1327.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1327

----
commit d07c4ce9349abac4d0c53223072f1c84a1dc98c6
Author: Mitch Shanklin <mshank...@google.com>
Date:   2016-11-09T22:09:49Z

    Some minor changes and fixes for sorter module.
----

> Add Java SDK extension to support non-distributed sorting
> ---------------------------------------------------------
>
>                 Key: BEAM-840
>                 URL: https://issues.apache.org/jira/browse/BEAM-840
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-extensions
>    Affects Versions: 0.4.0-incubating
>            Reporter: Mitch Shanklin
>            Assignee: Mitch Shanklin
>            Priority: Minor
>
> Add an extension that provides a PTransform which performs
> local (non-distributed) sorting. It will sort in memory until the buffer is
> full, then flush to disk and use external sorting.
>
> Consumes a PCollection of KVs from primary key to iterable of secondary key
> and value KVs and sorts the iterables. Would probably be called after a
> GroupByKey. Uses coders to convert secondary keys and values into byte arrays
> and does a lexicographical comparison on the secondary keys.
> Uses Hadoop as an external sorting library.
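The lexicographical comparison of encoded secondary keys mentioned in the issue description can be sketched roughly as follows. This is an illustration, not the extension's code: it assumes bytes are compared as unsigned values (the issue does not state this), it uses UTF-8 string encoding as a stand-in for a Beam coder, and the names `LexicographicBytesSketch`, `UNSIGNED_LEXICOGRAPHIC`, and `sortByEncodedKey` are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only: these names are hypothetical and do not come from the sorter module.
final class LexicographicBytesSketch {
  /** Orders byte arrays lexicographically, treating each byte as unsigned (0..255). */
  static final Comparator<byte[]> UNSIGNED_LEXICOGRAPHIC = (a, b) -> {
    int common = Math.min(a.length, b.length);
    for (int i = 0; i < common; i++) {
      int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
      if (cmp != 0) {
        return cmp;
      }
    }
    // All shared positions are equal, so the shorter array sorts first.
    return Integer.compare(a.length, b.length);
  };

  /** Encodes keys as UTF-8 (standing in for a coder), sorts the encoded bytes, then decodes. */
  static List<String> sortByEncodedKey(List<String> keys) {
    List<byte[]> encoded = new ArrayList<>();
    for (String key : keys) {
      encoded.add(key.getBytes(StandardCharsets.UTF_8));
    }
    encoded.sort(UNSIGNED_LEXICOGRAPHIC);
    List<String> sorted = new ArrayList<>();
    for (byte[] bytes : encoded) {
      sorted.add(new String(bytes, StandardCharsets.UTF_8));
    }
    return sorted;
  }

  public static void main(String[] args) {
    System.out.println(sortByEncodedKey(Arrays.asList("b", "a", "ab", ""))); // [, a, ab, b]
  }
}
```

One consequence of sorting on encoded bytes is that the resulting order is that of the byte representation, which need not match the natural order of the decoded values; it depends on the coder chosen for the secondary key.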