[
https://issues.apache.org/jira/browse/HADOOP-5705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702097#action_12702097
]
Dick King commented on HADOOP-5705:
-----------------------------------
No new tests are needed by this patch because it improves performance without
causing any changes to the behavior of the total ordering partitioner.
If you believe that tries are worthwhile at all, then you must believe that
it's worthwhile to run the tries as deeply as possible. Java's recursion stack
depth is 500 IIRC, so I set the depth limit to 200. I surmise that almost all
input split set adjacent element pairs differ before the 200th byte but that
cosuming 201 stack frames won't break very many contexts where this is called.
> Improved tries in TotalOrderPartitioner to eliminate large leaf nodes.
> ----------------------------------------------------------------------
>
> Key: HADOOP-5705
> URL: https://issues.apache.org/jira/browse/HADOOP-5705
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Affects Versions: 0.19.1
> Reporter: Dick King
> Assignee: Dick King
> Attachments: hadoop-5705-no-names.patch, hadoop-5705.patch
>
>
> With the old technology, if a particular node in the trie has many children
> that contain no split points, TotalOrderPartitioner creates a separate empty
> leaf node for each one. This takes a lot of space, which in turn limits the
> depth to which we can grow these trees to avoid making too many sparse nodes,
> so there is a parameter that defaults to 2 to control the depth. With this
> patch, I can guarantee that each split point will create only three trie
> nodes in the trie: an empty one, a leaf that contains only one split point,
> and the internal nodes as needed, for a total space of perhaps 50-200 bytes
> per split point, even if the entire trie is elaborated. There are
> pathological cases that can cause a recursion overflow during creation, so i
> have set the default tree max depth to 200, but I expect almost all tries to
> be fully elaborated, which means in turn that each byte of the sought key
> gets touched at most twice.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.