[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13497577#comment-13497577 ]
Hadoop QA commented on HBASE-4676:
----------------------------------
{color:red}-1 overall{color}. Here are the results of testing the latest
attachment
http://issues.apache.org/jira/secure/attachment/12553580/HBASE-4676-common-and-server-v8.patch
against trunk revision .
{color:green}+1 @author{color}. The patch does not contain any @author
tags.
{color:green}+1 tests included{color}. The patch appears to include 24 new
or modified tests.
{color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop
2.0 profile.
{color:red}-1 javadoc{color}. The javadoc tool appears to have generated
99 warning messages.
{color:green}+1 javac{color}. The applied patch does not increase the
total number of javac compiler warnings.
{color:red}-1 findbugs{color}. The patch appears to introduce 27 new
Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase
the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests:
org.apache.hadoop.hbase.TestCompare
org.apache.hadoop.hbase.io.hfile.TestHFileReaderV1
org.apache.hadoop.hbase.io.hfile.TestHFileBlockCompatibility
org.apache.hadoop.hbase.io.hfile.TestLruBlockCache
org.apache.hadoop.hbase.regionserver.TestStoreFile
org.apache.hadoop.hbase.TestHTableDescriptor
org.apache.hadoop.hbase.io.TestHeapSize
org.apache.hadoop.hbase.TestHColumnDescriptor
org.apache.hadoop.hbase.master.TestCatalogJanitor
org.apache.hadoop.hbase.io.hfile.TestHFileDataBlockEncoder
org.apache.hadoop.hbase.regionserver.TestColumnSeeking
org.apache.hadoop.hbase.regionserver.TestRSStatusServlet
org.apache.hadoop.hbase.regionserver.TestScanner
org.apache.hadoop.hbase.regionserver.TestSplitTransaction
org.apache.hadoop.hbase.regionserver.TestHBase7051
org.apache.hadoop.hbase.coprocessor.TestCoprocessorInterface
org.apache.hadoop.hbase.TestFSTableDescriptorForceCreation
org.apache.hadoop.hbase.io.hfile.TestHFileInlineToRootChunkConversion
org.apache.hadoop.hbase.regionserver.TestBlocksScanned
org.apache.hadoop.hbase.regionserver.TestKeepDeletes
org.apache.hadoop.hbase.filter.TestDependentColumnFilter
org.apache.hadoop.hbase.regionserver.TestResettingCounters
org.apache.hadoop.hbase.TestSerialization
org.apache.hadoop.hbase.coprocessor.TestRegionObserverStacking
org.apache.hadoop.hbase.io.TestHalfStoreFileReader
org.apache.hadoop.hbase.io.hfile.TestSeekTo
org.apache.hadoop.hbase.regionserver.TestScanWithBloomError
org.apache.hadoop.hbase.regionserver.wal.TestWALActionsListener
org.apache.hadoop.hbase.regionserver.TestRegionSplitPolicy
org.apache.hadoop.hbase.io.hfile.TestCachedBlockQueue
org.apache.hadoop.hbase.regionserver.TestHRegionInfo
org.apache.hadoop.hbase.regionserver.TestCompactSelection
org.apache.hadoop.hbase.constraint.TestConstraints
org.apache.hadoop.hbase.filter.TestColumnPrefixFilter
org.apache.hadoop.hbase.io.hfile.TestHFile
org.apache.hadoop.hbase.filter.TestMultipleColumnPrefixFilter
org.apache.hadoop.hbase.regionserver.TestMinVersions
org.apache.hadoop.hbase.rest.model.TestTableRegionModel
org.apache.hadoop.hbase.regionserver.TestWideScanner
org.apache.hadoop.hbase.client.TestIntraRowPagination
org.apache.hadoop.hbase.io.hfile.TestReseekTo
org.apache.hadoop.hbase.filter.TestFilter
Test results:
https://builds.apache.org/job/PreCommit-HBASE-Build/3341//testReport/
Findbugs warnings:
https://builds.apache.org/job/PreCommit-HBASE-Build/3341//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings:
https://builds.apache.org/job/PreCommit-HBASE-Build/3341//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings:
https://builds.apache.org/job/PreCommit-HBASE-Build/3341//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings:
https://builds.apache.org/job/PreCommit-HBASE-Build/3341//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings:
https://builds.apache.org/job/PreCommit-HBASE-Build/3341//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings:
https://builds.apache.org/job/PreCommit-HBASE-Build/3341//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output:
https://builds.apache.org/job/PreCommit-HBASE-Build/3341//console
This message is automatically generated.
> Prefix Compression - Trie data block encoding
> ---------------------------------------------
>
> Key: HBASE-4676
> URL: https://issues.apache.org/jira/browse/HBASE-4676
> Project: HBase
> Issue Type: New Feature
> Components: io, Performance, regionserver
> Affects Versions: 0.96.0
> Reporter: Matt Corgan
> Assignee: Matt Corgan
> Attachments: HBASE-4676-0.94-v1.patch,
> HBASE-4676-common-and-server-v8.patch, HBASE-4676-prefix-tree-trunk-v1.patch,
> HBASE-4676-prefix-tree-trunk-v2.patch, HBASE-4676-prefix-tree-trunk-v3.patch,
> HBASE-4676-prefix-tree-trunk-v4.patch, HBASE-4676-prefix-tree-trunk-v5.patch,
> HBASE-4676-prefix-tree-trunk-v6.patch, HBASE-4676-prefix-tree-trunk-v7.patch,
> hbase-prefix-trie-0.1.jar, PrefixTrie_Format_v1.pdf,
> PrefixTrie_Performance_v1.pdf, SeeksPerSec by blockSize.png
>
>
> The HBase data block format has room for 2 significant improvements for
> applications that have high block cache hit ratios.
> First, there is no prefix compression, and the current KeyValue format is
> somewhat metadata heavy, so there can be tremendous memory bloat for many
> common data layouts, specifically those with long keys and short values.
> Second, there is no random access to KeyValues inside data blocks. A seek
> must linearly scan half a block on average, so every time you double the
> data block size, average seek time (or average CPU consumption) doubles as
> well. The standard 64KB block size is ~10x slower for random seeks than a
> 4KB block size, but block sizes as small as 4KB cause problems elsewhere.
> Using block sizes of 256KB or 1MB or more may be more efficient from a disk
> access and block-cache perspective in many big-data applications, but doing
> so is infeasible from a random seek perspective.
> The PrefixTrie block encoding format attempts to solve both of these
> problems. Some features:
> * trie format for row key encoding completely eliminates duplicate row keys
> and encodes similar row keys into a standard trie structure, which also
> saves a lot of space
> * the column family is currently stored once at the beginning of each block.
> This could easily be modified to allow multiple family names per block
> * all qualifiers in the block are stored in their own trie format, which
> caters nicely to wide rows. Duplicate qualifiers between rows are
> eliminated. The size of this trie determines the width of the block's
> qualifier fixed-width-int
> * the minimum timestamp is stored at the beginning of the block, and deltas
> are calculated from that. The maximum delta determines the width of the
> block's timestamp fixed-width-int (see the sketch just after this list)
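>
> As a concrete illustration of the timestamp bullet above, here is a
> minimal sketch of fixed-width delta encoding. It assumes a simple
> big-endian layout and invented names (TimestampDeltaSketch, widthOf,
> encode); it is not the patch's actual wire format.
> {code:java}
> import java.nio.ByteBuffer;
>
> public class TimestampDeltaSketch {
>
>   /** Bytes needed to hold maxDelta as an unsigned big-endian int. */
>   static int widthOf(long maxDelta) {
>     int width = 1;
>     while ((maxDelta >>>= 8) != 0) {
>       width++;
>     }
>     return width;
>   }
>
>   /** Store the minimum timestamp once, then one fixed-width delta per
>       cell. */
>   static byte[] encode(long[] timestamps) {
>     long min = Long.MAX_VALUE;
>     for (long t : timestamps) {
>       min = Math.min(min, t);
>     }
>     long maxDelta = 0;
>     for (long t : timestamps) {
>       maxDelta = Math.max(maxDelta, t - min);
>     }
>     int width = widthOf(maxDelta);
>     ByteBuffer buf = ByteBuffer.allocate(8 + 1 + width * timestamps.length);
>     buf.putLong(min);       // block-wide minimum timestamp, stored once
>     buf.put((byte) width);  // delta width shared by every cell in the block
>     for (long t : timestamps) {
>       long delta = t - min;
>       for (int i = width - 1; i >= 0; i--) {
>         buf.put((byte) (delta >>> (8 * i)));  // big-endian, fixed width
>       }
>     }
>     return buf.array();
>   }
> }
> {code}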
> The block is structured with metadata at the beginning, then a section for
> the row trie, then the column trie, then the timestamp deltas, and then all
> the values. Most work is done in the row trie, where every leaf node
> (corresponding to a row) contains a list of offsets/references corresponding
> to the cells in that row. Each cell is fixed-width to enable binary
> searching and is represented by [1 byte operationType, X bytes qualifier
> offset, X bytes timestamp delta offset].
> If all operation types are the same for a block, there will be zero per-cell
> overhead. Same for timestamps. Same for qualifiers once I get a chance.
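>
> To make the fixed-width cell layout concrete, here is a hypothetical
> sketch of O(1) random access into one row's cell list, which is what makes
> binary searching possible. The names and parameters (CellRef, qWidth,
> tWidth) are illustrative, not taken from the patch.
> {code:java}
> import java.nio.ByteBuffer;
>
> // One decoded entry from a row's fixed-width cell list:
> // [1 byte op type | qWidth bytes qualifier offset | tWidth bytes ts index]
> final class CellRef {
>   final byte opType;
>   final int qualifierOffset;  // points into the block's qualifier trie
>   final int timestampIndex;   // points into the block's timestamp deltas
>
>   CellRef(byte opType, int qualifierOffset, int timestampIndex) {
>     this.opType = opType;
>     this.qualifierOffset = qualifierOffset;
>     this.timestampIndex = timestampIndex;
>   }
> }
>
> final class CellListReader {
>   /** Every entry has the same width, so entry i sits at a computable
>       offset; no sequential decoding is needed and binary search is cheap. */
>   static CellRef readCell(ByteBuffer block, int sectionStart,
>                           int qWidth, int tWidth, int i) {
>     int entryWidth = 1 + qWidth + tWidth;
>     int pos = sectionStart + i * entryWidth;
>     byte op = block.get(pos);
>     int q = readFixedWidthInt(block, pos + 1, qWidth);
>     int t = readFixedWidthInt(block, pos + 1 + qWidth, tWidth);
>     return new CellRef(op, q, t);
>   }
>
>   static int readFixedWidthInt(ByteBuffer buf, int pos, int width) {
>     int v = 0;
>     for (int j = 0; j < width; j++) {
>       v = (v << 8) | (buf.get(pos + j) & 0xFF);
>     }
>     return v;
>   }
> }
> {code}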
> So, the compression aspect is very strong, but it makes a few small
> sacrifices on VarInt size to enable faster binary searches in trie fan-out
> nodes. A more compressed but slower version might build on this by also
> applying further (suffix, etc.) compression to the trie nodes at the cost
> of slower write speed. Even further compression could be obtained by using
> all VInts instead of FInts, at some sacrifice in random seek speed (though
> not a huge one); the sketch below illustrates the difference.
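>
> For illustration, here is a standard continuation-bit varint writer (a
> common scheme, assumed here rather than taken from the patch), with a note
> on why FInts win for seeks:
> {code:java}
> import java.nio.ByteBuffer;
>
> public class VarIntSketch {
>   /** Standard varint: 7 payload bits per byte, high bit = "more follows". */
>   static int writeVarInt(ByteBuffer buf, long v) {
>     int bytes = 1;
>     while ((v & ~0x7FL) != 0) {
>       buf.put((byte) ((v & 0x7F) | 0x80));  // set continuation bit
>       v >>>= 7;
>       bytes++;
>     }
>     buf.put((byte) v);  // final byte, continuation bit clear
>     return bytes;
>   }
>   // A VInt's length depends on its value, so locating entry i requires
>   // decoding entries 0..i-1. An FInt always occupies the block's fixed
>   // width, so entry i sits at a computable offset; that is what keeps
>   // binary search inside a trie fan-out node fast.
> }
> {code}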
> One current drawback is write speed. While the encoder is built with sound
> constructs like TreeMaps, ByteBuffers, binary searches, etc., it is not
> optimized to the same level as the read path. Work will need to be done on
> the data structures used for encoding, which could probably yield a 10x
> speedup. It will still be slower than delta encoding, but with a much
> higher decode speed. I have not yet created a thorough benchmark for write
> speed or sequential read speed.
> Though the trie is reaching a point where it is internally very efficient
> (probably within a half or a quarter of its maximum read speed), the way
> that HBase currently uses it is far from optimal. The KeyValueScanner and
> related classes that iterate through the trie will eventually need to be
> smarter and have methods to do things like skipping to the next row of
> results without scanning every cell in between. When that is accomplished,
> it will also allow much faster compactions, because the full row key will
> not have to be compared as often as it is now.
> Current code is on GitHub. The trie code is in a separate project from the
> slightly modified HBase; there is an HBase project there as well with the
> DeltaEncoding patch applied, and the trie project builds on top of that.
> https://github.com/hotpads/hbase/tree/prefix-trie-1
> https://github.com/hotpads/hbase-prefix-trie/tree/hcell-scanners
> I'll follow up later with more implementation ideas.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira