[ https://issues.apache.org/jira/browse/PHOENIX-2417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15095627#comment-15095627 ]
James Taylor commented on PHOENIX-2417:
---------------------------------------

Thanks for the quick turnaround on the patch, [~ankit.singhal]. Here's some feedback.

bq. Is it fine to convert compressed guidePosts back to List<byte[]> in baseResultIterator for other faster operations like binary search and all?

I'd be inclined not to convert it back. There's only one place I see where we do a binary search into it, and that's in BaseResultIterators, to find the starting index given our current/start key, here:
{code}
int guideIndex = currentKey.length == 0 ? 0 : getIndexContainingInclusive(gps, currentKey);
{code}
Otherwise, we're just iterating after that. Let's modify getIndexContainingInclusive to use our decoder to walk the compressed bytes linearly. Then in the loop below, instead of accessing by index, we can just iterate to the next key, as we'll already be at the right point based on the linear search above:
{code}
while (guideIndex < gpsSize
        && (Bytes.compareTo(currentGuidePost = gps.get(guideIndex), endKey) <= 0 || endKey.length == 0)) {
    Scan newScan = scanRanges.intersectScan(scan, currentKey, currentGuidePost, keyOffset, false);
    scans = addNewScan(parallelScans, scans, newScan, currentGuidePost, false, regionLocation);
    currentKey = currentGuidePost;
    guideIndex++;
}
{code}
Also, on the encode side, we shouldn't be re-copying the bytes with each add; instead we'd want to buffer them in a ByteArrayOutputStream and use our encoder:
{code}
public boolean addGuidePost(byte[] row, long byteCount, long rowCount) {
-    if (guidePosts.isEmpty() || Bytes.compareTo(row, guidePosts.get(guidePosts.size() - 1)) > 0) {
-        List<byte[]> newGuidePosts = Lists.newArrayListWithExpectedSize(this.getGuidePosts().size() + 1);
-        newGuidePosts.addAll(guidePosts);
-        newGuidePosts.add(row);
-        this.guidePosts = ImmutableList.copyOf(newGuidePosts);
-        this.byteCount += byteCount;
-        this.keyByteSize += row.length;
-        this.rowCount += rowCount;
-        return true;
+    if (row.length != 0 && Bytes.compareTo(lastRow, row) < 0) {
+        try {
+            encoder.encode(output, row, 0, row.length);
+            this.guidePosts.set(Bytes.copy(stream.getBuffer(), 0, stream.size()));
+            this.byteCount += byteCount;
+            this.guidePostsCount++;
+            this.maxLength = encoder.getMaxLength();
+            this.rowCount += rowCount;
+            lastRow = row;
+            return true;
+        } catch (IOException e) {
+            return false;
+        }
    }
    return false;
}
{code}
Similarly, in StatisticsUtil.readStatistics() we should use a ByteArrayOutputStream and our encoder to incrementally collect the guideposts, and only at the end get the array out of it.

If it's not too much trouble, it'd be good to spin up a pull request, as I think that'll make it easier to review.

> Compress memory used by row key byte[] of guideposts
> ----------------------------------------------------
>
>          Key: PHOENIX-2417
>          URL: https://issues.apache.org/jira/browse/PHOENIX-2417
>      Project: Phoenix
>   Issue Type: Sub-task
>     Reporter: James Taylor
>     Assignee: Ankit Singhal
>      Fix For: 4.7.0
>
>  Attachments: PHOENIX-2417.patch, PHOENIX-2417_encoder.diff
>
>
> We've found that smaller guideposts are better in terms of minimizing any
> increase in latency for point scans. However, this increases the amount of
> memory significantly when caching the guideposts on the client. Guideposts are
> equidistant row keys in the form of raw byte[] which are likely to have a
> large percentage of their leading bytes in common (as they're stored in
> sorted order). We should use a simple compression technique to mitigate this.
> I noticed that Apache Parquet has a run-length encoding - perhaps we can use
> that.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
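The linear decode-and-search suggested in the comment above can be sketched roughly as follows. This is a standalone, simplified illustration of prefix (front) compression over sorted row keys with a linear walk replacing the binary search; the encoding layout, class name, and method names here are hypothetical stand-ins, not Phoenix's actual encoder/decoder API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

/**
 * Sketch of prefix (front) compression for sorted row keys, plus a linear
 * search over the compressed stream. Hypothetical format: each key is stored
 * as (sharedPrefixLen, suffixLen, suffixBytes) relative to the previous key.
 */
public class PrefixCompressionSketch {

    /** Encode sorted keys into a single buffer, sharing leading bytes. */
    static byte[] encode(byte[][] sortedKeys) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        byte[] prev = new byte[0];
        for (byte[] key : sortedKeys) {
            int common = 0;
            int max = Math.min(prev.length, key.length);
            while (common < max && prev[common] == key[common]) common++;
            out.writeInt(common);               // bytes shared with previous key
            out.writeInt(key.length - common);  // length of the new suffix
            out.write(key, common, key.length - common);
            prev = key;
        }
        return buf.toByteArray();
    }

    /**
     * Walk the compressed stream linearly, reconstructing each key, and return
     * the index of the first guidepost >= searchKey (or count if none). This is
     * the analogue of replacing the binary search in getIndexContainingInclusive
     * with a decoder-driven linear scan.
     */
    static int indexContainingInclusive(byte[] encoded, int count, byte[] searchKey)
            throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(encoded));
        byte[] current = new byte[0];
        for (int i = 0; i < count; i++) {
            int common = in.readInt();
            int suffixLen = in.readInt();
            byte[] next = new byte[common + suffixLen];
            System.arraycopy(current, 0, next, 0, common);  // reuse shared prefix
            in.readFully(next, common, suffixLen);          // read new suffix
            current = next;
            if (compareUnsigned(current, searchKey) >= 0) return i;
        }
        return count;
    }

    /** Unsigned lexicographic byte[] comparison, like Bytes.compareTo. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xff) - (b[i] & 0xff);
            if (cmp != 0) return cmp;
        }
        return a.length - b.length;
    }
}
```

Once the linear search has positioned the decoder at the starting guidepost, the loop that builds parallel scans can simply keep decoding the next key from the same stream rather than indexing into a List<byte[]>, which is the shape of change the comment proposes.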