mcvsubbu commented on a change in pull request #4791: Support STRING and BYTES
for no dictionary columns in realtime consuming segments
URL: https://github.com/apache/incubator-pinot/pull/4791#discussion_r343790643
##########
File path:
pinot-core/src/main/java/org/apache/pinot/core/indexsegment/mutable/MutableSegmentImpl.java
##########
@@ -212,19 +217,35 @@ public long getLatestIngestionTimestamp() {
       }
       DataFileReader indexReaderWriter;
-      if (fieldSpec.isSingleValueField()) {
-        String allocationContext =
-            buildAllocationContext(_segmentName, column, V1Constants.Indexes.UNSORTED_SV_FORWARD_INDEX_FILE_EXTENSION);
-        indexReaderWriter = new FixedByteSingleColumnSingleValueReaderWriter(_capacity, indexColumnSize, _memoryManager,
-            allocationContext);
+
+      if (forwardIndexColumnSize > 0) {
+        // two possible cases can lead here:
+        // (1) dictionary encoded forward index
+        // (2) raw forward index for fixed width types -- INT, LONG, FLOAT, DOUBLE
+        if (fieldSpec.isSingleValueField()) {
+          String allocationContext =
+              buildAllocationContext(_segmentName, column, V1Constants.Indexes.UNSORTED_SV_FORWARD_INDEX_FILE_EXTENSION);
+          indexReaderWriter = new FixedByteSingleColumnSingleValueReaderWriter(_capacity, forwardIndexColumnSize, _memoryManager,
+              allocationContext);
+        } else {
+          // TODO: Start with a smaller capacity on FixedByteSingleColumnMultiValueReaderWriter and let it expand
+          String allocationContext =
+              buildAllocationContext(_segmentName, column, V1Constants.Indexes.UNSORTED_MV_FORWARD_INDEX_FILE_EXTENSION);
+          indexReaderWriter =
+              new FixedByteSingleColumnMultiValueReaderWriter(MAX_MULTI_VALUES_PER_ROW, avgNumMultiValues, _capacity,
+                  forwardIndexColumnSize, _memoryManager, allocationContext);
+        }
       } else {
-        // TODO: Start with a smaller capacity on FixedByteSingleColumnMultiValueReaderWriter and let it expand
-        String allocationContext =
-            buildAllocationContext(_segmentName, column, V1Constants.Indexes.UNSORTED_MV_FORWARD_INDEX_FILE_EXTENSION);
-        indexReaderWriter =
-            new FixedByteSingleColumnMultiValueReaderWriter(MAX_MULTI_VALUES_PER_ROW, avgNumMultiValues, _capacity,
-                indexColumnSize, _memoryManager, allocationContext);
+        // for STRING/BYTES SV column, we support raw index in consuming segments
+        // RealtimeSegmentStatsHistory does not have the stats for no-dictionary columns
+        // from previous consuming segments
+        // TODO: come up with better estimated values
Review comment:
Cardinality should not be a factor here: this is a raw index, so the actual
values are stored. The only estimate you need is the average string length,
which we can get from StatsHistory (as long as we update it correctly, of
course). The call that constructs VarByteSingleColumnSVRW should take _capacity
as the number of strings to add, along with the average length that we get
from stats history.
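
A minimal sketch of the sizing this suggests, assuming the raw index only needs the row capacity and an average value length. The class `RawIndexSizing`, the method `estimateInitialBytes`, and the fallback of 32 bytes are hypothetical names and values chosen for illustration; none of them are Pinot APIs, and the real constructor arguments may differ.

```java
// Hypothetical helper, not part of Pinot: estimates the initial buffer size
// for a raw (no-dictionary) variable-length forward index.
final class RawIndexSizing {
  // Assumed fallback used when stats history has no average length yet.
  static final int DEFAULT_AVG_LENGTH_BYTES = 32;

  private RawIndexSizing() {
  }

  /**
   * Returns capacity * average length. Cardinality is deliberately absent,
   * because a raw index stores every value rather than dictionary ids.
   *
   * @param capacity number of values (rows) the index should hold
   * @param avgLengthBytes average value length from stats history, or a
   *        non-positive number when no estimate is available
   */
  static long estimateInitialBytes(int capacity, int avgLengthBytes) {
    if (capacity <= 0) {
      throw new IllegalArgumentException("capacity must be positive: " + capacity);
    }
    int length = avgLengthBytes > 0 ? avgLengthBytes : DEFAULT_AVG_LENGTH_BYTES;
    // Widen to long and fail loudly on overflow instead of wrapping around.
    return Math.multiplyExact((long) capacity, (long) length);
  }
}
```

For example, `RawIndexSizing.estimateInitialBytes(_capacity, averageLen)` would give the byte budget to start from once the stats history records an average length for no-dictionary columns.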