siddharthteotia commented on a change in pull request #4791: Support STRING and
BYTES for no dictionary columns in realtime consuming segments
URL: https://github.com/apache/incubator-pinot/pull/4791#discussion_r365270102
##########
File path:
pinot-core/src/main/java/org/apache/pinot/core/indexsegment/mutable/MutableSegmentImpl.java
##########
@@ -212,19 +217,35 @@ public long getLatestIngestionTimestamp() {
}
DataFileReader indexReaderWriter;
- if (fieldSpec.isSingleValueField()) {
- String allocationContext =
- buildAllocationContext(_segmentName, column,
V1Constants.Indexes.UNSORTED_SV_FORWARD_INDEX_FILE_EXTENSION);
- indexReaderWriter = new
FixedByteSingleColumnSingleValueReaderWriter(_capacity, indexColumnSize,
_memoryManager,
- allocationContext);
+
+ if (forwardIndexColumnSize > 0) {
+ // two possible cases can lead here:
+ // (1) dictionary encoded forward index
+ // (2) raw forward index for fixed width types -- INT, LONG, FLOAT,
DOUBLE
+ if (fieldSpec.isSingleValueField()) {
+ String allocationContext =
+ buildAllocationContext(_segmentName, column,
V1Constants.Indexes.UNSORTED_SV_FORWARD_INDEX_FILE_EXTENSION);
+ indexReaderWriter = new
FixedByteSingleColumnSingleValueReaderWriter(_capacity, forwardIndexColumnSize,
_memoryManager,
+ allocationContext);
+ } else {
+ // TODO: Start with a smaller capacity on
FixedByteSingleColumnMultiValueReaderWriter and let it expand
+ String allocationContext =
+ buildAllocationContext(_segmentName, column,
V1Constants.Indexes.UNSORTED_MV_FORWARD_INDEX_FILE_EXTENSION);
+ indexReaderWriter =
+ new
FixedByteSingleColumnMultiValueReaderWriter(MAX_MULTI_VALUES_PER_ROW,
avgNumMultiValues, _capacity,
+ forwardIndexColumnSize, _memoryManager, allocationContext);
+ }
} else {
- // TODO: Start with a smaller capacity on
FixedByteSingleColumnMultiValueReaderWriter and let it expand
- String allocationContext =
- buildAllocationContext(_segmentName, column,
V1Constants.Indexes.UNSORTED_MV_FORWARD_INDEX_FILE_EXTENSION);
- indexReaderWriter =
- new
FixedByteSingleColumnMultiValueReaderWriter(MAX_MULTI_VALUES_PER_ROW,
avgNumMultiValues, _capacity,
- indexColumnSize, _memoryManager, allocationContext);
+ // for STRING/BYTES SV column, we support raw index in consuming
segments
+ // RealtimeSegmentStatsHistory does not have the stats for
no-dictionary columns
+ // from previous consuming segments
+ // TODO: come up with better estimated values
Review comment:
Cardinality was a wrong word I used for variable names. I meant number of
values (rows).
I have added a TODO to capture the estimated average column length in
realtime segment stats history for no dictionary columns as well. Currently we
only do this for dictionary encoded columns. For now, a constant value of 100
is being used. I will follow-up with a PR to add support for this.
Secondly, using _capacity (which is essentially the maxSegmentRows as
indicated RealtimeSegmentConfig) directly for allocating the memory for
VarByteSVRW might not be good. Note that VarByteSVRW internally uses
MutableOffHeapByteArrayStore which stores data in a list of buffers. Passing
_capacity means the byte store will try to allocate a single giant buffer to
store the strings/bytes for all rows. This memory allocation might fail if we
are using DIRECT off-heap memory mode and _capacity is very high and memory is
fragmented. So I do a min(_capacity, 100_000) as the initial capacity for
VarByteSVRW to begin with smaller capacity as opposed to _capacity
Looking at the code, it seems like we already have a TODO to start with
smaller capacity for MV columns.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]