[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #4791: Support STRING and BYTES for no dictionary columns in realtime consuming segments

GitBox Fri, 10 Jan 2020 14:25:52 -0800

siddharthteotia commented on a change in pull request #4791: Support STRING and 
BYTES for no dictionary columns in realtime consuming segments
URL: https://github.com/apache/incubator-pinot/pull/4791#discussion_r365270102


 ##########
 File path: 
pinot-core/src/main/java/org/apache/pinot/core/indexsegment/mutable/MutableSegmentImpl.java
 ##########
 @@ -212,19 +217,35 @@ public long getLatestIngestionTimestamp() {
       }
 
       DataFileReader indexReaderWriter;
-      if (fieldSpec.isSingleValueField()) {
-        String allocationContext =
-            buildAllocationContext(_segmentName, column, 
V1Constants.Indexes.UNSORTED_SV_FORWARD_INDEX_FILE_EXTENSION);
-        indexReaderWriter = new 
FixedByteSingleColumnSingleValueReaderWriter(_capacity, indexColumnSize, 
_memoryManager,
-            allocationContext);
+
+      if (forwardIndexColumnSize > 0) {
+        // two possible cases can lead here:
+        // (1) dictionary encoded forward index
+        // (2) raw forward index for fixed width types -- INT, LONG, FLOAT, 
DOUBLE
+        if (fieldSpec.isSingleValueField()) {
+          String allocationContext =
+              buildAllocationContext(_segmentName, column, 
V1Constants.Indexes.UNSORTED_SV_FORWARD_INDEX_FILE_EXTENSION);
+          indexReaderWriter = new 
FixedByteSingleColumnSingleValueReaderWriter(_capacity, forwardIndexColumnSize, 
_memoryManager,
+              allocationContext);
+        } else {
+          // TODO: Start with a smaller capacity on 
FixedByteSingleColumnMultiValueReaderWriter and let it expand
+          String allocationContext =
+              buildAllocationContext(_segmentName, column, 
V1Constants.Indexes.UNSORTED_MV_FORWARD_INDEX_FILE_EXTENSION);
+          indexReaderWriter =
+              new 
FixedByteSingleColumnMultiValueReaderWriter(MAX_MULTI_VALUES_PER_ROW, 
avgNumMultiValues, _capacity,
+                  forwardIndexColumnSize, _memoryManager, allocationContext);
+        }
       } else {
-        // TODO: Start with a smaller capacity on 
FixedByteSingleColumnMultiValueReaderWriter and let it expand
-        String allocationContext =
-            buildAllocationContext(_segmentName, column, 
V1Constants.Indexes.UNSORTED_MV_FORWARD_INDEX_FILE_EXTENSION);
-        indexReaderWriter =
-            new 
FixedByteSingleColumnMultiValueReaderWriter(MAX_MULTI_VALUES_PER_ROW, 
avgNumMultiValues, _capacity,
-                indexColumnSize, _memoryManager, allocationContext);
+        // for STRING/BYTES SV column, we support raw index in consuming 
segments
+        // RealtimeSegmentStatsHistory does not have the stats for 
no-dictionary columns
+        // from previous consuming segments
+        // TODO: come up with better estimated values
 
 Review comment:
   Cardinality was a wrong word I used for variable names. I meant number of 
values (rows).
   
   I have added a TODO to capture the estimated average column length in 
realtime segment stats history for no dictionary columns as well. Currently we 
only do this for dictionary encoded columns. For now, a constant value of 100 
is being used. I will follow-up with a PR to add support for this.
   
   Secondly, using _capacity (which is essentially the maxSegmentRows as 
indicated RealtimeSegmentConfig) directly for allocating the memory for 
VarByteSVRW might not be good. Note that VarByteSVRW internally uses 
MutableOffHeapByteArrayStore which stores data in a list of buffers. Passing 
_capacity means the byte store will try to allocate a single giant buffer to 
store the strings/bytes for all rows. This memory allocation might fail if we 
are using DIRECT off-heap memory mode and _capacity is very high and memory is 
fragmented. So I do a min(_capacity, 100_000) as the initial capacity for 
VarByteSVRW to begin with smaller capacity as opposed to _capacity
   
   Looking at the code, it seems like we already have a TODO to start with 
smaller capacity for MV columns.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #4791: Support STRING and BYTES for no dictionary columns in realtime consuming segments

Reply via email to