Github user ajantha-bhat commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2814#discussion_r227667228
--- Diff:
processing/src/main/java/org/apache/carbondata/processing/store/CarbonFactDataHandlerColumnar.java
---
@@ -221,50 +224,106 @@ public void addDataToStore(CarbonRow row) throws CarbonDataWriterException {
/**
* Check if column page can be added more rows after adding this row to page.
+ * Only a few no-dictionary dimension columns (string, varchar,
+ * complex columns) can grow huge in size.
*
- * A varchar column page uses SafeVarLengthColumnPage/UnsafeVarLengthColumnPage to store data
- * and encoded using HighCardDictDimensionIndexCodec which will call getByteArrayPage() from
- * column page and flatten into byte[] for compression.
- * Limited by the index of array, we can only put number of Integer.MAX_VALUE bytes in a page.
*
- * Another limitation is from Compressor. Currently we use snappy as default compressor,
- * and it will call MaxCompressedLength method to estimate the result size for preparing output.
- * For safety, the estimate result is oversize: `32 + source_len + source_len/6`.
- * So the maximum bytes to compress by snappy is (2GB-32)*6/7 ≈ 1.71GB.
- *
- * Size of a row does not exceed 2MB since UnsafeSortDataRows uses 2MB byte[] as rowBuffer.
- * Such that we can stop adding more row here if any long string column reach this limit.
- *
- * If use unsafe column page, please ensure the memory configured is enough.
- * @param row
- * @return false if any varchar column page cannot add one more value(2MB)
+ * @param row carbonRow
+ * @return false if next rows can be added to same page.
+ * true if next rows cannot be added to same page
*/
- private boolean isVarcharColumnFull(CarbonRow row) {
- //TODO: test and remove this as now UnsafeSortDataRows can exceed 2MB
- if (model.getVarcharDimIdxInNoDict().size() > 0) {
+ private boolean needToCutThePage(CarbonRow row) {
--- End diff --
@xuchuanyin :
a. Yes, made this a table property.
b. We need to keep this validation for string columns also (they can grow up to 1.8 GB); if that page can fit into the cache, it can give better read performance.
c. There is no impact on load performance from checking this on each row, because no extra computation happens: just a few checks per row based on the data type.
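As a side note on the ~1.8 GB figure in point (b): it follows from Snappy's worst-case output estimate `32 + source_len + source_len/6` quoted in the removed javadoc, combined with the `Integer.MAX_VALUE` cap on a Java `byte[]` index. A minimal sketch of that arithmetic (the class and method names below are illustrative, not CarbonData or snappy-java APIs):

```java
public class SnappyPageLimit {

    // Snappy's worst-case compressed size estimate: 32 + len + len/6.
    static long maxCompressedLength(long sourceLen) {
        return 32 + sourceLen + sourceLen / 6;
    }

    // Largest source length whose worst-case compressed size still fits
    // in a Java byte[] (int-indexed): solve 32 + n + n/6 <= 2^31 - 1,
    // i.e. n <= (Integer.MAX_VALUE - 32) * 6 / 7.
    static long maxSourceLength() {
        return (Integer.MAX_VALUE - 32L) * 6 / 7;
    }

    public static void main(String[] args) {
        long n = maxSourceLength();
        System.out.println(n);                              // raw byte count
        System.out.println(n / (1024.0 * 1024 * 1024));     // ≈ 1.71 GiB
    }
}
```

Evaluating `maxSourceLength()` gives roughly 1.84e9 bytes, i.e. about 1.71 GiB, which matches the "(2GB-32)*6/7" bound in the old comment; a page of string/varchar data must stay under this so the compressor's output still fits an int-indexed array.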
TODO: find a default value and set it if the page size is not configured. Working on it; will handle it in the same PR.
@ravipesala , @xuchuanyin : please check.
---