[ https://issues.apache.org/jira/browse/KYLIN-3115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wang, Gang updated KYLIN-3115:
------------------------------
    Description: 
In class NDCuboidBuilder:
public NDCuboidBuilder(CubeSegment cubeSegment) {
    this.cubeSegment = cubeSegment;
    this.rowKeySplitter = new RowKeySplitter(cubeSegment, 65, 256);
    this.rowKeyEncoderProvider = new RowKeyEncoderProvider(cubeSegment);
}
which creates temporary byte arrays of length 256 to hold the row key column 
bytes.

In class MergeCuboidMapper, however, it is initialized with length 255:
rowKeySplitter = new RowKeySplitter(sourceCubeSegment, 65, 255);
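For context, here is a minimal sketch, assuming the third constructor argument 
simply sizes the backing array of each per-column split buffer (the class and 
field names below are illustrative, not the actual Kylin RowKeySplitter or 
SplittedBytes source):

    // Illustrative sketch only: the third RowKeySplitter argument becomes the
    // backing array size of every per-column split buffer.
    class SplitBufferSketch {
        final byte[] value;   // holds one column's bytes after split()
        int length;           // actual column length, filled in by split()

        SplitBufferSketch(int maxBytes) {
            this.value = new byte[maxBytes]; // 256 on the build side, 255 on the merge side
        }
    }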

So, if a dimension uses fixed-length encoding and its max length is set to 256, 
the cube build job will succeed while the merge job will always fail, because 
of method doMap in class MergeCuboidMapper:
    public void doMap(Text key, Text value, Context context) throws IOException, InterruptedException {
        long cuboidID = rowKeySplitter.split(key.getBytes());
        Cuboid cuboid = Cuboid.findForMandatory(cubeDesc, cuboidID);

doMap invokes method RowKeySplitter.split(byte[] bytes), and inside split the per-column copy loop is:
        for (int i = 0; i < cuboid.getColumns().size(); i++) {
            splitOffsets[i] = offset;
            TblColRef col = cuboid.getColumns().get(i);
            int colLength = colIO.getColumnLength(col);
            SplittedBytes split = this.splitBuffers[this.bufferSize++];
            split.length = colLength;
            System.arraycopy(bytes, offset, split.value, 0, colLength);
            offset += colLength;
        }
System.arraycopy throws an IndexOutOfBoundsException when a 256-byte column 
value is copied into a byte array of length 255.
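A minimal, self-contained reproduction of that failure (plain Java, independent 
of Kylin; the array names are illustrative):

    public class SplitBufferOverflowDemo {
        public static void main(String[] args) {
            byte[] columnBytes = new byte[256]; // fixed-length column with max length 256
            byte[] splitBuffer = new byte[255]; // buffer sized like the merge-side splitter

            // Mirrors the arraycopy in RowKeySplitter.split() with colLength = 256:
            // throws java.lang.ArrayIndexOutOfBoundsException because the destination
            // array is one byte too short.
            System.arraycopy(columnBytes, 0, splitBuffer, 0, columnBytes.length);
        }
    }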

The same incompatibility also occurs in class FilterRecommendCuboidDataMapper, 
which initializes its RowKeySplitter as:
rowKeySplitter = new RowKeySplitter(originalSegment, 65, 255);

I think the better way is to always set the max split byte length to 256; a 
sketch of the idea follows below. Dimensions encoded with fixed length 256 are 
actually quite common in our production, since varchar(256) is a common type in 
Hive, and users without much knowledge of the encodings tend to choose 
fixed-length encoding for such dimensions and set the max length to 256.
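As an illustration of that suggestion (the constant name below is hypothetical, 
not an actual Kylin patch; the real fix may simply change the literals), the 
three call sites could share one constant instead of hard-coding 255/256 
independently:

    // Hypothetical shared constant so build, merge and filter jobs stay in sync.
    public final class RowKeyBufferConstants {
        public static final int MAX_SPLIT_BYTES = 256;
        private RowKeyBufferConstants() {}
    }

    // NDCuboidBuilder:
    //   this.rowKeySplitter = new RowKeySplitter(cubeSegment, 65, RowKeyBufferConstants.MAX_SPLIT_BYTES);
    // MergeCuboidMapper:
    //   rowKeySplitter = new RowKeySplitter(sourceCubeSegment, 65, RowKeyBufferConstants.MAX_SPLIT_BYTES);
    // FilterRecommendCuboidDataMapper:
    //   rowKeySplitter = new RowKeySplitter(originalSegment, 65, RowKeyBufferConstants.MAX_SPLIT_BYTES);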






> Incompatible RowKeySplitter initialize between build and merge job
> ------------------------------------------------------------------
>
>                 Key: KYLIN-3115
>                 URL: https://issues.apache.org/jira/browse/KYLIN-3115
>             Project: Kylin
>          Issue Type: Bug
>            Reporter: Wang, Gang
>            Assignee: Wang, Gang
>            Priority: Minor
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
