[ https://issues.apache.org/jira/browse/PHOENIX-4382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299360#comment-16299360 ]
Vincent Poon edited comment on PHOENIX-4382 at 12/21/17 1:24 AM: ----------------------------------------------------------------- [~jamestaylor] since we store everything using PVarbinary, the original code is such that we will always store a null as {separatorByte, #_of_nulls}, no matter what the actual column type in the table schema. Now, there's two different things we could be discussing here: what the original code did, and what my patch does. For the original code, there was a short circuit that just checked for separatorByte, so any column value that started with separator byte would be mistaken as a null. This happens for both fixed and variable length columns, since everything is stored and interpreted as varbinary. This was the bug. In my patch, I changed the old scheme interoperation such that we only return null if the value 1) starts with separatorByte, and 2) is a length of 2. That's because there's no way for us to distinguish between - a single null which we write as {separatorByte, 1} - a value with length of 2 that starts with separatorByte and ends with 1 Knowing the type of the column in the schema doesn't help much, other than letting us know that if it's a fixed column of length > 2, we know it's not a null if the value we encounter has length > 2 (since a single null would have a length of 2). But my code already pretty much does the same thing. And just for further clarification, with my patch, there should only be exactly two values that will return incorrect results. If you store the following: {ascSeparatorByte, 1} {descSeparatorByte, 1} Then we can't distinguish those from a null, even knowing the column type. That should happen only if you're storing two specific SMALLINT values, or a variable length value equivalent. was (Author: vincentpoon): [~jamestaylor] since we store everything using PVarbinary, the original code is such that we will always store a null as {separatorByte, #_of_nulls}, no matter what the actual column type in the table schema. Now, there's two different things we could be discussing here: what the original code did, and what my patch does. For the original code, there was a short circuit that just checked for separatorByte, so any column value that started with separator byte would be mistaken as a null. This happens for both fixed and variable length columns, since everything is stored and interpreted as varbinary. This was the bug. In my patch, I changed the old scheme interoperation such that we only return null if the value 1) starts with separatorByte, and 2) is a length of 2. That's because there's no way for us to distinguish between - a single null which we write as {separatorByte, 1} - a value with length of 2 that starts with separatorByte Knowing the type of the column in the schema doesn't help much, other than letting us know that if it's a fixed column of length > 2, we know it's not a null if the value we encounter has length > 2 (since a single null would have a length of 2). But my code already pretty much does the same thing. And just for further clarification, with my patch, there should only be exactly two values that will return incorrect results. If you store the following: {ascSeparatorByte, 1} {descSeparatorByte, 1} Then we can't distinguish those from a null, even knowing the column type. That should happen only if you're storing two specific SMALLINT values, or a variable length value equivalent. > Immutable table SINGLE_CELL_ARRAY_WITH_OFFSETS values starting with separator > byte return null in query results > --------------------------------------------------------------------------------------------------------------- > > Key: PHOENIX-4382 > URL: https://issues.apache.org/jira/browse/PHOENIX-4382 > Project: Phoenix > Issue Type: Bug > Affects Versions: 4.14.0 > Reporter: Vincent Poon > Assignee: Vincent Poon > Attachments: PHOENIX-4382.v1.master.patch, > PHOENIX-4382.v2.master.patch, UpsertBigValuesIT.java > > > For immutable tables, upsert of some values like Short.MAX_VALUE results in a > null value in query resultsets. Mutable tables are not affected. I tried > with BigInt and got the same problem. > For Short, the breaking point seems to be 32512. > This is happening because of the way we serialize nulls. For nulls, we write > out [separatorByte, #_of_nulls]. However, some data values, like > Short.MAX_VALUE, start with separatorByte, we can't distinguish between a > null and these values. Currently the code assumes it's a null when it sees a > leading separatorByte, hence the incorrect query results. > See attached test - testShort() , testBigInt() -- This message was sent by Atlassian JIRA (v6.4.14#64029)