For those following along, I figured this one out:
https://issues.apache.org/jira/browse/KUDU-1677

Turns out to be a long-standing bug, unrelated to alter-table or add/drop
range partitions.

On Tue, Oct 4, 2016 at 12:24 PM, Todd Lipcon <[email protected]> wrote:

> Good point that they're in the same partition. I have the data dir from
> this failure as well, so I'll dig in and see what I can determine. Dont
> worry about spending time on it for now, since it seems like it's not
> related to add/drop. Will report back if I need any more pointers.
>
> -Todd
>
> On Mon, Oct 3, 2016 at 10:56 PM, Dan Burkert <[email protected]> wrote:
>
>> It's also strange that the first two rows have the same value for c79.
>> That is extremely unlikely.  I can dig in more tomorrow.
>>
>> - Dan
>>
>> On Mon, Oct 3, 2016 at 10:53 PM, Dan Burkert <[email protected]> wrote:
>>
>>> The first three rows (including the out of order row) all fall in the
>>> same range partition, so the issue is likely that the intra-tablet scan
>>> returned out of order results (as opposed to the client scanning tablets
>>> out of order). I'm under the same impression about SetFaultTolerant(),
>>> which is why the test explicitly sets it.  How often is this happening?
>>> Back when this test was committed a few months ago I ran it a few thousand
>>> times and never saw anything like this.
>>>
>>> On Mon, Oct 3, 2016 at 10:35 PM, Todd Lipcon <[email protected]> wrote:
>>>
>>>> Hey Dan (+CC dev in case anyone else knows about this too)
>>>>
>>>> I'm debugging some flakiness in alter_table-randomized-test, and ti
>>>> seems like it's failing because the verification scan is returning some out
>>>> of order rows, despite using "SetFaultTolerant()". Granted, fault tolerance
>>>> isn't publicly guaranteed to return rows in order, but I was under the
>>>> impression that, with range partitioned tablets, it would always do so.
>>>>
>>>> The scan result I'm seeing has the following sequence within it:
>>>>
>>>> (int32 key=537424064, int32 c945=NULL, int32 c79=234639860, int32
>>>> c990=NULL)
>>>> >>>> OUT OF ORDER ROW
>>>> (int32 key=552025439, int32 c945=NULL, int32 c79=234639860, int32
>>>> c990=NULL)
>>>> >>>> BACK TO NORMAL ORDER
>>>> (int32 key=539314778, int32 c945=1708089980, int32 c79=-878787336,
>>>> int32 c990=829302644)
>>>> (int32 key=541817227, int32 c945=2064952224, int32 c79=2064952224,
>>>> int32 c990=NULL)
>>>> (int32 key=546056206, int32 c945=26527696, int32 c79=26527696, int32
>>>> c990=26527696)
>>>> (int32 key=601960253, int32 c945=NULL, int32 c79=1088757503, int32
>>>> c990=NULL)
>>>> (int32 key=677154987, int32 c945=823764490, int32 c79=823764490, int32
>>>> c990=823764490)
>>>>
>>>> The prior alter was:
>>>> I1004 05:17:48.192611 28113 alter_table-randomized-test.cc:481]
>>>> Dropping range partition: [805306356, 872415219) resulting partitions:
>>>> (134217726, 201326589], (268435452, 335544315], (335544315, 402653178],
>>>> (402653178, 469762041], (536870904, 603979767], (671088630, 738197493],
>>>> (738197493, 805306356], (939524082, 1006632945], (1006632945, 1073741808],
>>>> (1275068397, 1342177260], (1342177260, 1409286123], (1409286123,
>>>> 1476394986], (1610612712, 1677721575], (1879048164, 1946157027],
>>>> (2013265890, 2080374753], (2080374753, 2147483616)
>>>> I1004 05:17:48.193013 28113 alter_table-randomized-test.cc:406]
>>>> Committing Alterations
>>>>
>>>> The whole log is available here:
>>>> https://gist.githubusercontent.com/toddlipcon/466976caf973f4
>>>> 96885da9efc2f7246c/raw/f9baf418dad4ad07f33961b131c86e8480381
>>>> 5a8/alter_table-randomized-test.txt
>>>>
>>>> Any ideas what might be causing this out-of-order result? Is the test
>>>> making some incorrect assumptions or might we have a bug?
>>>>
>>>> -Todd
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Todd Lipcon
>>>> Software Engineer, Cloudera
>>>>
>>>
>>>
>>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>



-- 
Todd Lipcon
Software Engineer, Cloudera

Reply via email to