[ 
https://issues.apache.org/jira/browse/KUDU-3426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xixu Wang updated KUDU-3426:
----------------------------
    Description: 
The Python snippet in the attached image runs 'kudu cluster ksck' to get the 
tablet summaries in JSON format. Decoding the result as UTF-8 may fail. 

!image-2022-12-06-09-54-28-808.png!

It fails with the following error:
{code:java}
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 21917: 
invalid start byte{code}
Decoding the result as CP1252 also fails:
{code:java}
 UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 113857: 
character maps to <undefined>{code}
Only Latin-1 can decode the result, because it assigns a character to every 
byte value.
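
The three outcomes above can be reproduced without a cluster. The following is a minimal sketch, assuming a hypothetical byte string that merely contains the two bytes named in the error messages (0x80 and 0x81); it is not output captured from ksck, and the helper name try_decode is illustrative only.

```python
# Hypothetical sample bytes, chosen to contain 0x80 and 0x81 as in the
# error messages above. This is NOT real ksck output.
raw = b'\x80\x00K;\x81'


def try_decode(data: bytes, encoding: str) -> str:
    """Return the decoded text, or the codec's error reason if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as e:
        return f"error: {e.reason}"


# Try the three encodings mentioned in this issue.
results = {enc: try_decode(raw, enc) for enc in ("utf-8", "cp1252", "latin-1")}

for enc, outcome in results.items():
    print(enc, repr(outcome))
```

UTF-8 rejects 0x80 as a start byte, CP1252 has no character assigned to 0x81, and Latin-1 succeeds only because it maps each byte to the code point with the same value.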

 

I found that the text in the partition_key_start/partition_key_end fields 
causes this problem.

 

When executing the command 'kudu cluster ksck $master_addrs 
-ksck_format=json_compact -sections=TABLET_SUMMARIES', the result 
of partition_key_start/partition_key_end is:

 
{code:java}
"partition":{"partition_key_start":"�\u0000K;","partition_key_end":"�\u0000K<"}{code}
 

 

The definition of PartitionPB is:
{code:java}
// The serialized format of a Kudu table partition.
message PartitionPB {
  // The hash buckets of the partition. The number of hash buckets must match
  // the number of hash bucket components in the partition's schema.
  repeated int32 hash_buckets = 1 [packed = true];
  // The encoded start partition key (inclusive).
  optional bytes partition_key_start = 2;
  // The encoded end partition key (exclusive).
  optional bytes partition_key_end = 3;
}{code}


As we can see, partition_key_start and partition_key_end are of type bytes, so 
their content is binary data rather than text in any particular encoding. On 
some machines the result can be decoded as UTF-8, on some as CP1252, and on 
others only as Latin-1.
I propose a workaround: add a flag, ignore_replica_status_info, to avoid 
decoding the text of Partition. See: 
https://gerrit.cloudera.org/c/19320/
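
For comparison, a consumer-side approach might look like the sketch below. It assumes the JSON output embeds the key bytes unescaped, as in the sample above, and that Latin-1 is used only as a lossless byte-to-character mapping. The input bytes and the function name partition_keys are hypothetical, not taken from a real cluster or from the Kudu code base.

```python
import json

# Hypothetical stand-in for the ksck output: the first key byte is assumed to
# be 0x80, and \u0000 is a JSON escape as in the sample above.
raw_output = (
    b'{"partition":{"partition_key_start":"\x80\\u0000K;",'
    b'"partition_key_end":"\x80\\u0000K<"}}'
)


def partition_keys(output: bytes) -> tuple[bytes, bytes]:
    """Parse the JSON and return the start and end partition keys as bytes."""
    # Latin-1 never fails: every byte maps to the code point of the same value.
    doc = json.loads(output.decode("latin-1"))
    partition = doc["partition"]
    # Encoding back with Latin-1 recovers the original key bytes.
    start = partition["partition_key_start"].encode("latin-1")
    end = partition["partition_key_end"].encode("latin-1")
    return start, end


start_key, end_key = partition_keys(raw_output)
print(start_key, end_key)
```

One caveat with this sketch: any field that holds genuine UTF-8 text with non-ASCII characters would be mangled by the Latin-1 decode, so it only suits fields known to carry binary data.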

Can anyone suggest a better way to solve this problem?
 

 

> Decode the result in json format of ksck maybe fail
> ---------------------------------------------------
>
>                 Key: KUDU-3426
>                 URL: https://issues.apache.org/jira/browse/KUDU-3426
>             Project: Kudu
>          Issue Type: Bug
>            Reporter: Xixu Wang
>            Priority: Major
>         Attachments: image-2022-12-06-09-54-28-808.png
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
