[
https://issues.apache.org/jira/browse/KUDU-3426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xixu Wang updated KUDU-3426:
----------------------------
Description:
Here is a segment of Python code. It uses 'kudu cluster ksck' to get the
tablet summaries in JSON format. Decoding the result as UTF-8 may fail.
!image-2022-12-06-09-54-28-808.png!
The following error is raised:
{code:java}
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 21917:
invalid start byte{code}
Using CP1252 to decode the result also failed:
{code:java}
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 113857:
character maps to <undefined>{code}
Only Latin1 decodes the result without an error, since it maps every byte
value to a character.
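The three outcomes can be reproduced without a cluster. This is a minimal sketch, assuming only the two byte values named in the error messages (0x80 and 0x81); the helper name try_decode is made up for illustration.

```python
def try_decode(raw, encoding):
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# The two byte values reported in the error messages.
samples = [b"\x80", b"\x81"]

for raw in samples:
    for encoding in ("utf-8", "cp1252", "latin-1"):
        decoded = try_decode(raw, encoding)
        status = "fails" if decoded is None else "ok, gives %r" % decoded
        print("%r as %s: %s" % (raw, encoding, status))
```

Latin1 succeeding only means that no byte is rejected. It does not mean the bytes were ever Latin1 text.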
I found that the text in the fields partition_key_start/partition_key_end
causes this problem.
Running the command 'kudu cluster ksck $master_addrs
--ksck_format=json_compact --sections=TABLET_SUMMARIES' gives the following
result for partition_key_start/partition_key_end:
{code:java}
"partition":{"partition_key_start":"�\u0000K;","partition_key_end":"�\u0000K<"}{code}
The definition of PartitionPB is:
{code:java}
// The serialized format of a Kudu table partition.
message PartitionPB {
  // The hash buckets of the partition. The number of hash buckets must match
  // the number of hash bucket components in the partition's schema.
  repeated int32 hash_buckets = 1 [packed = true];
  // The encoded start partition key (inclusive).
  optional bytes partition_key_start = 2;
  // The encoded end partition key (exclusive).
  optional bytes partition_key_end = 3;
}
{code}
As we can see, the type of partition_key_start and partition_key_end is bytes,
so the values are arbitrary binary data rather than text in any particular
encoding. On some machines the result decodes as UTF8, on some it needs
CP1252, and on others only Latin1 works.
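One possible reason for the per-machine difference, which is my guess and not confirmed, is that the Python side lets subprocess decode the output with the locale's default encoding. The sketch below keeps the output as bytes so no implicit decoding happens. The function names and the runner parameter are made up for illustration; the flags are the ones from the command above.

```python
import subprocess


def build_ksck_command(master_addrs):
    """Build the argv list for the ksck call used in this report."""
    return [
        "kudu", "cluster", "ksck", master_addrs,
        "--ksck_format=json_compact",
        "--sections=TABLET_SUMMARIES",
    ]


def run_ksck_raw(master_addrs, runner=subprocess.run):
    """Run ksck and return its stdout as undecoded bytes.

    Neither text=True nor encoding=... is passed, so Python does not
    apply the locale's preferred encoding to the output.
    """
    completed = runner(
        build_ksck_command(master_addrs),
        stdout=subprocess.PIPE,
        check=False,  # assumption: a non-zero exit should not raise here
    )
    return completed.stdout
```

The caller then decides explicitly how to decode the bytes, which makes the behaviour the same on every machine.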
I offer a workaround for this problem: using a flag,
ignore_replica_status_info, to avoid decoding the text of Partition. See:
https://gerrit.cloudera.org/c/19320/
Can anyone offer a better way to solve the problem?
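One Python-side option, as a sketch only: decode with the surrogateescape error handler, which keeps undecodable bytes recoverable, and turn the key fields back into bytes after parsing. The sample input assumes the unprintable byte in the output above is 0x80 (taken from the UTF-8 error message) and that the key bytes are written raw apart from the \u0000 escape. The function names are made up for illustration.

```python
import json


def parse_ksck_json(raw):
    """Parse ksck JSON bytes without failing on non-UTF-8 bytes.

    Bytes that are not valid UTF-8 become lone surrogate code points,
    which can be converted back to the original bytes later.
    """
    text = raw.decode("utf-8", errors="surrogateescape")
    return json.loads(text)


def key_bytes(value):
    """Convert a parsed key string back to the raw key bytes."""
    return value.encode("utf-8", errors="surrogateescape")


# Assumed sample: 0x80 stands in for the byte shown as an unprintable character.
raw = (b'{"partition":{"partition_key_start":"\x80\\u0000K;",'
       b'"partition_key_end":"\x80\\u0000K<"}}')

doc = parse_ksck_json(raw)
start = key_bytes(doc["partition"]["partition_key_start"])
end = key_bytes(doc["partition"]["partition_key_end"])
print(start.hex(), end.hex())  # 80004b3b 80004b3c
```

This only moves the problem to the consumer. Emitting the bytes fields in an unambiguous text form on the ksck side would still be cleaner.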
> Decode the result in json format of ksck maybe fail
> ---------------------------------------------------
>
> Key: KUDU-3426
> URL: https://issues.apache.org/jira/browse/KUDU-3426
> Project: Kudu
> Issue Type: Bug
> Reporter: Xixu Wang
> Priority: Major
> Attachments: image-2022-12-06-09-54-28-808.png
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)