[
https://issues.apache.org/jira/browse/HBASE-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Benoit Sigoure updated HBASE-2323:
----------------------------------
Description:
I'm trying to use {{RegexStringComparator}} in conjunction with {{RowFilter}}.
One of my row keys contained the byte 0xA, which turns out to be the ASCII code
for the newline character (\n). When the row key is converted to a string in
order to use the regexp facility of the Java standard library, it becomes a
string containing two lines and my regexp does not match.
I believe the solution is to compile the regexp with the {{DOTALL}} flag.
Luckily, this flag can be "passed" by the client by prefixing the regexp with
{{(?s)}} so people working with an older version of HBase can work around this
issue without having to upgrade.
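For illustration, here is a minimal standalone sketch of the workaround. It is not HBase code: the class {{DotallDemo}} and the helper {{matches}} are hypothetical, and the sketch assumes the comparator ends up calling {{Matcher.find()}} on the decoded row key.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

class DotallDemo {
    // Hypothetical helper: decode the row key one byte per char, then search it.
    static boolean matches(String regex, byte[] rowKey) {
        String key = new String(rowKey, StandardCharsets.ISO_8859_1);
        return Pattern.compile(regex).matcher(key).find();
    }

    public static void main(String[] args) {
        // Row key containing the byte 0x0A (newline) between 'a' and 'b'.
        byte[] rowKey = {'a', 0x0A, 'b'};

        // By default '.' does not match line terminators, so this fails.
        System.out.println(matches("^a.b$", rowKey));     // false

        // The embedded (?s) flag enables DOTALL from the client side.
        System.out.println(matches("(?s)^a.b$", rowKey)); // true
    }
}
```

The {{(?s)}} prefix is equivalent to passing {{Pattern.DOTALL}} to {{Pattern.compile}}, which is why it works without any server-side change.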
Second problem: One of my row keys contained the sequence {{0x00 0x00 0x9D}}
({{0x9D}} = -99 when stored in a Java {{byte}}), but in {{compareTo}} the row
key is transformed into a {{String}} using {{Bytes.toString}}, which just
assumes that the byte array is a UTF-8 encoded string. Java "cleverly"
substituted the 0x9D byte with 0x3F (character '?', decimal 63). In my case, I
want to use the ISO-8859-1 encoding, as it preserves every byte when the byte
array is converted to a {{String}} and back to a byte array, unlike UTF-8 or
ASCII. Should we add a new method to {{RegexStringComparator}} to allow users
to specify their own {{Charset}} instance?
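To make the encoding issue concrete, here is a small sketch using only the Java standard library. The class {{CharsetRoundTrip}} and its helpers are hypothetical names for illustration, not a proposed HBase API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.regex.Pattern;

class CharsetRoundTrip {
    // Decode the bytes with the given charset, then encode them back.
    static byte[] roundTrip(byte[] in, Charset cs) {
        return new String(in, cs).getBytes(cs);
    }

    // Search a row key with a regexp after decoding it with the given charset.
    static boolean matches(String regex, byte[] rowKey, Charset cs) {
        return Pattern.compile(regex).matcher(new String(rowKey, cs)).find();
    }

    public static void main(String[] args) {
        byte[] rowKey = {0x00, 0x00, (byte) 0x9D};

        // ISO-8859-1 maps every byte to exactly one char, so nothing is lost.
        System.out.println(Arrays.equals(
            roundTrip(rowKey, StandardCharsets.ISO_8859_1), rowKey)); // true

        // A lone 0x9D is malformed UTF-8, so the decoder replaces it.
        System.out.println(Arrays.equals(
            roundTrip(rowKey, StandardCharsets.UTF_8), rowKey));      // false

        // With ISO-8859-1 the original byte value can be matched directly.
        String regex = "(?s)^\\x00\\x00\\x9D$";
        System.out.println(matches(regex, rowKey, StandardCharsets.ISO_8859_1)); // true
        System.out.println(matches(regex, rowKey, StandardCharsets.UTF_8));      // false
    }
}
```

A user-supplied {{Charset}} would let callers pick the lossless decoding shown here while leaving the current UTF-8 behavior as the default.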
was:
I'm trying to use {{RegexStringComparator}} in conjunction with {{RowFilter}}.
One of my row keys contained the byte 0xA, which turns out to be the ASCII code
for the newline character (\n). When the row key is converted to a string in
order to use the regexp facility of the Java standard library, it becomes a
string containing two lines and my regexp does not match.
I believe the solution is to compile the regexp with the {{DOTALL}} flag.
Luckily, this flag can be "passed" by the client by prefixing the regexp with
{{(?s)}} so people working with an older version of HBase can work around this
issue without having to upgrade.
Summary: filter.RegexStringComparator does not work with certain bytes
(was: filter.RegexStringComparator does not work in presence of the byte 0xA)
> filter.RegexStringComparator does not work with certain bytes
> -------------------------------------------------------------
>
> Key: HBASE-2323
> URL: https://issues.apache.org/jira/browse/HBASE-2323
> Project: Hadoop HBase
> Issue Type: Bug
> Components: filters
> Affects Versions: 0.20.3
> Reporter: Benoit Sigoure
> Assignee: Benoit Sigoure
>
> I'm trying to use {{RegexStringComparator}} in conjunction with
> {{RowFilter}}. One of my row keys contained the byte 0xA, which turns out to
> be the ASCII code for the newline character (\n). When the row key is
> converted to a string in order to use the regexp facility of the Java
> standard library, it becomes a string containing two lines and my regexp does
> not match.
> I believe the solution is to compile the regexp with the {{DOTALL}} flag.
> Luckily, this flag can be "passed" by the client by prefixing the regexp with
> {{(?s)}} so people working with an older version of HBase can work around
> this issue without having to upgrade.
> Second problem: One of my row keys contained the sequence {{0x00 0x00 0x9D}}
> ({{0x9D}} = -99 when stored in a Java {{byte}}), but in {{compareTo}} the row
> key is transformed into a {{String}} using {{Bytes.toString}}, which just
> assumes that the byte array is a UTF-8 encoded string. Java "cleverly"
> substituted the 0x9D byte with 0x3F (character '?', decimal 63). In my case,
> I want to use the ISO-8859-1 encoding, as it preserves every byte when the
> byte array is converted to a {{String}} and back to a byte array, unlike
> UTF-8 or ASCII. Should we add a new method to {{RegexStringComparator}} to
> allow users to specify their own {{Charset}} instance?