[jira] [Resolved] (HBASE-27637) Zero length value would cause value compressor read nothing and not advance the position of the InputStream

2023-02-14 Thread Duo Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang resolved HBASE-27637.
---
Fix Version/s: 2.6.0, 3.0.0-alpha-4, 2.5.4
 Hadoop Flags: Reviewed
   Resolution: Fixed

Pushed to branch-2.5+.

Thanks all for helping and reviewing!

> Zero length value would cause value compressor read nothing and not advance 
> the position of the InputStream
> ---
>
> Key: HBASE-27637
> URL: https://issues.apache.org/jira/browse/HBASE-27637
> Project: HBase
> Issue Type: Bug
> Components: dataloss, wal
> Reporter: Duo Zhang
> Assignee: Duo Zhang
> Priority: Critical
> Fix For: 2.6.0, 3.0.0-alpha-4, 2.5.4
>
>
> This is a code snippet from the discussion of HBASE-27073
> {code}
>   public static void main(String[] args) throws Exception {
>     // Set up a compression context that uses the GZ codec.
>     CompressionContext ctx =
>       new CompressionContext(LRUDictionary.class, false, false, true, Compression.Algorithm.GZ);
>     ValueCompressor compressor = ctx.getValueCompressor();
>     // Compress a zero-length value.
>     byte[] compressed = compressor.compress(new byte[0], 0, 0);
>     System.out.println("compressed length: " + compressed.length);
>     ByteArrayInputStream bis = new ByteArrayInputStream(compressed);
>     // Decompress into a zero-length output buffer.
>     int read = compressor.decompress(bis, compressed.length, new byte[0], 0, 0);
>     System.out.println("read length: " + read);
>     // Number of bytes consumed from the stream so far.
>     System.out.println("position: " + (compressed.length - bis.available()));
>   }
> {code}
> And the output is
> {noformat}
> compressed length: 20
> read length: 0
> position: 0
> {noformat}
> So it turns out that when compressing, an empty array still generates some 
> output bytes, but when reading we skip reading anything if we find that the 
> output length is zero. The next read from the stream therefore starts at the 
> wrong position...
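
The mismatch described above can be reproduced without HBase. The following is a minimal sketch that uses the JDK's `java.util.zip` gzip streams as a stand-in for the WAL value compressor; the class and method names (`ZeroLengthValueDemo`, `skippingRead`, `consumingRead`) are hypothetical, and `consumingRead` is only one possible remedy, not necessarily the change that was committed for this issue.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class ZeroLengthValueDemo {
  /** Gzip-compresses a value; even an empty value yields header and trailer bytes. */
  static byte[] compress(byte[] value) {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
      gz.write(value);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bos.toByteArray();
  }

  /** Mirrors the reported behavior: a zero-length output short-circuits and consumes nothing. */
  static int skippingRead(InputStream in, int compressedLen, byte[] out) {
    if (out.length == 0) {
      return 0; // the stream position is left untouched
    }
    return consumingRead(in, compressedLen, out);
  }

  /** Always consumes exactly compressedLen bytes from the stream before decompressing. */
  static int consumingRead(InputStream in, int compressedLen, byte[] out) {
    try {
      byte[] buf = new byte[compressedLen];
      new DataInputStream(in).readFully(buf);
      int total = 0;
      try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(buf))) {
        while (total < out.length) {
          int r = gz.read(out, total, out.length - total);
          if (r < 0) {
            break;
          }
          total += r;
        }
      }
      return total;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    byte[] compressed = compress(new byte[0]);
    System.out.println("compressed length: " + compressed.length);

    ByteArrayInputStream a = new ByteArrayInputStream(compressed);
    skippingRead(a, compressed.length, new byte[0]);
    System.out.println("skipping reader position: " + (compressed.length - a.available()));

    ByteArrayInputStream b = new ByteArrayInputStream(compressed);
    consumingRead(b, compressed.length, new byte[0]);
    System.out.println("consuming reader position: " + (compressed.length - b.available()));
  }
}
```

With the skipping reader the stream stays at position 0 although the writer emitted bytes, so whatever is read next would be interpreted from the wrong offset; the consuming reader leaves the stream at the end of the compressed value.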



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] Deprecated the 'hbase.regionserver.hlog.reader.impl' and 'hbase.regionserver.hlog.writer.impl' configurations

2023-02-14 Thread 唐天航
Can we keep these configs and add a new one for the replication reader?

As @Andrew said, one of the things we are doing is using BookKeeper for WAL
storage, and this depends on the configurations above. We are developing on
branch-1, so it is too early to talk about contributing this to the
community, and I am not sure what the community's attitude is: would it be
willing to accept WAL implementations based on other storage in the future?

Thanks.


张铎(Duo Zhang) wrote on Tue, Feb 14, 2023 at 21:20:

> Thanks Andrew for the feedback.
>
> If there are no other concerns, I will start the refactoring work and
> deprecate these two configs.
>
> Thanks.


[jira] [Created] (HBASE-27642) Expose master startup status via JMX

2023-02-14 Thread Xiaolin Ha (Jira)
Xiaolin Ha created HBASE-27642:
--

 Summary: Expose master startup status via JMX
 Key: HBASE-27642
 URL: https://issues.apache.org/jira/browse/HBASE-27642
 Project: HBase
  Issue Type: Improvement
  Components: master
Reporter: Xiaolin Ha


As described in HBASE-21521 by [~apurtell], add an internal API to the master 
for tracking startup progress, and expose this information via JMX.





[jira] [Created] (HBASE-27641) Verify replication excessive false positive bad rows

2023-02-14 Thread Hernan Gelaf-Romer (Jira)
Hernan Gelaf-Romer created HBASE-27641:
--

 Summary: Verify replication excessive false positive bad rows
 Key: HBASE-27641
 URL: https://issues.apache.org/jira/browse/HBASE-27641
 Project: HBase
  Issue Type: Improvement
  Components: mapreduce, Replication
Reporter: Hernan Gelaf-Romer
Assignee: Hernan Gelaf-Romer


Verify replication can generate a lot of `BADROWS` results when the row being 
compared is particularly hot at the time of the comparison. Replication lag 
can then cause a transient mismatch between the source and sink results. 

We could add some configurable re-compare mechanism that will make verify 
replication less susceptible to falsely reporting `BADROWS` when under 
significant write load. These re-compares can be done asynchronously so as to 
not significantly slow down the execution time of the job.
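
A re-compare of this kind could look roughly like the sketch below. It is not the VerifyReplication implementation: the class name `Recompare`, both method names, and the use of `Supplier<byte[]>` to stand in for "read this row from the source or sink cluster" are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.Supplier;

public class Recompare {
  /**
   * Re-reads both sides up to maxAttempts times, sleeping between attempts.
   * Returns true as soon as the two results match, false if they never do.
   */
  static boolean matchesWithinRetries(Supplier<byte[]> source, Supplier<byte[]> sink,
      int maxAttempts, long sleepMillis) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      if (Arrays.equals(source.get(), sink.get())) {
        return true;
      }
      if (attempt < maxAttempts) {
        try {
          Thread.sleep(sleepMillis); // give replication time to catch up
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return false; // treat interruption as "still mismatched"
        }
      }
    }
    return false;
  }

  /** Runs the re-compare on the given executor so the main scan is not blocked. */
  static CompletableFuture<Boolean> matchesAsync(Supplier<byte[]> source, Supplier<byte[]> sink,
      int maxAttempts, long sleepMillis, Executor pool) {
    return CompletableFuture.supplyAsync(
      () -> matchesWithinRetries(source, sink, maxAttempts, sleepMillis), pool);
  }
}
```

In such a design a row would only be counted as a bad row once the returned future completes with `false`; the attempt count and sleep interval would be the configurable part.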





[jira] [Created] (HBASE-27640) Optimize writes of zero length values in compressed WALs

2023-02-14 Thread Andrew Kyle Purtell (Jira)
Andrew Kyle Purtell created HBASE-27640:
---

 Summary: Optimize writes of zero length values in compressed WALs
 Key: HBASE-27640
 URL: https://issues.apache.org/jira/browse/HBASE-27640
 Project: HBase
  Issue Type: Sub-task
Reporter: Andrew Kyle Purtell
 Fix For: 2.6.0, 3.0.0-alpha-4


If we unconditionally use the compressor to "write" 0 bytes, the compression 
codec still emits overhead: the Hadoop compression stream header and the 
compression bitstream header. All of that should be skipped so that truly no 
compressed value data is written when the value is empty. 
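
The idea can be sketched with a simple length-prefixed framing. This is an illustration only: the framing (a 4-byte compressed length followed by the compressed bytes), the class name `EmptyValueFraming`, and the use of JDK gzip streams are assumptions and do not reflect the actual WAL cell encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class EmptyValueFraming {
  /** Writes a length-prefixed value; empty values bypass the codec entirely. */
  static void writeValue(DataOutputStream out, byte[] value) {
    try {
      if (value.length == 0) {
        out.writeInt(0); // zero compressed length, no codec header or trailer
        return;
      }
      ByteArrayOutputStream bos = new ByteArrayOutputStream();
      try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
        gz.write(value);
      }
      out.writeInt(bos.size());
      bos.writeTo(out);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Reads one value; a zero compressed length means an empty value and nothing more to read. */
  static byte[] readValue(DataInputStream in) {
    try {
      int compressedLen = in.readInt();
      if (compressedLen == 0) {
        return new byte[0]; // symmetric with the writer: nothing was written
      }
      byte[] compressed = new byte[compressedLen];
      in.readFully(compressed);
      ByteArrayOutputStream bos = new ByteArrayOutputStream();
      try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
        byte[] buf = new byte[4096];
        int r;
        while ((r = gz.read(buf)) != -1) {
          bos.write(buf, 0, r);
        }
      }
      return bos.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Because the writer and reader agree that a zero length means "no bytes follow", the reader's early return is safe here, unlike the case where the writer has already emitted codec framing for the empty value.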





Re: [DISCUSS] Deprecated the 'hbase.regionserver.hlog.reader.impl' and 'hbase.regionserver.hlog.writer.impl' configurations

2023-02-14 Thread Duo Zhang
Thanks Andrew for the feedback.

If there are no other concerns, I will start the refactoring work and
deprecate these two configs.

Thanks.

Andrew Purtell wrote on Fri, Feb 10, 2023 at 23:01:

> As you point out these configuration settings were introduced when we
> migrated from SequenceFile based WALs to the protobuf format. We needed to
> give users a way to manually migrate, although, arguably, an auto migration
> would have been better.
>
> In theory these settings allow users to implement their own WAL readers
> and writers. However I do not believe users will do this. The WAL is
> critical for performance and correctness. If anyone is contemplating such
> wizard level changes they can patch the code themselves. It’s fine to
> document these settings as deprecated for sure, and I think ok also to
> claim them unsupported and ignored.
>
> >
> > On Feb 10, 2023, at 3:41 AM, 张铎 wrote:
> >
> > While discussing how to deal with the problem in HBASE-27621, we proposed
> > to introduce two types of WAL readers, one for WAL splitting and the other
> > for WAL replication. Replication needs to tail the WAL file which is
> > currently being written, so its logic is much more complicated. We do not
> > want to affect WAL splitting logic and performance while tweaking the
> > replication related things, as all HBase users need WAL splitting but not
> > everyone needs replication.
> >
> > But when reviewing the related code, I found that we have two
> > configurations for specifying the WAL reader class and WAL writer class,
> > which indicates that we could only have one implementation for the WAL
> > reader. They are 'hbase.regionserver.hlog.reader.impl' and
> > 'hbase.regionserver.hlog.writer.impl'.
> >
> > We mentioned these two configurations several times in our ref guide.
> >
> > HBase 2.0+ can no longer read Sequence File based WAL file.
> >
> >> HBase can no longer read the deprecated WAL files written in the Apache
> >> Hadoop Sequence File format. The hbase.regionserver.hlog.reader.impl and
> >> hbase.regionserver.hlog.writer.impl configuration entries should be set to
> >> use the Protobuf based WAL reader / writer classes. This implementation has
> >> been the default since HBase 0.96, so legacy WAL files should not be a
> >> concern for most downstream users.
> >
> > Configure WAL encryption.
> >
> >> Configure WAL encryption in every RegionServer’s hbase-site.xml, by setting
> >> the following properties. You can include these in the HMaster’s
> >> hbase-site.xml as well, but the HMaster does not have a WAL and will not
> >> use them.
> >> <property>
> >>   <name>hbase.regionserver.hlog.reader.impl</name>
> >>   <value>org.apache.hadoop.hbase.regionserver.wal.SecureProtobufLogReader</value>
> >> </property>
> >> <property>
> >>   <name>hbase.regionserver.hlog.writer.impl</name>
> >>   <value>org.apache.hadoop.hbase.regionserver.wal.SecureProtobufLogWriter</value>
> >> </property>
> >> <property>
> >>   <name>hbase.regionserver.wal.encryption</name>
> >>   <value>true</value>
> >> </property>
> >
> > So in fact, leaving encryption aside, the configurations are useless, as
> > we no longer support reading sequence file format WALs and the only valid
> > options are the protobuf based reader and writer. And for security, I think
> > the configuration is redundant: if encryption is enabled, we should use
> > SecureProtobufLogWriter for writing, no matter what the configuration value
> > is. And for readers, I do not think we should use a configuration to
> > specify the implementation; we should detect whether the file is encrypted
> > and choose a secure or normal reader to read the file.
> >
> > So here, I propose we just deprecate these two configurations because they
> > are useless now.
> >
> > Thanks.
>
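
The proposal to pick the reader from the file itself rather than from configuration can be sketched as follows. Everything here is hypothetical: the header layout (four magic bytes followed by a one-byte "encrypted" flag), the `DEMO` magic, and the names `ReaderSelection`, `ReaderKind`, `header` and `chooseReader` are made up for illustration and are not the real WAL header format or HBase API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ReaderSelection {
  enum ReaderKind { PLAIN, SECURE }

  // Hypothetical magic bytes marking the start of a file.
  private static final byte[] MAGIC = "DEMO".getBytes(StandardCharsets.US_ASCII);

  /** Builds the hypothetical header: magic bytes followed by an "encrypted" flag. */
  static byte[] header(boolean encrypted) {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bos)) {
      out.write(MAGIC);
      out.writeBoolean(encrypted);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bos.toByteArray();
  }

  /** Chooses the reader from what the file says about itself, not from configuration. */
  static ReaderKind chooseReader(byte[] fileStart) {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(fileStart))) {
      byte[] magic = new byte[MAGIC.length];
      in.readFully(magic);
      if (!Arrays.equals(magic, MAGIC)) {
        throw new IllegalArgumentException("unrecognized file header");
      }
      return in.readBoolean() ? ReaderKind.SECURE : ReaderKind.PLAIN;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

With detection of this kind, a cluster that holds both encrypted and unencrypted files can read each one correctly without a reader setting.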


[jira] [Resolved] (HBASE-27630) hbase-spark bulkload stage directory limited to hdfs only

2023-02-14 Thread Peter Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Somogyi resolved HBASE-27630.
---
Resolution: Fixed

Merged to master branch in the hbase-connectors repository. Thanks for the 
patch [~sergey.soldatov]!

> hbase-spark bulkload stage directory limited to hdfs only
> -
>
> Key: HBASE-27630
> URL: https://issues.apache.org/jira/browse/HBASE-27630
> Project: HBase
> Issue Type: Bug
> Components: spark
> Affects Versions: connector-1.0.0
> Reporter: Sergey Soldatov
> Assignee: Sergey Soldatov
> Priority: Major
> Fix For: hbase-connectors-1.1.0
>
>
> It's impossible to point the staging directory for bulkload operations in the 
> hbase-spark connector at any filesystem other than HDFS. That can be a 
> problem for deployments where hbase.rootdir points to cloud storage: in this 
> case, an additional copy task from HDFS to cloud storage is required before 
> loading HFiles into HBase.


