[
https://issues.apache.org/jira/browse/HADOOP-18521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Steve Loughran updated HADOOP-18521:
------------------------------------
Description:
AbfsInputStream.close() can trigger the return of buffers used for
active prefetch GET requests into the ReadBufferManager free buffer pool.
A subsequent prefetch by a different stream in the same process may acquire
this same buffer, risking corruption of that stream's prefetched
data, data which may then be returned to the other thread.
The full analysis is in the document attached to this JIRA.
h2. Emergency fix through site configuration
On affected releases without the fix (3.3.2 through 3.3.4), the bug can be
avoided by disabling all prefetching:
{code:java}
fs.azure.readaheadqueue.depth = 0
{code}
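In site configuration, the same setting goes into {{core-site.xml}} using the standard Hadoop property syntax; a minimal fragment:

```xml
<property>
  <name>fs.azure.readaheadqueue.depth</name>
  <value>0</value>
  <description>Disable ABFS read-ahead prefetching (HADOOP-18521 mitigation).</description>
</property>
```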
h2. Automated probes for risk of exposure
The [cloudstore|https://github.com/steveloughran/cloudstore] diagnostics JAR
has a command
[safeprefetch|https://github.com/steveloughran/cloudstore/blob/trunk/src/main/site/safeprefetch.md]
which probes an ABFS client for the vulnerability. It does this through
{{PathCapabilities.hasPathCapability()}} probes, and can be invoked on the
command line to validate the version/configuration.
Consult [the
source|https://github.com/steveloughran/cloudstore/blob/trunk/src/main/java/org/apache/hadoop/fs/store/abfs/SafePrefetch.java#L96]
to see how to do this programmatically.
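Programmatically, the probe reduces to a single {{hasPathCapability()}} call on the filesystem instance. A minimal sketch, assuming Hadoop 3.3+ on the classpath; the capability string below is an assumption taken from the fix and should be verified against the linked SafePrefetch source before relying on it:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of a programmatic safe-prefetch probe. The capability string is
// an assumption here -- verify it against the linked SafePrefetch source.
public class SafePrefetchProbe {
  public static boolean isPrefetchSafe(Path abfsPath, Configuration conf)
      throws IOException {
    FileSystem fs = abfsPath.getFileSystem(conf);
    // True if this client either carries the HADOOP-18521 fix or has
    // prefetching disabled; unknown capabilities return false by default.
    return fs.hasPathCapability(abfsPath, "fs.azure.capability.readahead.safe");
  }
}
```

If the probe returns false, fall back to setting {{fs.azure.readaheadqueue.depth}} to 0 before opening any input streams.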
Note also that the tool's
[mkcsv|https://github.com/steveloughran/cloudstore/blob/trunk/src/main/site/mkcsv.md]
command can be used to generate the multi-GB CSV files needed to trigger the
condition and so verify that the issue exists.
h2. Microsoft Announcement
{code:java}
From: Sneha Vijayarajan
Subject: RE: Alert ! ABFS Driver - Possible data corruption on read path
Hi,
One of the contributions made to ABFS Driver has a potential to cause data
corruption on read
path.
Please check if the below change is part of any of your releases:
HADOOP-17156. Purging the buffers associated with input streams during close()
by mukund-thakur
· Pull Request #3285 · apache/hadoop (github.com)
RCA: Scenario that can lead to data corruption:
Driver allocates a bunch of prefetch buffers at init and are shared by
different instances of
InputStreams created within that process. These prefetch buffers could be in 3
stages –
* In ReadAheadQueue : request for prefetch logged
* In ProgressList : Work has begun to talk to backend store to get the
requested data
* In CompletedList: Prefetch data is now available for consumption.
When multiple InputStreams have prefetch buffers across these states and close
is triggered on
any InputStream/s, the commit above will remove buffers allotted to respective
stream from all
the 3 lists and also declare that the buffers are available for new prefetches
to happen, but
no action to cancel/prevent buffer from being updated with ongoing network
request is done.
Data corruption can happen if one such freed up buffer from InProgressList is
allotted to a new
prefetch request and then the buffer got filled up with the previous stream’s
network request.
Mitigation: If this change is present in any release, kindly help communicate
to your customers
to immediately set below config to 0 in their clusters. This will disable
prefetches which can
have an impact on perf but will prevent the possibility of data corruption.
fs.azure.readaheadqueue.depth: Sets the readahead queue depth in
AbfsInputStream. In case the
set value is negative the read ahead queue depth will be set as
Runtime.getRuntime().availableProcessors(). By default the value will be 2. To
disable
readaheads, set this value to 0. If your workload is doing only random reads
(non-sequential)
or you are seeing throttling, you may try setting this value to 0.
Next steps: We are getting help to post the notifications for this in Apache
groups. Work on
HotFix is also ongoing. Will update this thread once the change is checked in.
Please reach out for any queries or clarifications.
Thanks,
Sneha Vijayarajan
{code}
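The failure sequence described in the announcement can be reduced to a small, self-contained sketch. This is plain Java with hypothetical names, not the real ABFS code: a buffer is returned to the free pool on close() while an in-flight request still references it, then handed to a second stream, whose data the stale request overwrites.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Self-contained sketch of the race described above, with hypothetical
// names; no ABFS classes are used. A buffer freed on close() while a
// GET still references it is reissued to a second stream, whose data
// the stale request then overwrites.
public class BufferReuseSketch {
  static final Queue<byte[]> FREE_POOL = new ArrayDeque<>();

  // Returns the byte that stream B observes in its own buffer.
  static char demo() {
    byte[] buf = new byte[1];

    // Stream A starts a prefetch: the pending GET holds a reference.
    byte[] inFlightRef = buf;

    // Stream A is closed; without the fix, close() returns the buffer
    // to the free pool but does not cancel the pending GET.
    FREE_POOL.add(buf);

    // Stream B acquires the recycled buffer and fills it with its data.
    byte[] bufB = FREE_POOL.poll();
    bufB[0] = 'B';

    // Stream A's stale GET now completes, writing into the same memory.
    inFlightRef[0] = 'A';

    // Stream B reads back corrupted data: 'A' instead of 'B'.
    return (char) bufB[0];
  }

  public static void main(String[] args) {
    System.out.println("stream B sees: " + demo());
  }
}
```

The fix in HADOOP-18521 avoids this by not returning buffers with outstanding requests to the pool.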
was:
AbfsInputStream.close() can trigger the return of buffers used for active
prefetch GET requests into the ReadBufferManager free buffer pool.
A subsequent prefetch by a different stream in the same process may acquire
this same buffer. This can lead to risk of corruption of its own prefetched
data, data which may then be returned to that other thread.
h2. Emergency fix through site configuration
On releases without the fix for this (3.3.2+), the bug can be avoided by
disabling all prefetching
{code}
fs.azure.readaheadqueue.depth = 0
{code}
Full analysis in attached document.
h2. Microsoft Announcement
{code}
From: Sneha Vijayarajan
Subject: RE: Alert ! ABFS Driver - Possible data corruption on read path
Hi,
One of the contributions made to ABFS Driver has a potential to cause data
corruption on read
path.
Please check if the below change is part of any of your releases:
HADOOP-17156. Purging the buffers associated with input streams during close()
by mukund-thakur
· Pull Request #3285 · apache/hadoop (github.com)
RCA: Scenario that can lead to data corruption:
Driver allocates a bunch of prefetch buffers at init and are shared by
different instances of
InputStreams created within that process. These prefetch buffers could be in 3
stages –
* In ReadAheadQueue : request for prefetch logged
* In ProgressList : Work has begun to talk to backend store to get the
requested data
* In CompletedList: Prefetch data is now available for consumption.
When multiple InputStreams have prefetch buffers across these states and close
is triggered on
any InputStream/s, the commit above will remove buffers allotted to respective
stream from all
the 3 lists and also declare that the buffers are available for new prefetches
to happen, but
no action to cancel/prevent buffer from being updated with ongoing network
request is done.
Data corruption can happen if one such freed up buffer from InProgressList is
allotted to a new
prefetch request and then the buffer got filled up with the previous stream’s
network request.
Mitigation: If this change is present in any release, kindly help communicate
to your customers
to immediately set below config to 0 in their clusters. This will disable
prefetches which can
have an impact on perf but will prevent the possibility of data corruption.
fs.azure.readaheadqueue.depth: Sets the readahead queue depth in
AbfsInputStream. In case the
set value is negative the read ahead queue depth will be set as
Runtime.getRuntime().availableProcessors(). By default the value will be 2. To
disable
readaheads, set this value to 0. If your workload is doing only random reads
(non-sequential)
or you are seeing throttling, you may try setting this value to 0.
Next steps: We are getting help to post the notifications for this in Apache
groups. Work on
HotFix is also ongoing. Will update this thread once the change is checked in.
Please reach out for any queries or clarifications.
Thanks,
Sneha Vijayarajan
{code}
h2. Automated probes for risk of exposure
The [cloudstore|https://github.com/steveloughran/cloudstore] diagnostics JAR
has a command
[safeprefetch|https://github.com/steveloughran/cloudstore/blob/trunk/src/main/site/safeprefetch.md]
which probes an abfs client for being vulnerable. It does this through
{{PathCapabilities.hasPathCapability()}} probes. It can be invoked on the
command line to validate the version/configuration
Consult [the
source|https://github.com/steveloughran/cloudstore/blob/trunk/src/main/java/org/apache/hadoop/fs/store/abfs/SafePrefetch.java#L96]
to see how to do this programmatically.
Note also that the tool's
[mkcsv|https://github.com/steveloughran/cloudstore/blob/trunk/src/main/site/mkcsv.md]
command can be used to generate the multi-GB CSV files needed to trigger the
condition and so verify that the issue exists.
> ABFS ReadBufferManager buffer sharing across concurrent HTTP requests
> ---------------------------------------------------------------------
>
> Key: HADOOP-18521
> URL: https://issues.apache.org/jira/browse/HADOOP-18521
> Project: Hadoop Common
> Issue Type: Bug
> Components: fs/azure
> Affects Versions: 3.3.2, 3.3.3, 3.3.4
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Priority: Critical
> Labels: pull-request-available
> Fix For: 3.3.5
>
> Attachments: HADOOP-18521 ABFS ReadBufferManager buffer sharing
> across concurrent HTTP requests.pdf
>
>
> AbfsInputStream.close() can trigger the return of buffers used for
> active prefetch GET requests into the ReadBufferManager free buffer pool.
> A subsequent prefetch by a different stream in the same process may acquire
> this same buffer, risking corruption of that stream's prefetched
> data, data which may then be returned to the other thread.
> The full analysis is in the document attached to this JIRA.
> h2. Emergency fix through site configuration
> On releases without the fix for this (3.3.2+), the bug can be avoided by
> disabling all prefetching
> {code:java}
> fs.azure.readaheadqueue.depth = 0
> {code}
> h2. Automated probes for risk of exposure
> The [cloudstore|https://github.com/steveloughran/cloudstore] diagnostics JAR
> has a command
> [safeprefetch|https://github.com/steveloughran/cloudstore/blob/trunk/src/main/site/safeprefetch.md]
> which probes an abfs client for being vulnerable. It does this through
> {{PathCapabilities.hasPathCapability()}} probes. It can be invoked on the
> command line to validate the version/configuration
> Consult [the
> source|https://github.com/steveloughran/cloudstore/blob/trunk/src/main/java/org/apache/hadoop/fs/store/abfs/SafePrefetch.java#L96]
> to see how to do this programmatically.
> Note also that the tool's
> [mkcsv|https://github.com/steveloughran/cloudstore/blob/trunk/src/main/site/mkcsv.md]
> command can be used to generate the multi-GB CSV files needed to trigger the
> condition and so verify that the issue exists.
> h2. Microsoft Announcement
> {code:java}
> From: Sneha Vijayarajan
> Subject: RE: Alert ! ABFS Driver - Possible data corruption on read path
> Hi,
> One of the contributions made to ABFS Driver has a potential to cause data
> corruption on read
> path.
> Please check if the below change is part of any of your releases:
> HADOOP-17156. Purging the buffers associated with input streams during
> close() by mukund-thakur
> · Pull Request #3285 · apache/hadoop (github.com)
> RCA: Scenario that can lead to data corruption:
> Driver allocates a bunch of prefetch buffers at init and are shared by
> different instances of
> InputStreams created within that process. These prefetch buffers could be in
> 3 stages –
> * In ReadAheadQueue : request for prefetch logged
> * In ProgressList : Work has begun to talk to backend store to get the
> requested data
> * In CompletedList: Prefetch data is now available for consumption.
> When multiple InputStreams have prefetch buffers across these states and
> close is triggered on
> any InputStream/s, the commit above will remove buffers allotted to
> respective stream from all
> the 3 lists and also declare that the buffers are available for new
> prefetches to happen, but
> no action to cancel/prevent buffer from being updated with ongoing network
> request is done.
> Data corruption can happen if one such freed up buffer from InProgressList is
> allotted to a new
> prefetch request and then the buffer got filled up with the previous stream’s
> network request.
> Mitigation: If this change is present in any release, kindly help communicate
> to your customers
> to immediately set below config to 0 in their clusters. This will disable
> prefetches which can
> have an impact on perf but will prevent the possibility of data corruption.
> fs.azure.readaheadqueue.depth: Sets the readahead queue depth in
> AbfsInputStream. In case the
> set value is negative the read ahead queue depth will be set as
> Runtime.getRuntime().availableProcessors(). By default the value will be 2.
> To disable
> readaheads, set this value to 0. If your workload is doing only random reads
> (non-sequential)
> or you are seeing throttling, you may try setting this value to 0.
> Next steps: We are getting help to post the notifications for this in Apache
> groups. Work on
> HotFix is also ongoing. Will update this thread once the change is checked in.
> Please reach out for any queries or clarifications.
> Thanks,
> Sneha Vijayarajan
> {code}
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]